Evaluation of MathServe at CASC-20
MathServe took part in the demonstration division of the
CADE ATP System
Competition (CASC-20), 2005, in Tallin. For training the system we
collected
performance data of
the ATP systems Ep, Otter, SPASS, and Vampire on the
TPTP Library version 3.0.1. An
XML
representation of our data is available
here. Here
you can see how the MathServe system performed compared to the top ATP systems
Ep and Vampire:
System
|
Problems
Given
|
Problems
Solved
|
Percentage
of All
|
MathServe 0.62
|
660
|
392
|
59.4
|
Ep 0.9pre3
|
660
|
409
|
62.0
|
Vampire 8.0
|
540
|
430
|
65.2
|
The fact that MathServe did not outperform the systems Ep and Vampire is mainly
due to the significant improvement of the provers for the competition. These
improved systems were not available for MathServe at the time of the
competition. A detailed discussion of MathServe's perfomance on CASC and in
further experiments can be found in
this document
[PDF].
MathServe at CASC-J3
An improved version of MathServe took part in the demonstration division of the
2006 CASC
(CASC-J3). We added the systems DCTP, Paradox and Waldmeister, and improved MathServe's
handling of very large problems (>2MB). Here you can see how the MathServe
system performed compared to the top ATP systems Ep 0.99 and Vampire 8.1:
System
|
Problems
Given
|
Problems
Solved
|
Percentage
of All
|
MathServe 0.8
|
600
|
414
|
69
|
Ep 0.99
|
660
|
402
|
67
|
Vampire 8.1
|
500
|
412
|
68.7
|
Overall, MathServe solved more problems than
Ep and Vampire. However, on the problems of the SAT competition division,
MathServe performed poorly (11 problems solved out of 100). This is due to the fact that
MathServe is not told what to achieve for a given problem, so it simply
assumes that it has to determine unsatisfiability for CNF problems and
theoremhood for FOF problems. MathServe chooses reasoning systems according to
that assumption.
Contact:

Last modified: Fri Sep 22 16:58:03 CEST 2006