Evaluation of MathServe at CASC-20

MathServe took part in the demonstration division of the CADE ATP System Competition (CASC-20), 2005, in Tallin. For training the system we collected performance data of the ATP systems Ep, Otter, SPASS, and Vampire on the TPTP Library version 3.0.1. An XML representation of our data is available here.
Here you can see how the MathServe system performed compared to the top ATP systems Ep and Vampire:

System
Problems Given
Problems Solved
Percentage of All
MathServe 0.62
660
392
59.4
Ep 0.9pre3
660
409
62.0
Vampire 8.0
540
430
65.2

The fact that MathServe did not outperform the systems Ep and Vampire is mainly due to the significant improvement of the provers for the competition. These improved systems were not available for MathServe at the time of the competition. A detailed discussion of MathServe's perfomance on CASC and in further experiments can be found in this document [PDF].

MathServe at CASC-J3

An improved version of MathServe took part in the demonstration division of the 2006 CASC (CASC-J3). We added the systems DCTP, Paradox and Waldmeister, and improved MathServe's handling of very large problems (>2MB). Here you can see how the MathServe system performed compared to the top ATP systems Ep 0.99 and Vampire 8.1:

System
Problems Given
Problems Solved
Percentage of All
MathServe 0.8
600
414
69
Ep 0.99
660
402
67
Vampire 8.1
500
412
68.7

Overall, MathServe solved more problems than Ep and Vampire. However, on the problems of the SAT competition division, MathServe performed poorly (11 problems solved out of 100). This is due to the fact that MathServe is not told what to achieve for a given problem, so it simply assumes that it has to determine unsatisfiability for CNF problems and theoremhood for FOF problems. MathServe chooses reasoning systems according to that assumption.


Contact:
Last modified: Fri Sep 22 16:58:03 CEST 2006