We spent most of the time discussing the answers to the questions. Other issues raised in the discussion were:
Usability – the approach is only usable for reranking the output of some other parser.
Scalability – they only use 800 sentences and 20 candidates per sentence for training. We believe that for large datasets (millions of examples) this will become too complex.
Evaluation – it looks as if they used a non-standard evaluation metric to get “better” results. The standard here would be the F1-score.
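For reference, the F1-score mentioned in the evaluation point is the harmonic mean of precision and recall. The sketch below is only an illustration, not the evaluation used in the paper: it assumes that a parse is represented as a list of labeled spans (label, start, end), and the function name `f1_score` and the example spans are made up for this note.

```python
from collections import Counter


def f1_score(gold, predicted):
    """Return (precision, recall, F1) for two collections of hashable items.

    Assumption for this sketch: each item is a labeled span such as
    ("NP", 0, 2); any other hashable unit would work the same way.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: an item is matched at most as often
    # as it occurs in both the gold and the predicted collection.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / sum(pred_counts.values()) if pred_counts else 0.0
    recall = matched / sum(gold_counts.values()) if gold_counts else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the predicted parse gets one of four spans wrong.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
print(f1_score(gold, pred))  # (0.75, 0.75, 0.75)
```

Reporting precision and recall alongside F1 makes it easier to compare against other parsers than a single non-standard number.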