name = CoNLL Dependency Treebanks | owner = zeman | path = /fs/clip-corpora/conll | version = 2006
The CoNLL-X Shared Task (2006) involved dependency parsing of the following languages:
- Arabic (Prague Arabic Dependency Treebank)
- Bulgarian (BulTreeBank)
- Chinese (Sinica Treebank)
- Czech (Prague Dependency Treebank)
- Danish (Danish Dependency Treebank)
- Dutch (Alpino Treebank)
- German (Tiger)
- Japanese (Verbmobil)
- Portuguese (Bosque)
- Slovene (Slovene Dependency Treebank)
- Spanish (Cast3LB)
- Swedish (Talbanken05)
- Turkish (METU-Sabanci Treebank)
Standardized data sets in a unified format now exist for all these languages. Note that the training/test splits differ from those the treebanks may define outside the CoNLL context, because the organizers of the shared task needed to keep the test data secret until the evaluation date.
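As an illustration of working with the unified format, here is a minimal reader sketch. It assumes the standard CoNLL-X layout: one token per line, ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), a blank line between sentences, and "_" for an empty value. The function name read_conllx and the dictionary keys are made up for this example and are not part of any distributed tool.

```python
# Column names of the CoNLL-X format (tab-separated, one token per line).
# The lowercase key names are an arbitrary choice for this sketch.
CONLLX_FIELDS = ("id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel")


def read_conllx(lines):
    """Yield sentences as lists of token dicts; blank lines separate sentences."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(CONLLX_FIELDS):
            raise ValueError("expected 10 columns, got %d: %r" % (len(values), line))
        token = dict(zip(CONLLX_FIELDS, values))
        token["id"] = int(token["id"])
        # HEAD may be "_" when the annotation is withheld (assumption: blind test data).
        token["head"] = None if token["head"] == "_" else int(token["head"])
        sentence.append(token)
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence


if __name__ == "__main__":
    # Tiny hand-made sample, not taken from any of the treebanks.
    sample = [
        "1\tJohn\tjohn\tN\tNNP\t_\t2\tSUBJ\t_\t_\n",
        "2\tsleeps\tsleep\tV\tVBZ\t_\t0\tROOT\t_\t_\n",
        "\n",
    ]
    for sent in read_conllx(sample):
        for tok in sent:
            print(tok["id"], tok["form"], tok["head"], tok["deprel"])
```

The reader accepts any iterable of lines, so an open file object can be passed in directly. Open the files with the encoding of the data set in question (UTF-8 is the usual assumption for the CoNLL-X releases, but check the files for each language).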
Some treebanks are freely available (da, nl, pt, sv), some are freely available after signing a licence agreement (bg, ja, sl), some require an LDC licence (ar, cs), and some require their own separate licence (de, tr, zh).
I have been able to acquire ar, bg, zh, cs, da, nl, ja, pt, sl, sv. Except for da, nl, pt and sv, do not redistribute the data without first talking to Dan Zeman.