Institute of Formal and Applied Linguistics Wiki



name = CoNLL Dependency Treebanks | owner = zeman | path = /fs/clip-corpora/conll | version = 2006

The CoNLL-X Shared Task involved dependency parsing of the following languages:

  1. Arabic (Prague Arabic Dependency Treebank)
  2. Bulgarian (BulTreeBank)
  3. Chinese (Sinica Treebank)
  4. Czech (Prague Dependency Treebank)
  5. Danish (Danish Dependency Treebank)
  6. Dutch (Alpino Treebank)
  7. German (Tiger)
  8. Japanese (Verbmobil)
  9. Portuguese (Bosque)
  10. Slovene (Slovene Dependency Treebank)
  11. Spanish (Cast3LB)
  12. Swedish (Talbanken05)
  13. Turkish (METU-Sabanci Treebank)

Standardized data sets in a unified format are now available for all these languages. Note that the training/test splits differ from those that the treebanks may define outside the CoNLL context, because the organizers of the shared task needed to keep the test data secret until the designated date.
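For orientation, here is a minimal Python sketch of reading such data. It assumes the files follow the CoNLL-X shared task layout: one token per line, ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with a blank line between sentences. The function name `read_conllx` and the two-token sample are made up for illustration and are not taken from any of the treebanks.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the CoNLL-X format (ten tab-separated fields per token).
CONLLX_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel"]


def read_conllx(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts; blank lines separate sentences."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(values)}: {line!r}")
        sentence.append(dict(zip(CONLLX_FIELDS, values)))
    # Emit the last sentence even if the input lacks a trailing blank line.
    if sentence:
        yield sentence


# Hypothetical sample, invented for this sketch (not real treebank data).
SAMPLE = (
    "1\tDogs\tdog\tN\tNNS\t_\t2\tSBJ\t_\t_\n"
    "2\tbark\tbark\tV\tVBP\t_\t0\tROOT\t_\t_\n"
    "\n"
)

sentences = list(read_conllx(SAMPLE.splitlines()))
# HEAD is the ID of the governing token; 0 marks the root of the sentence.
heads = [(tok["form"], int(tok["head"])) for tok in sentences[0]]
```

To read a real file, pass an open file handle instead of the sample, e.g. `read_conllx(open(path, encoding="utf-8"))`, where `path` points to wherever the data for a given language is stored.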

Some treebanks are freely available (da, nl, pt, sv), some are freely available after signing a licence agreement (bg, ja, sl), some require an LDC licence (ar, cs), and some require their own separate licence (de, tr, zh).

I have been able to acquire ar, bg, zh, cs, da, nl, ja, pt, sl, sv. Except for da, nl, pt, sv, do not redistribute without talking to Dan Zeman.

