Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:eu [2011/11/29 09:24] zeman Documentation of part of speech tags. |
user:zeman:treebanks:eu [2011/11/29 10:25] zeman Inside. |
||
---|---|---|---|
Line 6: | Line 6: | ||
* CoNLL 2007 | * CoNLL 2007 | ||
- | * Extended version | + | * BDT-II |
==== Obtaining and License ==== | ==== Obtaining and License ==== | ||
There does not seem to be any regular distribution channel for the Basque Dependency Treebank. The CoNLL 2007 version had a restricted license for the duration of the shared task only. Republication of the CoNLL version in LDC is planned but it has not happened yet. In the meantime, one can ask Koldo Gojenola (koldo (dot) gojenola (at) ehu (dot) es) about availability of the corpus. | There does not seem to be any regular distribution channel for the Basque Dependency Treebank. The CoNLL 2007 version had a restricted license for the duration of the shared task only. Republication of the CoNLL version in LDC is planned but it has not happened yet. In the meantime, one can ask Koldo Gojenola (koldo (dot) gojenola (at) ehu (dot) es) about availability of the corpus. | ||
+ | |||
+ | Informally agreed upon terms: | ||
+ | * no redistribution | ||
+ | * cite the principal publication (see below) in publications | ||
BDT was created by members of the [[http:// | BDT was created by members of the [[http:// | ||
Line 28: | Line 32: | ||
==== Domain ==== | ==== Domain ==== | ||
- | Mixed (“GDT consists of randomly selected textual fragments | + | Newswire + unknown |
==== Size ==== | ==== Size ==== | ||
- | The CoNLL 2007 version contains 70223 tokens in 2902 sentences, yielding 24.20 tokens per sentence on average (CoNLL 2007 data split: 65419 tokens / 2705 sentences | + | The CoNLL 2007 dataset was officially |
- | ==== Inside ==== | + | ^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^ |
+ | | CoNLL 2007 | 3190 | 50526 | 334 | 5390 | | ||
+ | | BDT-II | 9094 | 124684 | 1010 | 12625 | 1122 | 14295 | 11226 | 151604 | 13.50 | |
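The BDT-II totals and the average sentence length in the table can be cross-checked from the per-split counts. The following is only a minimal sketch: the dictionary and the function name `summarize` are hypothetical, and the numbers are copied from the BDT-II row.

```python
# BDT-II split sizes copied from the table row: (sentences, tokens).
# The names below are illustrative, not part of any treebank tooling.
bdt2_splits = {
    "train": (9094, 124684),
    "d-test": (1010, 12625),
    "e-test": (1122, 14295),
}


def summarize(splits):
    """Return (total sentences, total tokens, mean tokens per sentence)."""
    sentences = sum(s for s, _ in splits.values())
    tokens = sum(t for _, t in splits.values())
    return sentences, tokens, tokens / sentences


total_sentences, total_tokens, mean_length = summarize(bdt2_splits)
print(total_sentences, total_tokens, f"{mean_length:.2f}")
```

The three splits sum to the totals given in the table, and the ratio rounds to the listed sentence length.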
- | The syntactic annotation style and the tagset for dependency relations (analytical functions) in GDT has been modeled after the [[http:// | + | ==== Inside ==== |
Part of speech tag description (obtained by e-mail from Koldo Gojenola, thanks!): | Part of speech tag description (obtained by e-mail from Koldo Gojenola, thanks!): | ||
Line 91: | Line 97: | ||
* ASP = aspect | * ASP = aspect | ||
* ERL = relation (relative sentence, completive sentence, indirect question...) | * ERL = relation (relative sentence, completive sentence, indirect question...) | ||
+ | |||
+ | The syntactic guidelines (structure and labels) are described in Spanish in this [[http:// | ||
==== Sample ==== | ==== Sample ==== |