Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks [2011/11/18 15:30] zeman Danish inside. |
user:zeman:treebanks [2011/11/19 13:08] zeman Greek sample. |
||
---|---|---|---|
Line 189: | Line 189: | ||
==== Inside ==== | ==== Inside ==== | ||
- | The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) columns | + | The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http:// |
The morphological analysis does not include lemmas. The morphosyntactic tags have been assigned (probably) manually. | The morphological analysis does not include lemmas. The morphosyntactic tags have been assigned (probably) manually. | ||
Line 1179: | Line 1179: | ||
==== Versions ==== | ==== Versions ==== | ||
- | * Original DDT 1.0 in [[http:// | + | * Original DDT 1.0 in the [[http:// |
* CoNLL 2006 | * CoNLL 2006 | ||
Line 1217: | Line 1217: | ||
==== Inside ==== | ==== Inside ==== | ||
- | The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) columns | + | The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http:// |
- | The morphosyntactic tags have been assigned (probably) manually. | + | The morphological analysis in the CoNLL 2006 version does not include lemmas (the original DTAG version does contain them). |
+ | |||
+ | Some multi-word expressions have been collapsed into one token, using underscore as the joining character. This includes adverbially used prepositional phrases (e.g. i_lørdags = last Saturday) but not named entities. | ||
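A minimal sketch of how one row of the CoNLL 2006 data could be read, assuming the usual tab-separated CoNLL-X layout with ten columns (the wiki tables on this page show the same columns separated by pipes). The function names ''parse_row'' and ''split_multiword'' are hypothetical helpers introduced only for this illustration; the example row is token 9 of the Danish training sample.

```python
# Column names of the CoNLL-X (2006) format; CPOS, POS and FEAT are columns 4-6.
CONLL_X_COLUMNS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


def parse_row(line):
    """Turn one tab-separated CoNLL-X row into a dict (hypothetical helper)."""
    values = line.rstrip("\n").split("\t")
    token = dict(zip(CONLL_X_COLUMNS, values))
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    # FEATS is a pipe-separated list; "_" means that the column is empty.
    token["feats"] = [] if token["feats"] == "_" else token["feats"].split("|")
    return token


def split_multiword(form):
    """Undo the underscore joining of collapsed multi-word expressions."""
    parts = [part for part in form.split("_") if part]
    # A form consisting only of underscores is kept as it is.
    return parts or [form]


row = "9\ti_lørdags\t_\tRG\tRG\tdegree=unmarked\t7\tmod\t_\t_"
token = parse_row(row)
print(token["cpostag"], token["postag"], token["feats"])
print(split_multiword(token["form"]))
```

Splitting on the underscore is only safe under the assumption that no ordinary word form contains that character.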
==== Sample ==== | ==== Sample ==== | ||
- | The first three sentences | + | The first sentence |
- | | 1 | Глава | _ | N | Nc | _ | 0 | ROOT | 0 | ROOT | | + | <code xml>< |
- | | 2 | трета | _ | M | Mo | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 1 | mod | 1 | mod | | + | < |
- | | |||||||||| | + | |
- | | 1 | НАРОДНО | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod | | + | < |
- | | 2 | СЪБРАНИЕ | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT | | + | < |
- | | |||||||||| | + | </ |
- | | 1 | Народното | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod | 2 | mod | | + | <extent words=158>158 running words</extent> |
- | | 2 | събрание | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 3 | subj | 3 | subj | | + | < |
- | | 3 | осъществява | _ | V | Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 0 | ROOT | 0 | ROOT | | + | < |
- | | 4 | законодателната | _ | A | Af | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 5 | mod | 5 | mod | | + | < |
- | | 5 | власт | _ | N | Nc | _ | 3 | obj | 3 | obj | | + | |
- | | 6 | и | _ | C | Cp | _ | 3 | conj | 3 | conj | | + | < |
- | | 7 | упражнява | _ | V | Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 3 | conjarg | + | </ |
- | | 8 | парламентарен | _ | A | Am | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 9 | mod | 9 | mod | | + | < |
- | | 9 | контрол | _ | N | Nc | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 7 | obj | 7 | obj | | + | < |
- | | 10 | . | _ | Punct | Punct | _ | 3 | punct | 3 | punct | | + | < |
+ | < | ||
+ | <author gender=m born=1925> | ||
+ | | ||
+ | < | ||
+ | <imprint>< | ||
+ | | ||
+ | <date>1992-12-01</date> | ||
+ | | ||
+ | | ||
+ | </biblStruct> | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | < | ||
+ | <catRef target=" | ||
+ | < | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | <text id=AJK> | ||
+ | < | ||
+ | <div1 type=main> | ||
+ | <p> | ||
+ | <s> | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma="," | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma=" | ||
+ | <W lemma="& | ||
+ | <W lemma=" | ||
+ | <W lemma="& | ||
+ | <W lemma=" | ||
+ | </ | ||
- | The first three sentences | + | The first sentence |
- | | 1 | Единственото | + | | 1 | Samme | _ | A | AN | degree=pos< |
- | | 2 | решение | + | | 2 | cifre | _ | N | NC | gender=neuter< |
- | | |||||||||| | + | | 3 | , | _ | X | XP | _ | 1 | pnct | _ | _ | |
- | | 1 | Ерик | + | | 4 | de | _ | P | PD | gender=common/ |
- | | 2 | Франк | + | | 5 | norske | _ | A | AN | degree=pos< |
- | | 3 | Ръсел | + | | 6 | piger | _ | N | NC | gender=common< |
- | | |||||||||| | + | | 7 | tabte | _ | V | VA | mood=indic< |
- | | 1 | Пълен | + | | 8 | med | _ | SP | SP | _ | 7 | pobj | _ | _ | |
- | | 2 | мрак | + | | 9 | i_lørdags | _ | RG | RG | degree=unmarked | 7 | mod | _ | _ | |
- | | 3 | и | _ | C | Cp | _ | 2 | conj | 2 | conj | | + | | 10 | mod | _ | SP | SP | _ | 7 | pobj | _ | _ | |
- | | 4 | пълна | + | | 11 | VMs | _ | N | NP | case=gen | 10 | nobj | _ | _ | |
- | | 5 | самота | + | | 12 | værtsnation | _ | N | NC | gender=common< |
- | | 6 | . | _ | Punct | Punct | _ | 2 | punct | 2 | punct | | + | | 13 | . | _ | X | XP | _ | 1 | pnct | _ | _ | |
+ | |||
+ | The first sentence of the CoNLL 2006 test data: | ||
+ | |||
+ | | 1 | To | _ | A | AC | case=unmarked | 10 | subj | _ | _ | | ||
+ | | 2 | kendte | _ | A | AN | degree=pos< | ||
+ | | 3 | russiske | _ | A | AN | degree=pos< | ||
+ | | 4 | historikere | ||
+ | | 5 | Andronik | _ | N | NP | case=unmarked | 6 | namef | _ | _ | | ||
+ | | 6 | Mirganjan | _ | N | NP | case=unmarked | ||
+ | | 7 | og | _ | C | CC | _ | 6 | coord | _ | _ | | ||
+ | | 8 | Igor | _ | N | NP | case=unmarked | 9 | namef | _ | _ | | ||
+ | | 9 | Klamkin | _ | N | NP | case=unmarked | 7 | conj | _ | _ | | ||
+ | | 10 | tror | _ | V | VA | mood=indic< | ||
+ | | 11 | ikke | _ | RG | RG | degree=unmarked | 10 | mod | _ | _ | | ||
+ | | 12 | , | _ | X | XP | _ | 10 | pnct | _ | _ | | ||
+ | | 13 | at | _ | C | CS | _ | 10 | dobj | _ | _ | | ||
+ | | 14 | Rusland | _ | N | NP | case=unmarked | 15 | subj | _ | _ | | ||
+ | | 15 | kan | _ | V | VA | mood=indic< | ||
+ | | 16 | udvikles | _ | V | VA | mood=infin< | ||
+ | | 17 | uden | _ | SP | SP | _ | 15 | mod | _ | _ | | ||
+ | | 18 | en | _ | P | PI | gender=common< | ||
+ | | 19 | " | _ | X | XP | _ | 20 | pnct | _ | _ | | ||
+ | | 20 | jernnæve | ||
+ | | 21 | " | _ | X | XP | _ | 20 | pnct | _ | _ | | ||
+ | | 22 | . | _ | X | XP | _ | 10 | pnct | _ | _ | | ||
+ | |||
+ | ==== Parsing ==== | ||
+ | |||
+ | Nonprojective attachments are rare in DDT. Only 988 of the 100,238 tokens in the CoNLL 2006 version are attached nonprojectively (0.99%). | ||
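The following sketch shows one common way to count nonprojectively attached tokens. It assumes the definition that a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The function name and the toy head list are illustrative assumptions, and the page does not say which exact definition or tool produced the figure above.

```python
def nonprojective_tokens(heads):
    """Return the ids of tokens whose incoming arc is nonprojective.

    heads[i] is the head of token i+1 (ids are 1-based, 0 is the artificial root).
    """
    n = len(heads)

    def is_descendant(node, ancestor):
        # Walk up the chain of heads; the artificial root 0 dominates everything.
        steps = 0
        while node != 0 and steps <= n:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return node == ancestor

    result = []
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = min(head, dep), max(head, dep)
        # The arc is nonprojective if any token in the gap is outside the head's subtree.
        if any(not is_descendant(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Toy example (not taken from DDT): token 2 depends on token 4 across token 3.
print(nonprojective_tokens([3, 4, 0, 3]))
# Share of nonprojective tokens reported above, in percent.
print(round(988 / 100238 * 100, 2))
```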
+ | |||
+ | The results of the CoNLL 2006 shared task are [[http:// | ||
+ | |||
+ | ^ Parser (Authors) ^ LAS ^ UAS ^ | ||
+ | | MST (McDonald et al.) | 84.79 | 90.58 | | ||
+ | | Malt (Nivre et al.) | 84.77 | 89.80 | | ||
+ | | Riedel et al. | 83.63 | 89.66 | | ||
+ | |||
+ | ===== German (de) ===== | ||
+ | |||
+ | [[http:// | ||
+ | |||
+ | ==== Versions ==== | ||
+ | |||
+ | * TIGER Treebank 1 (2003) | ||
+ | * TIGER Treebank 2 (2005) | ||
+ | * TIGER Treebank 2.1 (2007) in [[http:// | ||
+ | * CoNLL 2006 | ||
+ | * CoNLL 2009 | ||
+ | |||
+ | ==== Obtaining and License ==== | ||
+ | |||
+ | The TIGER Treebank is freely downloadable after you accept the [[http:// | ||
+ | |||
+ | Republication of the two CoNLL versions in LDC is planned but it has not happened yet. | ||
+ | |||
+ | The license in short: | ||
+ | |||
+ | * non-commercial research and evaluation usage by academic or educational institutions | ||
+ | * no redistribution | ||
+ | * acknowledge the use of the corpus in publications | ||
+ | |||
+ | The TIGER Treebank was created by members of three institutes: | ||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | |||
+ | ==== References ==== | ||
+ | |||
+ | * Website | ||
+ | * http:// | ||
+ | * Data | ||
+ | * //no separate citation// | ||
+ | * Principal publications | ||
+ | * Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, George Smith: [[http:// | ||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | * Berthold Crysmann, Silvia Hansen-Schirra, | ||
+ | * Stefanie Albert, Jan Anderssen, Regine Bader, Stephanie Becker, Tobias Bracht, Sabine Brants, Thorsten Brants, Vera Demberg, Stefanie Dipper, Peter Eisenberg, Silvia Hansen, Hagen Hirschmann, Juliane Janitzek, Carolin Kirstein, Robert Langner, Lukas Michelbacher, | ||
+ | * The header of the XML version of the TIGER Treebank contains lists of various sorts of tags with brief explanations. | ||
+ | |||
+ | ==== Domain ==== | ||
+ | |||
+ | Mostly newswire (Frankfurter Rundschau). | ||
+ | |||
+ | ==== Size ==== | ||
+ | |||
+ | According to their website, the TIGER Treebank version 1 contains approximately 700,000 tokens in 40,000 sentences. Version 2.1 contains approximately 900,000 tokens in 50,000 sentences. | ||
+ | |||
+ | The CoNLL 2006 version contains 705,304 tokens in 39,573 sentences, yielding 17.82 tokens per sentence on average (CoNLL 2006 data split: 699,610 tokens / 39,216 sentences training, 5,694 tokens / 357 sentences test). | ||
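Figures of this kind can be recomputed from the data files. The sketch below assumes the usual CoNLL layout (one token per line, sentences separated by blank lines); ''conll_stats'' is a hypothetical helper name, and the three-token input is made up for the illustration.

```python
def conll_stats(lines):
    """Count tokens and sentences in CoNLL-style data (blank line = sentence break)."""
    tokens = 0
    sentences = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            tokens += 1
            in_sentence = True
        elif in_sentence:
            sentences += 1
            in_sentence = False
    if in_sentence:
        # The last sentence may not be followed by a blank line.
        sentences += 1
    return tokens, sentences


sample = "1\tRoss\t_\n2\tPerot\t_\n\n1\tZwei\t_\n".splitlines()
print(conll_stats(sample))

# Consistency check of the CoNLL 2006 figures quoted above.
train_tokens, test_tokens = 699610, 5694
train_sentences, test_sentences = 39216, 357
total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences
print(total_tokens, total_sentences, round(total_tokens / total_sentences, 2))
```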
+ | |||
+ | The CoNLL 2009 version contains 712,332 tokens in 40,020 sentences, yielding 17.80 tokens per sentence on average (CoNLL 2009 data split: 648,677 tokens / 36,020 sentences training, 32,033 tokens / 2,000 sentences development, | ||
+ | |||
+ | ==== Inside ==== | ||
+ | |||
+ | All versions contain // | ||
+ | |||
+ | It is not clear what the // | ||
+ | |||
+ | The original treebank is phrase-based. The dependencies in the CoNLL versions must therefore have been derived using a head-selection procedure. Besides the CoNLL data, the TIGER project also provides a subset of the TIGER Treebank in a dependency format. | ||
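To illustrate what a head-selection procedure does in general, here is a toy sketch. It is not the procedure that was used to produce the CoNLL data. The priority list ''HEAD_LABELS'', the rightmost-child fallback, the ''Node'' class and the three-word example tree are all assumptions made for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str          # category of a nonterminal, or word form of a terminal
    index: int = 0      # 1-based token position (terminals only)
    children: list = field(default_factory=list)  # (edge_label, Node) pairs


# Assumed priority of edge labels when picking the head child (illustrative only).
HEAD_LABELS = ("HD", "NK")


def lexical_head(node):
    """Return the token index of the lexical head of a (sub)tree."""
    if not node.children:
        return node.index
    for wanted in HEAD_LABELS:
        matches = [child for label, child in node.children if label == wanted]
        if matches:
            return lexical_head(matches[-1])  # rightmost child with that label
    return lexical_head(node.children[-1][1])  # fallback: rightmost child


def to_dependencies(root):
    """Map every token index to the index of its head (0 = artificial root)."""
    heads = {lexical_head(root): 0}
    stack = [root]
    while stack:
        node = stack.pop()
        head = lexical_head(node)
        for _, child in node.children:
            child_head = lexical_head(child)
            if child_head != head:
                # The head of a non-head child depends on the head of the phrase.
                heads[child_head] = head
            stack.append(child)
    return heads


# Invented toy tree: S -> SB "Perot", HD "spricht", MO "oft"
tree = Node("S", children=[
    ("SB", Node("Perot", 1)),
    ("HD", Node("spricht", 2)),
    ("MO", Node("oft", 3)),
])
print(to_dependencies(tree))
```

Because heads are identified by token position rather than by adjacency, the same walk also works for discontinuous phrases, which is where nonprojective dependencies would come from.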
+ | |||
+ | ==== Sample ==== | ||
+ | |||
+ | The first sentence of TIGER Treebank 2.1 in the TIGER-XML format: | ||
+ | |||
+ | <code xml>< | ||
+ | <graph root=" | ||
+ | < | ||
+ | <t id=" | ||
+ | <t id=" | ||
+ | <t id=" | ||
+ | <t id=" | ||
+ | <t id=" | ||
+ | <t id=" | ||
+ | <t id=" | ||
+ | <t id=" | ||
+ | <t id=" | ||
+ | </ | ||
+ | < | ||
+ | <nt id=" | ||
+ | <edge label=" | ||
+ | <edge label=" | ||
+ | </ | ||
+ | <nt id=" | ||
+ | <edge label=" | ||
+ | <edge label=" | ||
+ | <edge label=" | ||
+ | </ | ||
+ | <nt id=" | ||
+ | <edge label=" | ||
+ | <edge label=" | ||
+ | <edge label=" | ||
+ | <edge label=" | ||
+ | </ | ||
+ | <nt id=" | ||
+ | <edge label=" | ||
+ | <edge label=" | ||
+ | <edge label=" | ||
+ | </ | ||
+ | </ | ||
+ | </ | ||
+ | </ | ||
+ | |||
+ | The first sentence of the CoNLL 2006 training data: | ||
+ | |||
+ | | 1 | `` | _ | $( | $( | _ | 4 | PUNC | 4 | PUNC | | ||
+ | | 2 | Ross | _ | NE | NE | _ | 4 | SB | 4 | SB | | ||
+ | | 3 | Perot | _ | NE | NE | _ | 2 | PNC | 2 | PNC | | ||
+ | | 4 | wäre | _ | VAFIN | VAFIN | _ | 0 | ROOT | 0 | ROOT | | ||
+ | | 5 | vielleicht | _ | ADV | ADV | _ | 4 | MO | 4 | MO | | ||
+ | | 6 | ein | _ | ART | ART | _ | 8 | NK | 8 | NK | | ||
+ | | 7 | prächtiger | _ | ADJA | ADJA | _ | 8 | NK | 8 | NK | | ||
+ | | 8 | Diktator | _ | NN | NN | _ | 4 | PD | 4 | PD | | ||
+ | | 9 | < | ||
+ | |||
+ | The first sentence of the CoNLL 2006 test data: | ||
+ | |||
+ | | 1 | Zwei | _ | CARD | CARD | _ | 2 | NK | 2 | NK | | ||
+ | | 2 | Themen | _ | NN | NN | _ | 14 | SB | 14 | SB | | ||
+ | | 3 | , | _ | $, | $, | _ | 2 | PUNC | 2 | PUNC | | ||
+ | | 4 | die | _ | PRELS | PRELS | _ | 8 | OA | 8 | OA | | ||
+ | | 5 | Perot | _ | NE | NE | _ | 8 | SB | 8 | SB | | ||
+ | | 6 | immer | _ | ADV | ADV | _ | 7 | MO | 7 | MO | | ||
+ | | 7 | wieder | _ | ADV | ADV | _ | 8 | MO | 8 | MO | | ||
+ | | 8 | anspricht | _ | VVFIN | VVFIN | _ | 2 | RC | 2 | RC | | ||
+ | | 9 | , | _ | $, | $, | _ | 2 | PUNC | 2 | PUNC | | ||
+ | | 10 | Rezession | _ | NN | NN | _ | 2 | APP | 2 | APP | | ||
+ | | 11 | und | _ | KON | KON | _ | 10 | CD | 10 | CD | | ||
+ | | 12 | Bürokratie | _ | NN | NN | _ | 10 | CJ | 10 | CJ | | ||
+ | | 13 | , | _ | $, | $, | _ | 14 | PUNC | 14 | PUNC | | ||
+ | | 14 | machen | _ | VVFIN | VVFIN | _ | 0 | ROOT | 0 | ROOT | | ||
+ | | 15 | ihnen | _ | PPER | PPER | _ | 18 | DA | 18 | DA | | ||
+ | | 16 | besonders | _ | ADV | ADV | _ | 18 | MO | 18 | MO | | ||
+ | | 17 | zu | _ | PTKZU | PTKZU | _ | 18 | PM | 18 | PM | | ||
+ | | 18 | schaffen | _ | VVINF | VVINF | _ | 14 | OC | 14 | OC | | ||
+ | | 19 | . | _ | $. | $. | _ | 14 | PUNC | 14 | PUNC | | ||
+ | |||
+ | The first sentence of the CoNLL 2009 training data: | ||
+ | |||
+ | | 1 | `` | _ | `` | $( | $( | _ | _ | 4 | 4 | PUNC | PUNC | _ | _ | | ||
+ | | 2 | Ross | Ross | Roß | NE | NN | Nom< | ||
+ | | 3 | Perot | Perot | Perot | NE | NE | Nom< | ||
+ | | 4 | wäre | sein | sein | VAFIN | VAFIN | 3< | ||
+ | | 5 | vielleicht | vielleicht | vielleicht | ADV | ADV | _ | _ | 4 | 4 | MO | MO | _ | _ | | ||
+ | | 6 | ein | ein | ein | ART | ART | Nom< | ||
+ | | 7 | prächtiger | prächtig | prächtig | ADJA | ADJA | Pos< | ||
+ | | 8 | Diktator | Diktator | Diktator | NN | NN | Nom< | ||
+ | | 9 | < | ||
+ | |||
+ | The first sentence of the CoNLL 2009 development data: | ||
+ | |||
+ | | 1 | Maschinenbau | Maschinenbau | Maschinenbau | NN | NN | Nom< | ||
+ | | 2 | / | _ | / | $( | $( | _ | _ | 0 | 1 | PUNC | PUNC | _ | _ | | ||
+ | | 3 | ( | _ | ( | $( | $( | _ | _ | 0 | 4 | PUNC | PUNC | _ | _ | | ||
+ | | 4 | Zusammenfassung | Zusammenfassung | Zusammenfassung | NN | NN | Nom< | ||
+ | | 5 | ) | _ | ) | $( | $( | _ | _ | 0 | 1 | PUNC | PUNC | _ | _ | | ||
+ | |||
+ | The first sentence of the CoNLL 2009 test data: | ||
+ | |||
+ | | 1 | Gegen | gegen | gegen | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | | ||
+ | | 2 | eine | ein | ein | ART | ART | Acc< | ||
+ | | 3 | Erweiterung | Erweiterung | Erweiterung | NN | NN | Acc< | ||
+ | | 4 | ihrer | ihr | ihr | PPOSAT | PPOSAT | Gen< | ||
+ | | 5 | Organisation | Organisation | Organisation | NN | NN | Gen< | ||
+ | | 6 | zu | zu | zu | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | | ||
+ | | 7 | einem | ein | ein | ART | ART | Dat< | ||
+ | | 8 | sicherheitspolitischen | sicherheitspolitisch | sicherheitspolitisch | ADJA | ADJA | Pos< | ||
+ | | 9 | Forum | Forum | Forum | NN | NN | Dat< | ||
+ | | 10 | sprachen | sprechen | sprechen | VVFIN | VVFIN | 3< | ||
+ | | 11 | sich | sich | er< | ||
+ | | 12 | die | der | d | ART | ART | Nom< | ||
+ | | 13 | meisten | meister | meist | PIAT | PIAT | Nom< | ||
+ | | 14 | Staaten | Staat | Staat | NN | NN | Nom< | ||
+ | | 15 | beim | bei | beim | APPRART | APPRART | Dat< | ||
+ | | 16 | Gipfeltreffen | Gipfeltreffen | Gipfeltreffen | NN | NN | Dat< | ||
+ | | 17 | für | für | für | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | | ||
+ | | 18 | Asiatisch-Pazifische | asiatisch-pazifisch | Asiatisch-Pazifische | ADJA | NN | Pos< | ||
+ | | 19 | Wirtschaftskooperation | Wirtschaftskooperation | Wirtschaftskooperation | NN | NN | Acc< | ||
+ | | 20 | ( | _ | ( | $( | $( | _ | _ | _ | _ | _ | _ | _ | | ||
+ | | 21 | Apec | Apec | _ | NE | NE | Nom< | ||
+ | | 22 | ) | _ | ) | $( | $( | _ | _ | _ | _ | _ | _ | _ | | ||
+ | | 23 | in | in | in | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | | ||
+ | | 24 | Osaka | Osaka | Osaka | NE | NE | Dat< | ||
+ | | 25 | aus | aus | aus | PTKVZ | PTKVZ | _ | _ | _ | _ | _ | _ | _ | | ||
+ | | 26 | . | _ | . | $. | $. | _ | _ | _ | _ | _ | _ | _ | | ||
+ | |||
+ | ==== Parsing ==== | ||
+ | |||
+ | TIGER is a mildly nonprojective treebank. 15,875 of the 680,710 tokens in the CoNLL 2009 training+development datasets are attached nonprojectively (2.33%). | ||
+ | |||
+ | The results of the CoNLL 2006 shared task are [[http:// | ||
+ | |||
+ | ^ Parser (Authors) ^ LAS ^ UAS ^ | ||
+ | | MST (McDonald et al.) | 87.34 | 90.38 | | ||
+ | | Riedel et al. | 86.24 | 89.76 | | ||
+ | | Basis (O' | ||
+ | | Malt (Nivre et al.) | 85.82 | 88.76 | | ||
+ | |||
+ | The results of the CoNLL 2009 shared task are [[http:// | ||
+ | |||
+ | ^ Parser (Authors) ^ LAS ^ | ||
+ | | Bohnet | 87.48 | | ||
+ | | Merlo | 87.29 | | ||
+ | | Chen | 86.24 | | ||
+ | | Che | 86.19 | | ||
+ | |||
+ | ===== Greek (el) ===== | ||
+ | |||
+ | Greek Dependency Treebank (GDT) | ||
+ | |||
+ | ==== Versions ==== | ||
+ | |||
+ | * CoNLL 2007 | ||
+ | |||
+ | ==== Obtaining and License ==== | ||
+ | |||
+ | There does not seem to be any regular distribution channel for the Greek Dependency Treebank. The CoNLL 2007 version had a restricted license for the duration of the shared task only. Republication of the CoNLL version in LDC is planned but it has not happened yet. In the meantime, one can ask Prokopis Prokopidis (prokopis (at) ilsp (dot) gr) about the availability of the corpus. | ||
+ | |||
+ | GDT was created by members of the [[http:// | ||
+ | |||
+ | ==== References ==== | ||
+ | |||
+ | * Website | ||
+ | * //no website dedicated to the treebank// | ||
+ | * Data | ||
+ | * //no separate citation// | ||
+ | * Principal publications | ||
+ | * Prokopis Prokopidis, Elina Desipri, Maria Koutsombogera, | ||
+ | * Documentation | ||
+ | * Description of tags and feature values is provided in the '' | ||
+ | |||
+ | ==== Domain ==== | ||
+ | |||
+ | Mixed (“GDT consists of randomly selected textual fragments and texts in three domains: politics (current affairs, manual transcripts and minutes of European parliamentary sessions), health, and travel.”) | ||
+ | |||
+ | ==== Size ==== | ||
+ | |||
+ | The CoNLL 2007 version contains 70,223 tokens in 2,902 sentences, yielding 24.20 tokens per sentence on average (CoNLL 2007 data split: 65,419 tokens / 2,705 sentences training, 4,804 tokens / 197 sentences test). | ||
+ | |||
+ | ==== Inside ==== | ||
+ | |||
+ | The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http:// | ||
+ | |||
+ | The morphological analysis includes lemmas (the LEMMA column is filled in the samples below). The morphosyntactic tags have been assigned (probably) manually. | ||
+ | |||
+ | The guidelines for syntactic annotation are documented in the other [[http:// | ||
+ | |||
+ | ==== Sample ==== | ||
+ | |||
+ | The first sentence of the CoNLL 2007 training data: | ||
+ | |||
+ | | 1 | " | " | PUNCT | PUNCT | _ | 10 | AuxG | _ | _ | | ||
+ | | 2 | Τα | ο | At | AtDf | Ne< | ||
+ | | 3 | αντισώματα | αντίσωμα | No | NoCm | Ne< | ||
+ | | 4 | IgG | IgG | Rg | RgFwOr | _ | 3 | Atr | _ | _ | | ||
+ | | 5 | είναι | είμαι | Vb | VbMn | Id< | ||
+ | | 6 | σαν | σαν | Ad | Ad | Ba | 5 | Adv | _ | _ | | ||
+ | | 7 | μακροπρόθεσμη | μακροπρόθεσμος | Aj | Aj | Ba< | ||
+ | | 8 | μνήμη | μνήμη | No | NoCm | Fe< | ||
+ | | 9 | , | , | PUNCT | PUNCT | _ | 10 | AuxX | _ | _ | | ||
+ | | 10 | ενώ | ενώ | Cj | CjCo | _ | 26 | Coord | _ | _ | | ||
+ | | 11 | το | ο | At | AtDf | Ne< | ||
+ | | 12 | IgA | IgA | Rg | RgFwOr | _ | 15 | Sb | _ | _ | | ||
+ | | 13 | πιστεύεται | πιστεύεται | Vb | VbMn | Id< | ||
+ | | 14 | ότι | ότι | Cj | CjSb | _ | 13 | AuxC | _ | _ | | ||
+ | | 15 | είναι | είμαι | Vb | VbMn | Id< | ||
+ | | 16 | ένας | ένας | At | AtId | Ma< | ||
+ | | 17 | συγκεκριμένος | συγκεκριμένος | Aj | Aj | Ba< | ||
+ | | 18 | δείκτης | δείκτης | No | NoCm | Ma< | ||
+ | | 19 | για | για | AsPp | AsPpSp | _ | 18 | AuxP | _ | _ | | ||
+ | | 20 | πρόσφατες | πρόσφατος | Aj | Aj | Ba< | ||
+ | | 21 | ή | ή | Cj | CjCo | _ | 23 | Coord | _ | _ | | ||
+ | | 22 | χρόνιες | χρόνιος | Aj | Aj | Ba< | ||
+ | | 23 | λοιμώξεις | λοίμωξη | No | NoCm | Fe< | ||
+ | | 24 | " | " | PUNCT | PUNCT | _ | 10 | AuxG | _ | _ | | ||
+ | | 25 | , | , | PUNCT | PUNCT | _ | 10 | AuxX | _ | _ | | ||
+ | | 26 | εξηγεί | εξηγώ | Vb | VbMn | Id< | ||
+ | | 27 | η | ο | At | AtDf | Fe< | ||
+ | | 28 | Δρ | Δρ | Rg | RgFwTr | _ | 26 | Sb | _ | _ | | ||
+ | | 29 | Αρκάρι | Αρκάρι | No | NoCm | Ne< | ||
+ | | 30 | . | . | PUNCT | PUNCT | _ | 0 | AuxK | _ | _ | | ||
+ | |||
+ | The first sentence of the CoNLL 2007 test data: | ||
+ | |||
+ | | 1 | Η | ο | At | AtDf | Fe< | ||
+ | | 2 | Σίφνος | Σίφνος | No | NoPr | Fe< | ||
+ | | 3 | φημίζεται | φημίζομαι | Vb | VbMn | Id< | ||
+ | | 4 | και | και | Cj | CjCo | _ | 5 | AuxY | _ | _ | | ||
+ | | 5 | για | για | AsPp | AsPpSp | _ | 3 | AuxP | _ | _ | | ||
+ | | 6 | τα | ο | At | AtDf | Ne< | ||
+ | | 7 | καταγάλανα | καταγάλανος | Aj | Aj | Ba< | ||
+ | | 8 | νερά | νερό | No | NoCm | Ne< | ||
+ | | 9 | των | ο | At | AtDf | Fe< | ||
+ | | 10 | πανέμορφων | πανέμορφος | Aj | Aj | Ba< | ||
+ | | 11 | ακτών | ακτή | No | NoCm | Fe< | ||
+ | | 12 | της | μου | Pn | PnPo | Fe< | ||
+ | | 13 | . | . | PUNCT | PUNCT | _ | 0 | AuxK | _ | _ | | ||
==== Parsing ==== | ==== Parsing ==== |