Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:fa [2012/01/29 18:19]
zeman Update. I have seen the data!
user:zeman:treebanks:fa [2012/01/29 21:10]
zeman Size.
Line 35:
 ==== Size ====
  
-12200 annotated sentences.
+PDT contains 189,572 tokens in 12,455 sentences, yielding 15.22 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 182,878 tokens / 12,126 sentences for training and the remaining 6,694 tokens / 329 sentences for testing.
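
The size figures can be cross-checked with a few lines of arithmetic. This is only a minimal sketch in Python: the variable and function names are illustrative and do not come from the treebank distribution or from HamleDT, and `split_first_n` assumes the sentences are already loaded as a list in corpus order.

```python
# Figures quoted in the Size section of this page.
total_tokens, total_sentences = 189_572, 12_455
train_tokens, train_sentences = 182_878, 12_126
test_tokens, test_sentences = 6_694, 329


def tokens_per_sentence(tokens, sentences):
    """Average sentence length, rounded to two decimals."""
    return round(tokens / sentences, 2)


def split_first_n(sentences, n_train):
    """Hypothetical helper: first n_train sentences for training, rest for testing."""
    return sentences[:n_train], sentences[n_train:]


avg = tokens_per_sentence(total_tokens, total_sentences)

# The training and test parts should add up to the whole corpus.
split_is_consistent = (
    train_tokens + test_tokens == total_tokens
    and train_sentences + test_sentences == total_sentences
)

print(avg, split_is_consistent)
```

Because no official split exists, cutting at a fixed sentence index as in `split_first_n` keeps the division reproducible as long as the corpus order does not change.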
  
 ==== Inside ====
Line 42:
  
 ==== Sample ====
 +
 +The first sentence of the corpus in the CoNLL format:
 +
 +| 1 | به | به | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 2 | گزارش | گزارش | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 1 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 3 | خبرنگار | خبرنگار | N | ANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 2 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 4 | مهر | مهر | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 3 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 5 | در | در | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 3 | NPP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 6 | گرگان | گرگان | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 5 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 7 | <nowiki>،</nowiki> | <nowiki>،</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 6 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 8 | بر | بر | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 9 | اساس | اساس | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 8 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 10 | باورهای | باور | N | IANM | <nowiki>attachment=ISO|number=PLUR|senID=23472</nowiki> | 9 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 11 | دینی | دینی | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 10 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 12 | <nowiki>ترکمن‌ها</nowiki> | ترکمن | N | ANM | <nowiki>attachment=ISO|number=PLUR|senID=23472</nowiki> | 10 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 13 | در | در | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 14 | این | این | PREM | DEMAJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 15 | NPREMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 15 | روز | روز | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 13 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 16 | برای | برای | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | NPP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 17 | پیامبر | پیامبر | N | ANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 16 | VPP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 18 | اکرم | اکرم | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 17 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 19 | <nowiki>(</nowiki> | <nowiki>(</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 20 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 20 | ص | ص | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 17 | APP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 21 | <nowiki>)</nowiki> | <nowiki>)</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 20 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 22 | ناراحتی | ناراحتی | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 26 | SBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 23 | و | و | CONJ | CONJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 22 | NCONJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 24 | بیماری | بیماری | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 23 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 25 | رخ | رخ | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 26 | NVE | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 26 | داد | <nowiki>داد#ده</nowiki> | V | ACT | <nowiki>person=3|attachment=ISO|number=SING|tma=GS|senID=23472</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 27 | که | که | SUBR | SUBR | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | AJUCL | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 28 | چند | چند | PREM | AMBAJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 29 | NPREMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 29 | روز | روز | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 39 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 30 | بعد | بعد | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 29 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 31 | با | با | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 39 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 32 | رحلت | رحلت | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 31 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 33 | نبی | نبی | N | ANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 32 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 34 | مکرم | مکرم | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 33 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 35 | اسلام | اسلام | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 33 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 36 | جهان | جهان | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 39 | SBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 37 | عزادار | عزادار | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 39 | MOS | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 38 | ماتمش | ماتم | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 37 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 39 | شد | <nowiki>کرد#کن</nowiki> | V | PASS | <nowiki>person=3|attachment=ISO|number=SING|tma=GS|senID=23472</nowiki> | 27 | PRD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 40 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
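
The table rows above follow the ten-column CoNLL layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, and two unused columns). The sketch below shows one way such a wiki table row could be turned back into a token record. It is an illustration only: `Token` and `parse_row` are hypothetical names, and the code assumes that every cell containing a literal `|` is wrapped in `<nowiki>` tags, as in the sample.

```python
import re
from typing import NamedTuple

NOWIKI = re.compile(r"<nowiki>(.*?)</nowiki>")
PLACEHOLDER = "\x00"  # stands in for "|" inside <nowiki> while splitting cells


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: dict
    head: int
    deprel: str


def parse_row(row):
    """Parse one wiki table row of the sample into a Token (illustrative helper)."""
    # Drop surrounding whitespace and a leading diff marker, if any.
    row = row.strip().lstrip("+").strip()
    # Unwrap <nowiki> and hide its pipes so they are not taken as cell borders.
    protected = NOWIKI.sub(lambda m: m.group(1).replace("|", PLACEHOLDER), row)
    cells = [c.strip().replace(PLACEHOLDER, "|")
             for c in protected.strip("|").split("|")]
    # FEATS is a "|"-separated list of name=value pairs, or "_" when empty.
    feats = {}
    if cells[5] != "_":
        feats = dict(pair.split("=", 1) for pair in cells[5].split("|"))
    return Token(int(cells[0]), cells[1], cells[2], cells[3], cells[4],
                 feats, int(cells[6]), cells[7])


root = parse_row(
    "| 26 | داد | <nowiki>داد#ده</nowiki> | V | ACT | "
    "<nowiki>person=3|attachment=ISO|number=SING|tma=GS|senID=23472</nowiki>"
    " | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |"
)
print(root.id, root.head, root.deprel, root.feats["tma"])
```

Hiding the pipes before splitting matters because the FEATS column uses `|` as its own separator; a plain split on `|` would break that cell into several pieces.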
  
 ==== Parsing ====