CoNLL Format

A simple column-based format used to store treebanks for the CoNLL shared task on dependency parsing. The treebanks we have in this format are listed on the Data page.

Each line corresponds to one word of the original text; sentences are separated by an empty line. Each line holds a fixed, predefined number of values (columns) separated by tabs. These are the values of the individual attributes of the word. A more detailed description of the format can be found, for example, at http://depparse.uvt.nl/depparse-wiki/DataFormat.
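
The layout described above (one token per line, tab-separated columns, an empty line between sentences) can be read with a few lines of Python. This is only a minimal sketch: the function name read_conll is hypothetical, and the code assumes that the input contains no comment lines and that every column value is kept as a plain string.

```python
from typing import Iterable, Iterator, List


def read_conll(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield sentences; each sentence is a list of tokens,
    and each token is the list of its column values."""
    sentence: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # An empty line closes the current sentence (if any).
            if sentence:
                yield sentence
                sentence = []
            continue
        # Columns are separated by tabs.
        sentence.append(line.split("\t"))
    # The last sentence may not be followed by an empty line.
    if sentence:
        yield sentence
```

Because the function only iterates over lines, an open file object can be passed directly, for example read_conll(open(path, encoding="utf-8")), where path is whatever treebank file is being read.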

CoNLL 2009 Format

Warning! Besides the new columns required for the extended task (semantic role labeling), the old columns also changed compared to 2006 and 2007! The new format is described at http://ufal.mff.cuni.cz/conll2009-st/task-description.html#Dataformat. The following table compares the two formats: CoNLL 2006 columns are on the left, CoNLL 2009 columns on the right.

Field number	Field name 2006	Field name 2009
1	ID	ID
2	FORM	FORM
3	LEMMA	LEMMA
4	CPOSTAG	PLEMMA
5	POSTAG	POS
6	FEATS	PPOS
7	HEAD	FEAT
8	DEPREL	PFEAT
9	PHEAD	HEAD
10	PDEPREL	PHEAD
11		DEPREL
12		PDEPREL
13		FILLPRED
14		PRED
15		APREDs
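
To make the column shift concrete, the sketch below moves one token row from the 2006 layout into the 2009 layout. It is an illustration under stated assumptions, not an official converter: the function name convert_2006_to_2009 is hypothetical, CPOSTAG is dropped because 2009 has no matching column, all predicted and predicate columns are filled with underscores, the 2006 PHEAD and PDEPREL values are not copied because they mean something different in 2009, and the variable-width APREDs columns are omitted.

```python
from typing import List

COLUMNS_2006 = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG", "FEATS",
                "HEAD", "DEPREL", "PHEAD", "PDEPREL"]
COLUMNS_2009 = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT",
                "PFEAT", "HEAD", "PHEAD", "DEPREL", "PDEPREL",
                "FILLPRED", "PRED"]


def convert_2006_to_2009(row: List[str]) -> List[str]:
    """Map one 2006-style token row onto the first 14 columns of 2009."""
    if len(row) != len(COLUMNS_2006):
        raise ValueError("expected %d columns, got %d"
                         % (len(COLUMNS_2006), len(row)))
    old = dict(zip(COLUMNS_2006, row))
    # Assumption: anything we cannot fill is marked as unavailable.
    new = {name: "_" for name in COLUMNS_2009}
    # Columns that keep their name and meaning.
    for name in ("ID", "FORM", "LEMMA", "HEAD", "DEPREL"):
        new[name] = old[name]
    # Columns that were renamed between the two formats.
    new["POS"] = old["POSTAG"]
    new["FEAT"] = old["FEATS"]
    return [new[name] for name in COLUMNS_2009]
```

Note that HEAD moves from the 7th to the 9th column and DEPREL from the 8th to the 11th, so code that addresses columns by position has to be adjusted when switching formats.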

The following table explains what the individual fields mean.

Field name	Description
ID	Token counter, starting at 1 for each new sentence.
FORM	Word form or punctuation symbol.
LEMMA	Lemma or stem (depending on particular data set) of word form, or an underscore if not available.
CPOSTAG	Coarse-grained part-of-speech tag, where tagset depends on the language.
POSTAG, POS	Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.
FEATS, FEAT	Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.
HEAD	Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with a HEAD of zero.
DEPREL	Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.
PLEMMA, PPOS, PFEAT	2009: Automatically predicted values of LEMMA, POS and FEAT.
PHEAD	2006: Projective head of the current token, which is either a value of ID or zero ('0'), or an underscore if not available. Note that depending on the original treebank annotation, there may be multiple tokens with a PHEAD of zero. The dependency structure resulting from the PHEAD column is guaranteed to be projective (but is not available for all languages), whereas the structures resulting from the HEAD column will be non-projective for some sentences of some languages (but are always available). 2009: PHEAD contains the automatically predicted value of HEAD!
PDEPREL	2006: Dependency relation to the PHEAD, or an underscore if not available. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'. 2009: PDEPREL contains the automatically predicted value of DEPREL!
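
The HEAD and PHEAD descriptions distinguish projective from non-projective structures. As a rough illustration, the sketch below tests projectivity using one common formulation: an artificial root is placed at position 0 and the structure counts as projective if no two dependency arcs cross. The function name is_projective is hypothetical, and the input is assumed to be the list of integer HEAD values of one sentence in token order.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """heads[i] is the HEAD value of the token with ID i + 1;
    0 stands for the artificial root."""
    # Represent each arc by its left and right end position.
    arcs = [tuple(sorted((head, dep)))
            for dep, head in enumerate(heads, start=1)]
    for left1, right1 in arcs:
        for left2, right2 in arcs:
            # Two arcs cross if exactly one end of the second arc
            # lies strictly inside the first arc.
            if left1 < left2 < right1 < right2:
                return False
    return True
```

For example, a three-token sentence with HEAD values 2, 0, 2 has no crossing arcs, whereas HEAD values 3, 0, 2 make the arc between tokens 1 and 3 cross the arc from the root to token 2. The pairwise comparison is quadratic in sentence length, which is usually acceptable for a sanity check.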

Conversions to and from other formats
