{{template>Šablona:Infobox resource | name = Talbanken05 | owner = zeman | path = /fs/clip-corpora/conll/swedish | version = 2006 }}

Talbanken05 is the Swedish treebank used in the CoNLL 2006 shared task. There are 11,042 training sentences, 191,467 training words, 389 test sentences, and 5,656 test words. The average training sentence is 17 words long, the average test sentence 15 words. The data come from mixed domains.

=====Parser adaptation data split=====

For the experiments in [[Parser adaptation]], we save the CoNLL test data for the final evaluation and cut a development test set from the original training data. We split the original set into 10,700 training sentences and 342 development test sentences. The development test sentences are not visible to the parsers during any phase of training. So, for instance, if a reranker asks for “development data” (which is in fact held-out data used for tuning its weights, i.e. for learning), that data has to be cut from the 10,700 training sentences. We can perform two final evaluations:
  - the parser trained exactly the same way as during development;
  - the parser trained on the union of the training and the development test data.

=====Data preparation=====

The data is in ''/fs/clip-corpora/conll/swedish''. This section describes data preparation for the [[parser adaptation]] experiment. The original training data (to be split into our training and our development test sets) is called ''otrain'', our training data is called ''train'', the development test data is called ''dtest'', and the final evaluation data is called ''etest''.

====Convert the treebank from the CoNLL format to [[CSTS]]====

  $PARSINGROOT/tools/conll2csts.pl -l sv < otrain.conll > otrain.csts

====Normalize trees====

Transform the treebank so that it conforms to the treebanking guidelines used in the other treebanks. At the same time, convert the morphological tags to the part-of-speech tagset of the Penn Treebank.

  $PARSINGROOT/tools/normalize_swedish_csts_trees.pl < otrain.csts > otrain.normalized.csts

The normalization and the new tags can be viewed in Tred, if desired. To do that, we need to convert the normalized data to the [[FS format]] (because Tred does not allow CSTS encoded in UTF-8). This step is optional.

  $PARSINGROOT/tools/cstsfs.pl < otrain.normalized.csts > otrain.normalized.fs
  /fs/nlp/Programs/tred/tred otrain.normalized.fs

====Convert dependencies to constituents====

The flattest possible structure is created. The constituent labels (nonterminals) are derived from the part-of-speech tags of the heads and then translated to the repertory of the Penn Treebank.

  $PARSINGROOT/tools/csts2penn.pl otrain.normalized.csts > otrain.penn

====Split the data====

First do the steps described above separately for ''otrain'' and ''etest''. We do not split ''otrain'' into ''train'' and ''dtest'' earlier because until now we have not had one sentence per line (and the splitting is much easier once we have it).

  head -10700 otrain.penn > train.penn
  tail -342 otrain.penn > dtest.penn
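To make sure the split is sane, the line counts can be checked (the Penn files have one sentence per line, which is what the splitting above relies on):

  wc -l otrain.penn train.penn dtest.penn

The counts should come out as 11,042, 10,700 and 342; indeed 10,700 + 342 = 11,042, the number of training sentences in the original CoNLL data.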
====Get plain text of test data====

We need plain text as input to the parser.

  $PARSINGROOT/tools/penn2text.pl < dtest.penn > dtest.txt
  $PARSINGROOT/tools/penn2text.pl < etest.penn > etest.txt

=====Parsing experiments=====

====Train and test Charniak====

  $PARSINGROOT/charniak-parser/scripts/train.pl < train.penn > train.ecdata.tgz
  $PARSINGROOT/charniak-parser/scripts/parse.pl -g train.ecdata.tgz < dtest.txt | \
    $PARSINGROOT/tools/pennquotes2ascii.pl > dtest.ec.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ec.penn

P = 72.37 %, R = 73.68 %, F = 73.02 %. Evaluated 317 sentences.

====Learning curve====

We repeat the experiment with various training data sizes to see the relation between training data size and parsing accuracy. The smaller data sets are always cut from the beginning of the training data, e.g.:

  foreach i (50 100 250 500 1000 2500 5000 10700)
    head -$i train.penn > train.$i.penn
    $PARSINGROOT/charniak-parser/scripts/train.pl < train.$i.penn > train.$i.ecdata.tgz
    $PARSINGROOT/charniak-parser/scripts/parse.pl -g train.$i.ecdata.tgz < dtest.txt > dtest.ec.$i.penn
    $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ec.$i.penn > result.$i.txt
  end

The results are summarized in the following table:

| sentences | 10,700 (full) | 5,000 | 2,500 | 1,000 | 500 | 250 | 100 | 50 |
| precision | 72.37 | 69.74 | 67.51 | 63.65 | 59.23 | 54.61 | 44.22 | 42.72 |
| recall | 73.68 | 70.48 | 67.93 | 63.63 | 58.94 | 53.48 | 41.55 | 40.28 |
| F | 73.02 | 70.11 | 67.72 | 63.64 | 59.08 | 54.03 | 42.84 | 41.46 |

To see how much the F-score on the smaller data sets depends on where the data have been taken from, we also run a modified version of the experiment, with ''tail'' instead of ''head'':

  foreach i (50 100 250 500 1000 2500 5000 10700)
    tail -$i train.penn > train.tail.$i.penn
    $PARSINGROOT/charniak-parser/scripts/train.pl < train.tail.$i.penn > train.tail.$i.ecdata.tgz
    $PARSINGROOT/charniak-parser/scripts/parse.pl -g train.tail.$i.ecdata.tgz < dtest.txt > dtest.ec.tail.$i.penn
    $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ec.tail.$i.penn > result.tail.$i.txt
  end

The results are summarized in the following table:

| sentences | 10,700 (full) | 5,000 | 2,500 | 1,000 | 500 | 250 | 100 | 50 |
| precision | 72.37 | 70.36 | 68.62 | 66.48 | 61.24 | 55.38 | 46.64 | 38.82 |
| recall | 73.68 | 71.39 | 69.35 | 66.89 | 61.75 | 55.29 | 45.36 | 37.67 |
| F | 73.02 | 70.87 | 68.98 | 66.69 | 61.49 | 55.33 | 45.99 | 38.23 |

====Brown====

For a long time, we were not able to train the Brown reranking parser on Swedish because of a bug in the treebank normalization script: it did not encode < as &lt; on output. The bug was fixed in March 2007 and training now works:

  nohup nice $PARSINGROOT/brown-reranking-parser/scripts/train.pl -nick sv -reuse \
    < train.penn \
    > train.ecmj.tgz

Parse the Swedish test data using the model ''train.ecmj.tgz'' just trained.

  nohup nice $PARSINGROOT/brown-reranking-parser/scripts/parse.pl -g train.ecmj.tgz \
    < dtest.txt \
    > dtest.ecmj.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ecmj.penn

P = 73.22 %, R = 72.58 %, F = 72.90 %, T = 49.17 %. Evaluated 317 sentences.

**It is strange that the reranker did not help. I currently have no explanation.**

=====Delexicalization=====

By //delexicalization// we mean replacing words by their morphological tags. We need it for the [[parser adaptation]] experiments. After delexicalization, the Swedish morphological tags will be the terminals, while the preterminals will still contain the simpler Penn-style tags.
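For illustration (the tree and the tags below are made up, not taken from the treebank), a lexicalized tree such as

  (S (NP (NN hunden)) (VP (VB springer)))

becomes after delexicalization something like

  (S (NP (NN NC.UTR.SIN.DEF)) (VP (VB V.PRS.AKT)))

where the terminals are whatever morphological tags the chosen tag source provides, and the preterminals ''NN'' and ''VB'' remain Penn-style tags.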
The situation is different from [[Danish Dependency Treebank#Delexicalization|Danish]] in that the manually assigned Swedish tags are too coarse. That's why we first tag the corpus statistically (the tagger below assigns Hajič-style tags; see the note on tag sources in the next subsection; ''<...>'' in the commands stands for the CSTS elements being manipulated):

  perl -pe 's/<...>_[^<]*//' < otrain.iso.csts > otrain.iso.nomorph.csts
  ~zeman/nastroje/taggery/hajic-sv/2006-11-08/SE061108x TG \
    otrain.iso.nomorph.csts \
    otrain.hajic.iso.csts
  iconv -f iso-8859-1 -t utf8 < otrain.hajic.iso.csts > otrain.hajic.csts
  perl -pe 's/<...[^>]*>[^<]*//g; s/<...([^>]*)>/<$1>/g' \
    < otrain.hajic.csts \
    > otrain.hajic1.csts
  perl -pe 's/(\.+<...>\.+)N[^<]+/$1FE-------/' \
    < otrain.hajic1.csts \
    > otrain.hajic2.csts
  $PARSINGROOT/tools/normalize_and_delexicalize_swedish_csts_trees.pl \
    < otrain.hajic2.csts \
    > otrain.delex.csts
  cstsfs.pl < otrain.delex.csts > otrain.delex.fs
  $PARSINGROOT/tools/csts2penn.pl otrain.delex.csts > otrain.delex.penn
  head -10700 otrain.delex.penn > train.delex.penn
  tail -342 otrain.delex.penn > dtest.delex.penn
  $PARSINGROOT/tools/penn2text.pl < dtest.delex.penn > dtest.delex.txt

====Parsing delexicalized treebank====

**Note:** There are multiple possible sources of the morphological tags that replace the word forms in delexicalized nodes:
  * Original Mamba tags: manually assigned and coarse-grained, although some of the information exceeds the usual resolution (e.g. the fine-grained categorization of punctuation marks).
  * Manual tags converted from Mamba to the Hajič tag set. Some information (e.g. the punctuation categorization) gets lost.
  * Statistically assigned Hajič tags.
The results in this section refer to experiments with the statistically assigned Hajič tags.

Train Charniak on the delexicalized data, parse the delexicalized test data, and evaluate the restuffed trees.

  $PARSINGROOT/charniak-parser/scripts/train.pl < train.delex.penn > train.delex.ec.tgz
  $PARSINGROOT/charniak-parser/scripts/parse.pl -g train.delex.ec.tgz < dtest.delex.txt \
    | ~/projekty/stanford/tools/restuff.pl -s dtest.txt \
    > dtest.delex.ec.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ec.penn

Evaluation (evalb, sentences of 40 or fewer tokens): P = 66.94 %, R = 67.77 %, F = 67.35 %, T = 35.57 %. Evaluated 317 sentences.

The probable reason why the results are significantly worse than for Danish is that we are using automatically assigned tags instead of the gold standard.

===Brown===

  nohup nice $PARSINGROOT/brown-reranking-parser/scripts/train.pl -nick sv-delex -reuse \
    < train.delex.penn \
    > train.delex.ecmj.tgz

Evaluation (evalb, sentences of 40 or fewer tokens): P = 68.67 %, R = 68.87 %, F = 68.77 %, T = 35.57 %. Evaluated 317 sentences.
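The ''restuff.pl'' step above puts the original word forms back into the parsed delexicalized trees, so that evalb can compare them with the lexicalized gold trees in ''dtest.penn''. The following is only an illustrative sketch of that idea (it is not the actual ''restuff.pl'', and its interface is simplified to one positional argument); it assumes one bracketed tree per input line, one tokenized sentence per line in the text file, and that the parser preserves the number and order of tokens:

  #!/usr/bin/perl
  # Sketch: read delexicalized trees from STDIN and the original tokenized
  # sentences from the file given as the first argument; replace the i-th
  # terminal of each tree by the i-th word of the corresponding sentence.
  use strict;
  use warnings;
  open(my $TXT, '<', $ARGV[0]) or die "Cannot read $ARGV[0]: $!\n";
  while (my $tree = <STDIN>)
  {
      my @words = split(' ', scalar(<$TXT>));
      my $i = 0;
      # A terminal is the token between a preterminal and its closing bracket.
      $tree =~ s/(\(\S+\s+)([^()\s]+)(\))/$1 . $words[$i++] . $3/ge;
      print $tree;
  }

Such a script would sit at the same place in the pipeline as ''restuff.pl'', e.g. ''parse.pl ... | restuff_sketch.pl dtest.txt > dtest.delex.ec.penn'' (the name ''restuff_sketch.pl'' is made up here).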
====Preterminal Mamba tags, terminal Hajič tags====

Merge the original (Mamba-tagged) CSTS with the statistically tagged one, so that each node carries both its Mamba tag and its Hajič tag (again, ''<...>'' stands for the CSTS elements being manipulated):

  perl -e '
    open(H, "otrain.hajic2.csts");
    open(O, "otrain.csts");
    while(<H>)
    {
        $o = <O>;
        $o =~ m/<...>([^<]*)/;
        $ot = $1;
        $ot =~ s/\t.*//;
        s/<...>/<...>$ot\t/;
        print;
    }' > otrain.mamba+hajic2.csts
  $PARSINGROOT/tools/normalize_and_delexicalize_swedish_csts_trees.pl \
    < otrain.mamba+hajic2.csts \
    > otrain.delex.csts
  $PARSINGROOT/tools/csts2penn.pl otrain.delex.csts > otrain.delex.penn
  head -10700 otrain.delex.penn > train.delex.penn
  tail -342 otrain.delex.penn > dtest.delex.penn
  $PARSINGROOT/tools/penn2text.pl < dtest.delex.penn > dtest.delex.txt
  $PARSINGROOT/charniak-parser/scripts/train.pl < train.delex.penn > train.delex.ec.tgz
  $PARSINGROOT/charniak-parser/scripts/parse.pl -g train.delex.ec.tgz < dtest.delex.txt \
    | ~/projekty/stanford/tools/restuff.pl -s dtest.txt \
    > dtest.delex.ec.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ec.penn

Evaluation (evalb, sentences of 40 or fewer tokens): P = 67.22 %, R = 68.48 %, F = 67.84 %, T = 45.26 %. Evaluated 276 sentences.
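All evaluations on this page read P, R, and F off the evalb summary by hand. When there are many reports, e.g. the ''result.$i.txt'' files produced by the learning-curve loops above, the F-scores can be collected with a loop like the one below. This is only a sketch: it assumes evalb's usual summary labels, where the second ''Bracketing FMeasure'' line belongs to the sentences of at most 40 tokens, which appear to be the figures reported on this page:

  foreach i (50 100 250 500 1000 2500 5000 10700)
    echo $i `grep 'Bracketing FMeasure' result.$i.txt | tail -1`
  end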