
Institute of Formal and Applied Linguistics Wiki


Page ''user:zeman:self-training'', created 2008/07/08 17:53 by zeman (transferred from the CLIP wiki and converted from MediaWiki to DokuWiki); current revision 2008/07/09 16:43 by zeman.
This page describes an experiment conducted by [[User:Zeman:start|Dan Zeman]] in November and December 2006.
  
I am trying to repeat the experiment of David McClosky, Eugene Charniak, and Mark Johnson ([[http://www.cog.brown.edu/~mj/papers/naacl06-self-train.pdf|NAACL 2006, New York]]) with self-training a parser. The idea is that you train a parser on small data, run it over big data, retrain it on its own output for the big data, and get a better-performing parser. The folks at Brown University used Charniak's reranking parser, i.e. a parser-reranker sequence. The big data was parsed by the whole reranking parser but only the first-stage parser was retrained on it. The reranker only saw the small data.
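The loop above can be sketched in a few lines of Python, with the two parser stages stubbed out as plain functions (''train'' and ''parse'' here are illustrative stand-ins, not the actual Perl scripts used later on this page):

```python
# A minimal sketch of the self-training recipe. `train` and `parse` are
# stubs standing in for the real Charniak/Brown tools invoked below.

def train(treebank):
    """Stub: fit a first-stage parsing model on a list of trees."""
    return {"trained_on": list(treebank)}

def parse(model, sentences, rerank=False):
    """Stub: emit one automatically produced tree per input sentence."""
    return ["(S %s)" % s for s in sentences]

small_gold = ["(S small-1)", "(S small-2)"]   # hand-annotated, e.g. PTB WSJ 02-21
big_raw = ["big-1", "big-2", "big-3"]         # unlabeled, e.g. NANTC LATWP

p0 = train(small_gold)                        # baseline parser P0
big_parsed = parse(p0, big_raw, rerank=True)  # big data parsed by the full reranking parser
p1 = train(small_gold * 5 + big_parsed)       # only the first stage is retrained;
                                              # the reranker still sees only the small data
print(len(p1["trained_on"]))                  # 5*2 + 3 = 13 training trees
```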
  
Once the original self-training experiment works as expected, we are going to use a similar scheme for [[Parser Adaptation|parser adaptation]] to a new language.
Note: I am going to move around some stuff, especially that in my home folder.
  
  * ''$PARSINGROOT'' - working copy of the parsers and related scripts. See [[:parsery|Parsing]] on how to create your own.
  * ''/fs/clip-corpora/ptb/processed'' - [[Penn Treebank]] (referred to as ''$PTB'')
  * ''/fs/clip-corpora/north_american_news'' - [[North American News Text Corpus]], including everything I made of it
</code>
  
| Section | 22 || 23 ||
| Parser | Charniak | Brown | Charniak | Brown |
| Precision | 90.54 | 92.81 | 90.43 | 92.35 |
| Recall | 90.43 | 91.92 | 90.21 | 91.61 |
| F-score | 90.48 | 92.36 | 90.32 | 91.98 |
| Tagging | 96.15 | 92.41 | 96.78 | 92.33 |
| Crossing | 0.66 | 0.49 | 0.72 | 0.59 |
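As a quick sanity check on the table, the F-score rows are just the harmonic mean of the precision and recall rows:

```python
# F-score as the harmonic mean of precision and recall, checked against
# the section-22 columns of the table above.

def f_score(p, r):
    return 2 * p * r / (p + r)

print(round(f_score(90.54, 90.43), 2))  # 90.48 (Charniak, section 22)
print(round(f_score(92.81, 91.92), 2))  # 92.36 (Brown, section 22)
```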
  
  
  
See [[North American News Text Corpus]] for more information on the data and its preparation.
  
=====Parsing NANTC using P<sub>0</sub>=====
  
-See [[Parsers|here]] for more information on the Brown Reranking Parser. We parsed the LATWP part of NANTC on the C cluster using the following command:+See [[:Parsery|here]] for more information on the Brown Reranking Parser. We parsed the LATWP part of NANTC on the C cluster using the following command:
  
<code>
cd /fs/clip-corpora/north_american_news
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -l en -g wsj.tgz < latwp.04.clean.txt \
    -o latwp.05a.brown.penn -w workdir05 -k
</code>
=====Retraining the first-stage parser=====
  
The following command trains the Charniak parser on 5 copies of sections 02-21 of the Penn Treebank Wall Street Journal, and 1 copy of the parsed part of NANTC (<html><span style="background:yellow">3,143,433</span></html> sentences).
  
<code>
$PARSINGROOT/charniak-parser/scripts/train.pl ptbwsj02-21.5times.penn latwp.05a.brown.penn > ptb+latwp3000.tgz
</code>
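For scale: WSJ sections 02-21 contain about 39,832 sentences (a standard figure for that split, not stated on this page), so even duplicated five times the gold treebank remains a small slice of the combined training data:

```python
# Rough proportion of gold vs. machine-parsed sentences in the mixed
# training set. The WSJ 02-21 sentence count (39,832) is an outside
# assumption; the NANTC count (3,143,433) is the figure quoted above.

wsj_sents = 39_832
nantc_sents = 3_143_433

gold = 5 * wsj_sents
total = gold + nantc_sents
print(f"{gold / total:.1%} of training sentences are gold WSJ")  # about 6.0%
```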
  
<code>
$PARSINGROOT/charniak-parser/scripts/parse.pl -g ptb+latwp3000.tgz \
    < $PTB/ptbwsj22.txt \
    > ptbwsj22.ec.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.ptb+latwp3000.penn

$PARSINGROOT/charniak-parser/scripts/parse.pl -g ptb+latwp3000.tgz \
    < $PTB/ptbwsj23.txt \
    > ptbwsj23.ec.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.ptb+latwp3000.penn
</code>
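evalb scores a parse by comparing labeled bracket spans against the gold trees. In miniature (toy spans, plain sets instead of evalb's multisets, and none of the equivalences configured in charniak.prm):

```python
# Toy version of the PARSEVAL metric that evalb computes: labeled
# precision/recall over (label, start, end) spans. The real evalb uses
# multisets and a parameter file (here charniak.prm) for label/tag
# equivalences; this sketch ignores both.

def parseval(gold, test):
    correct = len(gold & test)
    precision = correct / len(test)
    recall = correct / len(gold)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
test = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)}  # one mislabeled span

p, r, f = parseval(gold, test)
print(p, r, f)  # 0.75 0.75 0.75
```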
  
| Section | 22 | 23 |
| Precision | 87.74 | 88.26 |
| Recall | 88.65 | 88.54 |
| F-score | 88.19 | 88.40 |
| Tagging | 92.67 | 92.84 |
| Crossing | 0.80 | 0.91 |
  
  
<code>
$PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl ptb+latwp3000.tgz wsj.tgz \
    > ptb+latwp3000.brown.tgz
</code>
  
<code>
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp3000.brown.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.br.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.br.ptb+latwp3000.penn

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp3000.brown.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.br.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.br.ptb+latwp3000.penn
</code>
  
| Section | 22 | 23 |
| Precision | 90.39 | 90.68 |
| Recall | 90.30 | 90.24 |
| F-score | 90.34 | 90.46 |
| Tagging | 93.43 | 93.65 |
| Crossing | 0.61 | 0.71 |
  
  
  
<code>
head -1750000 latwp.05a.brown.penn > latwp.1750k.brown.penn
</code>
  
  
<code>
$PARSINGROOT/charniak-parser/scripts/train.pl ptbwsj02-21.5times.penn latwp.1750k.brown.penn > ptb+latwp1750.tgz
</code>
  
<code>
$PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl ptb+latwp1750.tgz wsj.tgz \
    > ptb+latwp1750.brown.tgz
</code>
  
<code>
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.ec.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.ptb+latwp1750.penn

$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.ec.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.ptb+latwp1750.penn

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.br.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.br.ptb+latwp1750.penn

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.br.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.br.ptb+latwp1750.penn
</code>
  
<code>
$PARSINGROOT/charniak-parser/scripts/train.pl < ptbwsj02-21.5times.penn > 5ptb.tgz
</code>
  
<code>
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g 5ptb.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.ec.5ptb.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.5ptb.penn

$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g 5ptb.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.ec.5ptb.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.5ptb.penn
</code>
Remember that "Charniak parser" means without the reranker and "Brown parser" means with the reranker. PTB WSJ (or WSJ) in training means sections 02-21. 50k NANTC means 50,000 sentences of NANTC LATWP.
  
| Parsing NANTC using || Parsing test using || Section ||||
|  ||  || 22 || 23 ||
| parser | trained on | parser | trained on | McClosky | Zeman | McClosky | Zeman |
|  |  | Stanford | PTB WSJ |  |  |  | 86.5 |
|  |  | Charniak | PTB WSJ | 90.3 | 90.5 | 89.7 | 90.3 |
| Brown | PTB WSJ | Charniak | WSJ + 50k NANTC | 90.7 |  |  |  |
| Brown | PTB WSJ | Charniak | WSJ + 250k NANTC | 90.7 | 91.0 |  | 90.9 |
| Brown | PTB WSJ | Charniak | WSJ + 500k NANTC | 90.9 |  |  |  |
| Brown | PTB WSJ | Charniak | WSJ + 750k NANTC | 91.0 |  |  |  |
| Brown | PTB WSJ | Charniak | WSJ + 1000k NANTC | 90.8 |  |  |  |
| Brown | PTB WSJ | Charniak | WSJ + 1500k NANTC | 90.8 |  |  |  |
| Brown | PTB WSJ | Charniak | WSJ + 2000k NANTC | 91.0 |  |  |  |
| Brown | PTB WSJ | Charniak | 5 × WSJ | 84.7 |  |  |  |
| Brown | PTB WSJ | Charniak | 5 × WSJ + 1750k NANTC |  | 87.6 | 91.0 | 87.9 |
| Brown | PTB WSJ | Charniak | 5 × WSJ + 3143k NANTC |  | 88.2 |  | 88.4 |
|  |  | Brown | PTB WSJ |  | 92.4 | 91.3 | 92.0 |
| Brown | PTB WSJ | Brown | WSJ + 50k NANTC | 92.4 |  |  |  |
| Brown | PTB WSJ | Brown | WSJ + 250k NANTC | 92.3 | 92.2 |  | 92.3 |
| Brown | PTB WSJ | Brown | WSJ + 500k NANTC | 92.4 |  |  |  |
| Brown | PTB WSJ | Brown | WSJ + 750k NANTC | 92.4 |  |  |  |
| Brown | PTB WSJ | Brown | WSJ + 1000k NANTC | 92.2 |  |  |  |
| Brown | PTB WSJ | Brown | WSJ + 1500k NANTC | 92.1 |  |  |  |
| Brown | PTB WSJ | Brown | WSJ + 2000k NANTC | 92.0 |  |  |  |
| Brown | PTB WSJ | Brown | 5 × WSJ + 1750k NANTC |  | 89.9 | 92.1 | 90.0 |
| Brown | PTB WSJ | Brown | 5 × WSJ + 3143k NANTC |  | 90.3 |  | 90.5 |
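One way to read the Zeman columns of the table: on section 23 the Charniak parser improves from 90.3 F (trained on PTB WSJ only) to 90.9 F with WSJ + 250k NANTC, a modest relative error reduction:

```python
# Relative error reduction for the section-23 Zeman column of the table:
# Charniak trained on PTB WSJ (90.3 F) vs. WSJ + 250k NANTC (90.9 F).

baseline, self_trained = 90.3, 90.9
reduction = (self_trained - baseline) / (100 - baseline)
print(f"{reduction:.1%} relative error reduction")  # about 6.2%
```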
  
