Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

pml-haters [2007/05/02 04:37]
bojar
pml-haters [2007/06/05 06:35] (current)
bojar (formatting only)
Line 3: Line 3:
 (I will try to write this page in English. I think it could also be useful to an international audience if PML catches on.) (I will try to write this page in English. I think it could also be useful to an international audience if PML catches on.)
  
-Inspired by [[http://www.simson.net/ref/ugh.pdf|The UNIX-HATERS Handbook]], I am starting this wiki page to collect tips on some basic operations with the not-so-basic PML data format. Please answer any of my unanswered questions and feel free to open new questions.
+Inspired by [[http://www.simson.net/ref/ugh.pdf|The UNIX-HATERS Handbook]] (please don't feel offended by the title of this page or the book; read the book's preface and anti-foreword), I am starting this wiki page to collect tips on some basic operations with the not-so-basic PML data format. Please answer any of my unanswered questions and feel free to open new questions.
  
 Links to additional tools are at the bottom of the page. Links to additional tools are at the bottom of the page.
 +
 +I strongly recommend [[http://www.jucs.org/jucs_3_9/why_we_need_an/Prechelt_L.pdf|Lutz Prechelt: Why We Need an Explicit Forum for Negative Results]] (esp. section 2.6) for an explanation of why not speaking about weaknesses hurts.
  
 ===== In Spite of some Common Assumptions... ===== ===== In Spite of some Common Assumptions... =====
Line 16: Line 18:
 ===== Validation ===== ===== Validation =====
  
-Given a PML file, how do I validate it? I always forget... Please provide me with the one-liner to do the validation.
+Given a PML file, how do I validate it?
  
 +For most purposes, a libxml2 (DOM) based validator
 +<code>/f/common/exec/validate_pml --pml-dir /f/common/share/pml --path /f/common/share/tred file_to_validate</code> should work fine and fast.
  
 +For huge files, use <code>/f/common/exec/validate_pml_stream --path /f/common/share/tred file_to_validate</code> which is based on Jing (SAX); Jing has no Zlib or stdin support, so some space in /tmp will be needed for temporary files.
 +Both scripts have decent user documentation. Look inside the scripts if you are interested in the implementation details.
  
 ===== XSH Won't Work: Blame XML Namespaces ===== ===== XSH Won't Work: Blame XML Namespaces =====
Line 39: Line 45:
 The reason why ''cd tdata'' won't work is a badly specified XML namespace. This works: The reason why ''cd tdata'' won't work is a badly specified XML namespace. This works:
  
-<code>$f/>regns pml http://ufal.mff.cuni.cz/pdt/pml/
+<code>$f/>regns pml http://ufal.mff.cuni.cz/pdt/pml/;
 $f/>cd pml:tdata $f/>cd pml:tdata
 $f/pml:tdata> $f/pml:tdata>
 </code> </code>
  
-You will have to write the ''pml:'' prefix before every tag name in every XPath!
-
-Most probably you'll still face problems when accessing attributes of XML elements, because namespacing rules apply differently to attributes and elements. I hate XML and will never stop hating it!
-
-
+Hint: add the regns command to your ~/.xsh2rc.
+
+You will have to write the ''pml:'' prefix before every tag name in every XPath!
+
+Most probably you'll still face problems when accessing attributes of XML elements, because namespacing rules apply differently to attributes and elements. You'll need to read the XML Namespaces specification.
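
The element/attribute asymmetry mentioned above can be seen in a few lines of Python. This is only a minimal sketch using the standard library; the sample document is a hypothetical, heavily simplified PML-like fragment (real t-files contain much more), and the ''id'' values are invented for illustration.

```python
import xml.etree.ElementTree as ET

PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"

# Hypothetical, simplified PML-like fragment (not a complete PML file).
doc = f"""<tdata xmlns="{PML_NS}">
  <trees>
    <LM id="t-1"/>
    <LM id="t-2"/>
  </trees>
</tdata>"""

root = ET.fromstring(doc)
ns = {"pml": PML_NS}

# Elements inherit the default namespace, so an unprefixed path finds nothing.
unprefixed = root.find("trees")

# With the prefix mapped to the PML namespace, the lookup works.
trees = root.find("pml:trees", ns)
lms = trees.findall("pml:LM", ns)

# Unprefixed attributes are in NO namespace: the default namespace
# declaration does not apply to them, so they are read without a prefix.
ids = [lm.get("id") for lm in lms]
namespaced_id = lms[0].get(f"{{{PML_NS}}}id")

print(unprefixed, ids, namespaced_id)
```

In XPath terms this corresponds to writing ''pml:LM'' for the element but plain ''@id'' for the attribute.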
  
 ===== Number of Sentences ===== ===== Number of Sentences =====
Line 60: Line 64:
 This XPath would quickly give you the number of sentences: This XPath would quickly give you the number of sentences:
  
-<code>count(tdata/trees/LM)
+<code>count(pml:tdata/pml:trees/pml:LM)
 </code> </code>
  
Line 69: Line 73:
 <code>cat file.t.xml \ <code>cat file.t.xml \
 | xsh -I - -C "regns pml http://ufal.mff.cuni.cz/pdt/pml/; count(pml:tdata/pml:trees/pml:LM)" 2>/dev/null | xsh -I - -C "regns pml http://ufal.mff.cuni.cz/pdt/pml/; count(pml:tdata/pml:trees/pml:LM)" 2>/dev/null
-<code>
+</code>
 + 
 +or just the following, if you have the regns command in your ~/.xsh2rc: 
 + 
 +<code>cat file.t.xml \ 
 +| xsh -I - -C "count(pml:tdata/pml:trees/pml:LM)" 2>/dev/null 
 +</code>
  
 LT XML's sggrep allows a shorter notation: LT XML's sggrep allows a shorter notation:
Line 76: Line 86:
 </code> </code>
  
-The performance is comparable, on about 60k sentences (in about 8 files) the tools needed:
-<code>        LT XML       XSH
-real    0m50.541s    0m58.882s
-user    0m55.828s    0m50.744s
-sys     0m4.284s     0m7.867s
+The performance is comparable, on about 60k sentences (in about 8 gzipped files) the tools needed:
+
+<code>        LT XML       XSH           compare with 'wc -l' if we got rid of XML
+real    0m50.541s    0m58.882s     0m1.371s
+user    0m55.828s    0m50.744s     0m1.470s
+sys     0m4.284s     0m7.867s      0m0.250s
+</code>
 + 
 +Here is a one-liner in Perl that does not load the whole file into memory: 
 + 
 +<code>perl -MXML::LibXML::Reader -e 'my $r=XML::LibXML::Reader->new(location=>shift); $r->nextElement("trees"); $d=$r->depth; $r->read; while ($d<$r->depth) {$i++; $r->nextSibling} print $i,"\n"' file.t.xml
 </code> </code>
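
For comparison, the same streaming idea can be sketched in Python with the standard library's incremental parser. This is an illustration under assumptions, not a replacement for the tools above: the function name ''count_sentences'' and the sample data are made up, and only ''LM'' elements directly under ''trees'' are counted, on the assumption that nested ''LM'' elements are list members inside a tree rather than sentences.

```python
import io
import xml.etree.ElementTree as ET

PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"
TREES_TAG = f"{{{PML_NS}}}trees"
LM_TAG = f"{{{PML_NS}}}LM"


def count_sentences(source):
    """Count LM elements that are direct children of <trees>.

    `source` is a file name or a binary file object. The content of each
    finished top-level LM is discarded, so whole trees do not pile up in memory.
    """
    count = 0
    open_tags = []  # stack of tags of the currently open elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_tags.append(elem.tag)
        else:
            open_tags.pop()
            # After the pop, the top of the stack is the parent's tag.
            if elem.tag == LM_TAG and open_tags and open_tags[-1] == TREES_TAG:
                count += 1
                elem.clear()  # drop the subtree of the finished sentence
    return count


# Hypothetical sample: two sentences, the first with a nested LM list member.
sample = f"""<tdata xmlns="{PML_NS}">
  <trees>
    <LM id="a"><children><LM id="a1"/></children></LM>
    <LM id="b"/>
  </trees>
</tdata>"""

print(count_sentences(io.BytesIO(sample.encode("utf-8"))))
```

Since ''iterparse'' accepts any binary file object, a gzipped file could presumably be passed as ''gzip.open("file.t.xml.gz", "rb")'' without unpacking it first.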
  
Line 87: Line 103:
 Let's assume there is a bug in a script (a bug? impossible!) that handles a suite of files (file-w.xml, file-m.xml, file-a.xml, file-t.xml) containing annotation of some 5000 sentences. I know the bug occurs in sentence 345. Let's assume there is a bug in a script (a bug? impossible!) that handles a suite of files (file-w.xml, file-m.xml, file-a.xml, file-t.xml) containing annotation of some 5000 sentences. I know the bug occurs in sentence 345.
  
-How do I create a suite of files with just the problematic sentence 345, i.e. files test-w.xml, test-m.xml, test-a.xml and test-t.xml, all properly referenced?
+How do I create a suite of files with just the problematic sentence 345, i.e. files test-w.xml, test-m.xml, test-a.xml and test-t.xml, all properly referenced? An XML-Reader based script by Petr Pajas demonstrates that:
  
 +<code>~pajas/projects/pml/tools/separate_t_tree.pl file-t.xml 345
 +</code>
 +
 +Creating such a suite is problematic because links can exist from sentence 345 to previous sentences (from t-layer to a-layer for elided words, within t-layer for coreference). The above-mentioned script does not take this issue into account.
  
 ===== Links to Useful Tools ===== ===== Links to Useful Tools =====
Line 95: Line 115:
  
 [[http://www.ltg.ed.ac.uk/software/xml/|LT XML]] - tools like sggrep, sgcount, knit... for handling SGML files on command-line, **SAX-based**, i.e. can handle big files [[http://www.ltg.ed.ac.uk/software/xml/|LT XML]] - tools like sggrep, sgcount, knit... for handling SGML files on command-line, **SAX-based**, i.e. can handle big files
- 
