Differences

This shows you the differences between two versions of the page.

--- pml-haters [2007/05/02 04:37]
bojar
+++ pml-haters [2007/06/05 06:35] (current)
bojar formatting only
@@ Line 3: / Line 3: @@
 (I'll try to write this page in English. I think it could also be useful to an international audience if we succeed with PML.)
-Inspired by [[http://www.simson.net/ref/ugh.pdf|Unix Hater's Handbook]], I am starting this wiki page to collect tips on some basic operations with the not-so-basic PML data format. Please answer any of my unanswered questions and feel free to open new questions.
+Inspired by [[http://www.simson.net/ref/ugh.pdf|Unix Hater's Handbook]] (please don't feel offended by the title of this page or the book; read the book's preface and anti-foreword), I am starting this wiki page to collect tips on some basic operations with the not-so-basic PML data format. Please answer any of my unanswered questions and feel free to open new questions.
 Links to additional tools are at the bottom of the page.
+I strongly recommend [[http://www.jucs.org/jucs_3_9/why_we_need_an/Prechelt_L.pdf|Lutz Prechelt: Why We Need an Explicit Forum for Negative Results]] (esp. section 2.6) for an explanation of why not speaking about weaknesses hurts.
 ===== In Spite of some Common Assumptions... =====
@@ Line 16: / Line 18: @@
 ===== Validation =====
-Given a PML file, how do I validate it? I always forget... Please provide me with the one-liner to do the validation.
+Given a PML file, how do I validate it?
+For most purposes, a libxml2 (DOM) based validator
+<code>/f/common/exec/validate_pml --pml-dir /f/common/share/pml --path /f/common/share/tred file_to_validate</code>should work fine and fast.
+For huge files, use<code>/f/common/exec/validate_pml_stream --path /f/common/share/tred file_to_validate</code>which is based on Jing (SAX); Jing has no Zlib or stdin support, so some space in /tmp will be needed for temporary files.
+Both scripts have decent user documentation. See inside the scripts if interested in the implementation details.
 ===== XSH Won't Work: Blame XML Namespaces =====
@@ Line 39: / Line 45: @@
 The reason why ''cd tdata'' won't work is a badly specified XML namespace. This works:
-<code>$f/>regns pml http://ufal.mff.cuni.cz/pdt/pml/
+<code>$f/>regns pml http://ufal.mff.cuni.cz/pdt/pml/;
 $f/>cd pml:tdata
 $f/pml:tdata>
 </code>
-You will have to write the ''pml:'' prefix before every tag name in every XPath!
+Hint: add the regns command to your ~/.xsh2rc.
-Most probably you'll still face problems when accessing attributes of XML elements, because namespacing rules apply differently to attributes and elements. I hate XML and will never stop hating it!
+You will have to write the ''pml:'' prefix before every tag name in every XPath!
+Most probably you'll still face problems when accessing attributes of XML elements, because namespacing rules apply differently to attributes and elements. You'll need to read the XML Namespaces specification.
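The element/attribute asymmetry can be seen outside XSH as well. The following is a minimal Python sketch, not part of the original page: it assumes a tiny made-up document (the ''id'' values are invented) that declares the PML namespace as the default namespace. Elements inherit that namespace and need the prefix, while unprefixed attributes are in no namespace and are looked up by their bare name.

```python
import xml.etree.ElementTree as ET

PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"

# Made-up sample: default namespace on the root, unprefixed id attributes.
SAMPLE = f'<tdata xmlns="{PML_NS}"><trees><LM id="t1"/><LM id="t2"/></trees></tdata>'

root = ET.fromstring(SAMPLE)

# Elements inherit the default namespace, so bare names match nothing.
unqualified = root.findall("trees/LM")
qualified = root.findall("pml:trees/pml:LM", {"pml": PML_NS})

# Unprefixed attributes are in no namespace: the bare name works,
# the namespace-qualified name does not.
first = qualified[0]
bare_attr = first.get("id")
qualified_attr = first.get(f"{{{PML_NS}}}id")

print(len(unqualified), len(qualified))
print(bare_attr, qualified_attr)
```

In XPath terms this corresponds to writing ''pml:LM/@id'' rather than ''pml:LM/@pml:id''.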
 ===== Number of Sentences =====
@@ Line 60: / Line 64: @@
 This XPath would quickly give you the number of sentences:
-<code>count(tdata/trees/LM)
+<code>count(pml:tdata/pml:trees/pml:LM)
 </code>
@@ Line 69: / Line 73: @@
 <code>cat file.t.xml \
 | xsh -I - -C "regns pml http://ufal.mff.cuni.cz/pdt/pml/; count(pml:tdata/pml:trees/pml:LM)" 2>/dev/null
-<code>
+</code>
+or just the following, if you have the regns command in your ~/.xsh2rc:
+<code>cat file.t.xml \
+| xsh -I - -C "count(pml:tdata/pml:trees/pml:LM)" 2>/dev/null
+</code>
 LT XML's sggrep allows a shorter notation:
@@ Line 76: / Line 86: @@
 </code>
-The performance is comparable: on about 60k sentences (in about 8 files), the tools needed:
+The performance is comparable: on about 60k sentences (in about 8 gzipped files), the tools needed:
-<code>        LT XML       XSH
-real    0m50.541s    0m58.882s
-user    0m55.828s    0m50.744s
-sys     0m4.284s     0m7.867s
+<code>        LT XML       XSH           compare with 'wc -l' if we got rid of XML
+real    0m50.541s    0m58.882s     0m1.371s
+user    0m55.828s    0m50.744s     0m1.470s
+sys     0m4.284s     0m7.867s      0m0.250s
+</code>
+Here is a one-liner in Perl that does not load the whole file into memory:
+<code>perl -MXML::LibXML::Reader -e 'my $r=XML::LibXML::Reader->new(location=>shift); $r->nextElement("trees"); $d=$r->depth; $r->read; while ($d<$r->depth) {$i++; $r->nextSibling} print $i,"\n"' file.t.xml
 </code>
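For comparison, here is a hedged Python sketch of the same streaming idea, using only the standard library. It is not from the original page; it assumes that sentences are the ''LM'' elements directly under ''trees'' in the PML namespace, and that nested ''LM'' elements (tree nodes) must not be counted. The sample document and the function name ''count_sentences'' are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"


def count_sentences(source):
    """Count <LM> children of <trees> without keeping whole trees in memory."""
    trees_tag = f"{{{PML_NS}}}trees"
    lm_tag = f"{{{PML_NS}}}LM"
    count = 0
    stack = []  # tags of the currently open elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            continue
        stack.pop()  # after the pop, the top of the stack is the parent
        # A sentence is an LM whose parent is <trees>.
        if elem.tag == lm_tag and stack and stack[-1] == trees_tag:
            count += 1
            elem.clear()  # drop the finished tree's contents
    return count


# Made-up sample: two sentences, the first with one nested node.
SAMPLE = (
    f'<tdata xmlns="{PML_NS}"><trees>'
    '<LM id="t1"><children><LM id="t1a"/></children></LM>'
    '<LM id="t2"/>'
    '</trees></tdata>'
)

result = count_sentences(io.BytesIO(SAMPLE.encode("utf-8")))
print(result)
```

''count_sentences'' accepts a file name or a binary file object, so a gzipped file could presumably be passed in via ''gzip.open(path, "rb")''.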
@@ Line 87: / Line 103: @@
 Let's assume there is a bug in a script (a bug? impossible!) that handles a suite of files (file-w.xml, file-m.xml, file-a.xml, file-t.xml) containing annotation of some 5000 sentences. I know the bug occurs in sentence 345.
-How do I create a suite of files with just the problematic sentence 345, i.e. files test-w.xml, test-m.xml, test-a.xml and test-t.xml, all properly referenced?
+How do I create a suite of files with just the problematic sentence 345, i.e. files test-w.xml, test-m.xml, test-a.xml and test-t.xml, all properly referenced? An XML-Reader-based script by Petr Pajas demonstrates that:
+<code>~pajas/projects/pml/tools/separate_t_tree.pl file-t.xml 345
+</code>
+Creating such a suite is problematic because links can lead from sentence 345 to previous sentences (from the t-layer to the a-layer for elided words, within the t-layer for coreference). The above-mentioned script does not take this issue into account.
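One way to see whether a given sentence is affected is to list the references that leave its tree. The sketch below is an illustration only and is not how ''separate_t_tree.pl'' works. It assumes that references are stored in elements whose names end in ''.rf'', either as direct text or as nested ''LM'' items, and that nodes carry an ''id'' attribute. The helper name ''outgoing_references'', the sample document and all ids are hypothetical.

```python
import xml.etree.ElementTree as ET

PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"


def outgoing_references(tree_elem):
    """Return reference targets that do not resolve inside tree_elem."""
    local_ids = {e.get("id") for e in tree_elem.iter() if e.get("id")}
    outgoing = set()
    for elem in tree_elem.iter():
        # Assumption: references live in elements named "*.rf".
        if not elem.tag.endswith(".rf"):
            continue
        # The target is the element's own text or the text of nested items.
        for node in elem.iter():
            target = (node.text or "").strip()
            if target and target not in local_ids:
                outgoing.add(target)
    return outgoing


# Made-up sample: the second sentence refers back to a node of the first.
SAMPLE = (
    f'<tdata xmlns="{PML_NS}"><trees>'
    '<LM id="t-s1"><children><LM id="t-s1-n1"/></children></LM>'
    '<LM id="t-s2"><children>'
    '<LM id="t-s2-n1"><coref_text.rf><LM>t-s1-n1</LM></coref_text.rf></LM>'
    '<LM id="t-s2-n2"><compl.rf>t-s2-n1</compl.rf></LM>'
    '</children></LM>'
    '</trees></tdata>'
)

root = ET.fromstring(SAMPLE)
sentences = root.findall("pml:trees/pml:LM", {"pml": PML_NS})
dangling = outgoing_references(sentences[1])
print(sorted(dangling))
```

Under these assumptions, references into other files (e.g. from the t-layer to the a-layer) would also be reported, since their targets are not ids of the extracted tree.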
 ===== Links to Useful Tools =====
@@ Line 95: / Line 115: @@
 [[http://www.ltg.ed.ac.uk/software/xml/|LT XML]] - tools like sggrep, sgcount, knit... for handling SGML files on command-line, **SAX-based**, i.e. can handle big files


Institute of Formal and Applied Linguistics Wiki
