Differences

This shows you the differences between two versions of the page.

--- pml-haters [2007/05/02 02:57]
bojar created
+++ pml-haters [2007/06/05 06:35] (current)
bojar formatting only
@@ Line 3: / Line 3: @@
 (I will try to write this page in English. I think it could also be useful to an international audience if PML catches on.)
-Inspired by [[http://www.simson.net/ref/ugh.pdf|Unix Hater's Handbook]], I am starting this wiki page to collect tips on some basic operations with the not-so-basic PML data format. Please answer any of my unanswered questions and feel free to open new questions.
+Inspired by [[http://www.simson.net/ref/ugh.pdf|Unix Hater's Handbook]] (please don't feel offended by the title of this page or the book; read the book's preface and anti-foreword), I am starting this wiki page to collect tips on some basic operations with the not-so-basic PML data format. Please answer any of my unanswered questions and feel free to open new questions.
 Links to additional tools are at the bottom of the page.
+I strongly recommend [[http://www.jucs.org/jucs_3_9/why_we_need_an/Prechelt_L.pdf|Lutz Prechelt: Why We Need an Explicit Forum for Negative Results]] (esp. section 2.6) for an explanation of why not speaking about weaknesses hurts.
 ===== In Spite of some Common Assumptions... =====
@@ Line 12: / Line 14: @@
 Please strongly prefer SAX-based tools to DOM-based tools.
 ===== Validation =====
-Given a PML file, how do I validate it? I always forget... Please provide me with the one-liner to do the validation.
+Given a PML file, how do I validate it?
+For most purposes, a libxml2 (DOM) based validator
+<code>/f/common/exec/validate_pml --pml-dir /f/common/share/pml --path /f/common/share/tred file_to_validate</code> should work well and quickly.
+For huge files, use <code>/f/common/exec/validate_pml_stream --path /f/common/share/tred file_to_validate</code> which is based on Jing (SAX); Jing supports neither Zlib nor stdin, so some space in /tmp will be needed for temporary files.
+Both scripts have decent user documentation. See inside the scripts if interested in the implementation details.
+===== XSH Won't Work: Blame XML Namespaces =====
+Say we have a ''file.t.xml'' and we want to browse it in XSH.
+<code>xsh -i
+$scratch/> $f := open "file.t.xml"
+$f/> ls
+<?xml version='1.0' encoding='utf-8'?>
+<tdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">...</tdata>
+Found 1 node(s).
+$f/>cd tdata
+Trying to change current node to an undefined value
+ at <STDIN> line 1, column 8,
+$f/>
+</code>
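+The reason why ''cd tdata'' won't work is the XML namespace: the document declares a default namespace, and an unprefixed name in an XPath expression does not match elements in it. This works: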
+<code>$f/>regns pml http://ufal.mff.cuni.cz/pdt/pml/;
+$f/>cd pml:tdata
+$f/pml:tdata>
+</code>
+Hint: add the regns command to your ~/.xsh2rc.
+You will have to write the ''pml:'' prefix before every tag name in every XPath!
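+Most probably you'll still face problems when accessing attributes of XML elements, because namespace rules apply differently to attributes than to elements: a default namespace declaration does not apply to attributes, so an unprefixed attribute is in no namespace. For the details you'll need to read the XML Namespaces specification.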
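The prefix rule is not specific to XSH. The following is a minimal sketch, assuming Python's standard ''xml.etree.ElementTree'' instead of XSH; the miniature document and its ''id'' values are made up for illustration and are far simpler than a real t-layer file. It shows that an unprefixed path matches nothing, that the same path with a bound prefix matches, and that an unprefixed attribute is read without any prefix.

```python
import xml.etree.ElementTree as ET

PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"

# Hypothetical miniature document; real t-layer files are far richer.
SAMPLE = (
    '<tdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">'
    '<trees><LM id="t-s1"/><LM id="t-s2"/></trees>'
    '</tdata>'
)

root = ET.fromstring(SAMPLE)
ns = {"pml": PML_NS}  # plays the role of "regns pml ..." in XSH

# Without the prefix nothing matches: the elements live in the PML namespace.
unqualified = root.findall("trees/LM")

# With the prefix bound to the namespace URI, the same path matches.
qualified = root.findall("pml:trees/pml:LM", ns)

# Unprefixed attributes are in no namespace, so they take no prefix.
ids = [lm.get("id") for lm in qualified]

print(len(unqualified), len(qualified), ids)
```

The choice of ''pml'' as the prefix is arbitrary; only the namespace URI has to agree with the one declared in the file.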
 ===== Number of Sentences =====
 Given a PML file (say t-layer), how do I count the number of sentences in the file?
+Some design decisions (and I would call them bad decisions) in PML make this simple question challenging. Each sentence is stored in an ''<LM>...</LM>'' element, but the same element is used for lists of nodes' children, too. So you have to count while **being aware of the XML structure**.
+This XPath would quickly give you the number of sentences:
+<code>count(pml:tdata/pml:trees/pml:LM)
+</code>
+...if only there were an interpreter that would not need to load and parse the whole file.
+This is how to use XSH on the command line to evaluate the query:
+<code>cat file.t.xml \
+| xsh -I - -C "regns pml http://ufal.mff.cuni.cz/pdt/pml/; count(pml:tdata/pml:trees/pml:LM)" 2>/dev/null
+</code>
+or just the following, if you have the regns command in your ~/.xsh2rc:
+<code>cat file.t.xml \
+| xsh -I - -C "count(pml:tdata/pml:trees/pml:LM)" 2>/dev/null
+</code>
+LT XML's sggrep allows a shorter notation:
+<code>cat file.t.xml | sggrep '/tdata/trees/LM' | grep '^<LM' | wc -l
+</code>
+The performance is comparable. On about 60k sentences (in about 8 gzipped files) the tools needed:
+<code>        LT XML       XSH           compare with 'wc -l' if we got rid of XML
+real    0m50.541s    0m58.882s     0m1.371s
+user    0m55.828s    0m50.744s     0m1.470s
+sys     0m4.284s     0m7.867s      0m0.250s
+</code>
+Here is a one-liner in Perl that does not load the whole file into memory:
+<code>perl -MXML::LibXML::Reader -e 'my $r=XML::LibXML::Reader->new(location=>shift); $r->nextElement("trees"); $d=$r->depth; $r->read; while ($d<$r->depth) {$i++; $r->nextSibling} print $i,"\n"' file.t.xml
+</code>
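The same streaming idea can be sketched in Python. This is only an illustration under stated assumptions, not one of the tools discussed on this page: it uses the standard ''xml.etree.ElementTree.iterparse'', the function name ''count_sentences'' is hypothetical, and the sample document is made up. It counts only those ''LM'' elements whose parent chain is exactly ''tdata/trees'', mirroring the XPath above, and clears each finished sentence so that its subtree is not kept in memory.

```python
import io
import xml.etree.ElementTree as ET

PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"
TDATA = "{%s}tdata" % PML_NS
TREES = "{%s}trees" % PML_NS
LM = "{%s}LM" % PML_NS


def count_sentences(source):
    """Count <LM> elements matching pml:tdata/pml:trees/pml:LM.

    `source` is a file name or a binary file object.
    """
    count = 0
    stack = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            continue
        stack.pop()  # remove the element that just ended; stack now holds its ancestors
        if elem.tag == LM and stack == [TDATA, TREES]:
            count += 1
            elem.clear()  # discard the finished sentence subtree
    return count


# Hypothetical miniature document: two sentences, each with nested LM lists.
SAMPLE = (
    '<tdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/"><trees>'
    '<LM id="s1"><children><LM id="n1"/><LM id="n2"/></children></LM>'
    '<LM id="s2"><children><LM id="n3"/></children></LM>'
    '</trees></tdata>'
)

print(count_sentences(io.BytesIO(SAMPLE.encode("utf-8"))))
```

Because ''iterparse'' accepts any binary file object, a gzipped file can be passed as ''gzip.open(path, "rb")'' without unpacking it to disk first. The emptied ''LM'' elements stay attached to ''trees'', so memory still grows slightly with the number of sentences.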
 ===== Restricting a Suite of PML Files to Contain only a Specific Sentence =====
@@ Line 25: / Line 103: @@
 Let's assume there is a bug in a script (a bug? impossible!) that handles a suite of files (file-w.xml, file-m.xml, file-a.xml, file-t.xml) containing annotation of some 5000 sentences. I know the bug occurs in sentence 345.
-How do I create a suite of files with just the problematic sentence 345, i.e. files test-w.xml, test-m.xml, test-a.xml and test-t.xml, all properly referenced?
+How do I create a suite of files with just the problematic sentence 345, i.e. files test-w.xml, test-m.xml, test-a.xml and test-t.xml, all properly referenced? An XML-Reader based script by Petr Pajas demonstrates this:
+<code>~pajas/projects/pml/tools/separate_t_tree.pl file-t.xml 345
+</code>
+Creating such a suite is problematic because there may be links from sentence 345 to previous sentences (from t-layer to a-layer for elided words, within t-layer for coreference). The above-mentioned script does not take this issue into account.
 ===== Links to Useful Tools =====
-[[http://www.ltg.ed.ac.uk/software/xml/|LT XML]] - tools like sggrep, sgcount, knit... for handling SGML files on command-line
+[[http://xsh.sourceforge.net/| XSH]] - XML editing shell by Petr Pajas. **DOM-based**, i.e. it reads in the **whole file!**
+[[http://www.ltg.ed.ac.uk/software/xml/|LT XML]] - tools like sggrep, sgcount, knit... for handling SGML files on the command line, **SAX-based**, i.e. they can handle big files

Institute of Formal and Applied Linguistics Wiki
