
PML Haters' Guide

(I will try to write this page in English. I think it could be useful to an international audience as well, if PML catches on.)

Inspired by The UNIX-HATERS Handbook (please don't feel offended by the title of this page or the book; read the book's preface and anti-foreword), I am starting this wiki page to collect tips on some basic operations with the not-so-basic PML data format. Please answer any of my unanswered questions and feel free to open new questions.

Links to additional tools are at the bottom of the page.

I strongly recommend Lutz Prechelt: Why We Need an Explicit Forum for Negative Results (esp. section 2.6) for an explanation of why not speaking about weaknesses hurts.

In Spite of some Common Assumptions...

Unlike most researchers at UFAL, I need to handle big collections of sentences, so my PML files are big too, e.g. several thousand sentences in a file.

Please strongly prefer SAX-based tools to DOM-based tools.

Validation

Given a PML file, how do I validate it?

For most purposes, a libxml2 (DOM) based validator

/f/common/exec/validate_pml --pml-dir /f/common/share/pml --path /f/common/share/tred file_to_validate

should work fine and fast.

For huge files, use

/f/common/exec/validate_pml_stream --path /f/common/share/tred file_to_validate

which is based on Jing (SAX); Jing has no Zlib or stdin support, so some space in /tmp will be needed for temporary files.
Both scripts have decent user documentation. See inside the scripts if interested in the implementation details.

XSH Won't Work: Blame XML Namespaces

Say we have a file.t.xml and we want to browse it in XSH.

xsh -i
$scratch/> $f := open "file.t.xml"
$f/> ls
<?xml version='1.0' encoding='utf-8'?>
<tdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">...</tdata>

Found 1 node(s).
$f/>cd tdata
Trying to change current node to an undefined value
 at <STDIN> line 1, column 8,
$f/>

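The reason why cd tdata won't work is the XML namespace: tdata is declared in the PML default namespace, while an unprefixed name in XPath only matches elements that are in no namespace. Registering a prefix for the namespace works: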

$f/>regns pml http://ufal.mff.cuni.cz/pdt/pml/;
$f/>cd pml:tdata
$f/pml:tdata>

Hint: add the regns command to your ~/.xsh2rc.

You will have to write the pml: prefix before every tag name in every XPath!

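Most probably you'll still face problems when accessing attributes of XML elements, because namespacing rules apply differently to attributes and elements: an unprefixed attribute is in no namespace even when its element is in a default namespace, so it is addressed without the pml: prefix. For the details you'll need to read the Namespaces in XML specification.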
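The element/attribute difference is easier to see outside XSH. Here is a minimal sketch in Python (standard library only); the two-sentence document and its ids are made up for illustration and are far simpler than a real t-layer file:

```python
import xml.etree.ElementTree as ET

PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"

# Hypothetical miniature document; the ids are invented for this example.
SAMPLE = (
    '<tdata xmlns="' + PML_NS + '">'
    '<trees><LM id="t-s1"/><LM id="t-s2"/></trees>'
    '</tdata>'
)

root = ET.fromstring(SAMPLE)
ns = {"pml": PML_NS}

# Unprefixed element names address "no namespace", so nothing matches.
unprefixed = root.findall("trees/LM")

# With a prefix bound to the PML namespace, the same path matches.
prefixed = root.findall("pml:trees/pml:LM", ns)

# The default namespace does not apply to attributes: "id" stays unqualified.
ids = [lm.get("id") for lm in prefixed]

# Asking for the attribute inside the PML namespace finds nothing.
qualified_ids = [lm.get("{" + PML_NS + "}id") for lm in prefixed]

print(len(unprefixed), len(prefixed), ids, qualified_ids)
```

So the rule of thumb is: prefix every element name, but leave plain attributes such as id unprefixed.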

Number of Sentences

Given a PML file (say t-layer), how do I count the number of sentences in the file?

Some design decisions (and I would call them bad decisions) in PML make this simple question challenging. Each sentence is stored in an <LM>…</LM> element, but the same element is used for lists of nodes' children, too. So you have to count while being aware of the XML structure.

This XPath would quickly give you the number of sentences:

count(pml:tdata/pml:trees/pml:LM)

…if only there were an interpreter that would not need to load and parse the whole file.

This is how to use XSH on the command line to evaluate the query:

cat file.t.xml \
| xsh -I - -C "regns pml http://ufal.mff.cuni.cz/pdt/pml/; count(pml:tdata/pml:trees/pml:LM)" 2>/dev/null

or just the following, if you have the regns command in your ~/.xsh2rc:

cat file.t.xml \
| xsh -I - -C "count(pml:tdata/pml:trees/pml:LM)" 2>/dev/null

LT XML's sggrep allows a shorter notation:

cat file.t.xml | sggrep '/tdata/trees/LM' | grep '^<LM' | wc -l

The performance is comparable. On about 60k sentences (in about 8 gzipped files) the tools needed:

        LT XML       XSH           plain 'wc -l' (for comparison, if we got rid of XML)
real    0m50.541s    0m58.882s     0m1.371s
user    0m55.828s    0m50.744s     0m1.470s
sys     0m4.284s     0m7.867s      0m0.250s

Here is a one-liner in Perl that does not load the whole file into memory:

perl -MXML::LibXML::Reader -e 'my $r=XML::LibXML::Reader->new(location=>shift); $r->nextElement("trees"); $d=$r->depth; $r->read; while ($d<$r->depth) {$i++; $r->nextSibling} print $i,"\n"' file.t.xml
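For those who prefer Python, the same idea can be sketched with the standard library SAX parser. This is only an illustration, not one of the tools on /f/common; the class and function names and the sample document are made up. It keeps just the stack of open element names in memory and counts an LM only when it sits directly under tdata/trees, so the LM elements used for children lists are ignored:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"


class SentenceCounter(ContentHandler):
    """Counts LM elements that are direct children of tdata/trees."""

    def __init__(self):
        super().__init__()
        self.path = []   # local names of the currently open elements
        self.count = 0

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        # Elements outside the PML namespace can never complete the path.
        self.path.append(local if uri == PML_NS else None)
        if self.path == ["tdata", "trees", "LM"]:
            self.count += 1

    def endElementNS(self, name, qname):
        self.path.pop()


def count_sentences(source):
    """source may be a file name or an open binary file object."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = SentenceCounter()
    parser.setContentHandler(handler)
    parser.parse(source)
    return handler.count


# Hypothetical miniature t-file: two sentences, the first with two child nodes.
SAMPLE = (
    '<tdata xmlns="' + PML_NS + '"><head/><trees>'
    '<LM id="s1"><children><LM id="n1"/><LM id="n2"/></children></LM>'
    '<LM id="s2"/>'
    '</trees></tdata>'
).encode("utf-8")

print(count_sentences(io.BytesIO(SAMPLE)))
```

Because count_sentences accepts any binary file object, a gzipped file can be passed as gzip.open("file.t.xml.gz", "rb") without unpacking it to /tmp first.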

Restricting a Suite of PML Files to Contain only a Specific Sentence

Let's assume there is a bug in a script (a bug? impossible!) that handles a suite of files (file-w.xml, file-m.xml, file-a.xml, file-t.xml) containing annotation of some 5000 sentences. I know the bug occurs in sentence 345.

How do I create a suite of files with just the problematic sentence 345, i.e. files test-w.xml, test-m.xml, test-a.xml and test-t.xml, all properly referenced? An XML-Reader-based script by Petr Pajas demonstrates that:

~pajas/projects/pml/tools/separate_t_tree.pl file-t.xml 345

Creating such a suite is problematic because there can be links from sentence 345 to previous sentences (from t-layer to a-layer for elided words, within t-layer for coreference). The above-mentioned script does not take this issue into account.
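To at least see which links leave the extracted sentence, one can list them. The following Python sketch is not part of separate_t_tree.pl and rests on assumptions that should be checked against the actual PML schema: that references live in elements whose names end in .rf (either as text or as LM children), that a value with a "prefix#" part points into another file, and that ids are stored in id attributes. The function name and the sample ids are invented:

```python
import xml.etree.ElementTree as ET

PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"


def local(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def outgoing_references(sentence):
    """Return (dangling, cross_file) reference values for one sentence element.

    Assumption: references are stored in elements named '*.rf'.
    """
    ids = {el.get("id") for el in sentence.iter() if el.get("id")}
    dangling, cross_file = [], []
    for el in sentence.iter():
        if not local(el.tag).endswith(".rf"):
            continue
        # A reference is either the element's own text or a list of LM children.
        values = [el.text] + [child.text for child in el]
        for value in values:
            value = (value or "").strip()
            if not value:
                continue
            if "#" in value:
                # Points into another file; must be checked against that file.
                cross_file.append(value)
            elif value not in ids:
                # Same-file target that is not inside this sentence.
                dangling.append(value)
    return dangling, cross_file


# Hypothetical sentence with one link to the a-layer, one coreference link
# into the previous sentence and one link within the sentence itself.
SAMPLE = (
    '<LM xmlns="' + PML_NS + '" id="t-s345">'
    '<children>'
    '<LM id="t-s345-n1">'
    '<a><lex.rf>a#a-s345-w1</lex.rf></a>'
    '<coref_text.rf><LM>t-s344-n7</LM></coref_text.rf>'
    '<coref_gram.rf><LM>t-s345-n2</LM></coref_gram.rf>'
    '</LM>'
    '<LM id="t-s345-n2"/>'
    '</children>'
    '</LM>'
)

dangling, cross_file = outgoing_references(ET.fromstring(SAMPLE))
print("dangling within t-layer:", dangling)
print("to be checked in other layers:", cross_file)
```

Loading a single sentence into memory like this is harmless; it is only the whole 5000-sentence file that should not go through DOM. The cross-file values still have to be compared with the ids that ended up in test-a.xml to catch links to elided words in previous sentences.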

XSH - XML editing shell by Petr Pajas. DOM-based, i.e. reads in the whole file!

LT XML - tools like sggrep, sgcount, knit… for handling SGML files on the command line. SAX-based, i.e. can handle big files.