PML Haters' Guide
(I will try to write this page in English. I think it could be useful to an international audience as well, if PML catches on.)
Inspired by The UNIX-HATERS Handbook, I am starting this wiki page to collect tips on some basic operations with the not-so-basic PML data format. Please answer any of my unanswered questions and feel free to open new questions.
Links to additional tools are at the bottom of the page.
In Spite of some Common Assumptions...
Unlike most researchers at UFAL, I need to handle big collections of sentences. My PML files are correspondingly big, e.g. several thousand sentences in a file.
Please strongly prefer SAX-based tools to DOM-based tools.
Validation
Given a PML file, how do I validate it? I always forget… Please provide me with the one-liner to do the validation.
XSH Won't Work: I HATE XML NAMESPACES
Say we have a file.t.xml and we want to browse it in XSH:
  xsh -i
  $scratch/> $f := open "file.t.xml"
  $f/> ls
  <?xml version='1.0' encoding='utf-8'?>
  <tdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">...</tdata>
  Found 1 node(s).
  $f/>cd tdata
  Trying to change current node to an undefined value at <STDIN> line 1, column 8,
  $f/>
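The reason why cd tdata won't work is probably the XML namespace: the root element lives in the default namespace http://ufal.mff.cuni.cz/pdt/pml/, whereas an unprefixed name in an XPath expression matches only elements that are in no namespace. Please let me know how to switch the default namespace in XSH; I have wasted two hours on this already.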
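The same effect can be reproduced outside XSH, which at least confirms the diagnosis. The following is a minimal sketch in Python (standard library only) on a toy document; the inner `trees` element and the prefix `pml` are assumptions made for illustration, only the namespace URI is taken from the session above.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the root element shown in the XSH session.
PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"

# Toy stand-in for file.t.xml; the <trees> child is an assumption.
doc = (
    '<tdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">'
    '<trees><LM id="t1"/></trees>'
    '</tdata>'
)
root = ET.fromstring(doc)

# The parser stores the namespace as part of the element name.
print(root.tag)  # {http://ufal.mff.cuni.cz/pdt/pml/}tdata

# An unprefixed name matches nothing, just like "cd tdata" above.
unqualified = root.find("trees")

# Binding a prefix (any prefix will do) to the namespace makes the lookup work.
qualified = root.find("pml:trees", {"pml": PML_NS})

print(unqualified is None, qualified is not None)  # True True
```

By analogy, the fix in XSH is presumably to bind a prefix to the PML namespace (the XSH2 documentation lists a register-namespace command for this) and then to navigate with prefixed names such as pml:tdata. That part is untested here and should be checked against the installed XSH version.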
Number of Sentences
Given a PML file (say t-layer), how do I count the number of sentences in the file?
Some design decisions (and I would call them bad decisions) in PML make this simple question challenging. Each sentence is stored in a <LM>…</LM> element, but the same element is used for lists of nodes' children, too. So you have to count while being aware of the XML structure.
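One structure-aware way to count is a small SAX handler that keeps a stack of open elements and counts an LM only when its parent is the element that holds the list of sentences. The sketch below assumes that this container is called `trees`; that name is an assumption about the t-layer files and is a parameter, so it can be changed if the files look different. The names `SentenceCounter` and `count_sentences` are made up for this example.

```python
import xml.sax


class SentenceCounter(xml.sax.ContentHandler):
    """Count <LM> elements that are direct children of the container element."""

    def __init__(self, container="trees", item="LM"):
        super().__init__()
        self.container = container  # assumed name of the sentence list
        self.item = item
        self.stack = []             # names of the currently open elements
        self.count = 0

    def startElement(self, name, attrs):
        # Only an LM directly under the container is a sentence;
        # LMs deeper in the tree are lists of child nodes.
        if name == self.item and self.stack and self.stack[-1] == self.container:
            self.count += 1
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()


def count_sentences(source, container="trees"):
    """source is a file name or a binary file-like object."""
    handler = SentenceCounter(container)
    xml.sax.parse(source, handler)
    return handler.count
```

Usage would be something like `count_sentences("file.t.xml")`. Because the file is streamed and only the stack of open element names is kept, memory use should not grow with the number of sentences.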
Restricting a Suite of PML Files to Contain only a Specific Sentence
Let's assume there is a bug in a script (a bug? impossible!) that handles a suite of files (file-w.xml, file-m.xml, file-a.xml, file-t.xml) containing annotation of some 5000 sentences. I know the bug occurs in sentence 345.
How do I create a suite of files with just the problematic sentence 345, i.e. files test-w.xml, test-m.xml, test-a.xml and test-t.xml, all properly referenced?
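A partial answer, as a sketch only: the SAX filter below copies one file and drops every sentence except the one with the given number. It makes the same assumption as any structure-aware count, namely that sentences are LM elements directly under a container element, here assumed to be called `trees`. It handles a single layer and does not touch cross-layer references: the file references in the header would still have to be renamed to the test-*.xml names, and picking "sentence 345" by position in every layer assumes that all layers list sentences in the same order. The names `KeepOneSentence` and `extract_sentence` are made up for this example.

```python
import xml.sax
from xml.sax.saxutils import XMLGenerator


class KeepOneSentence(xml.sax.ContentHandler):
    """Copy the input to `out`, dropping all sentences except number `keep`."""

    def __init__(self, out, keep, container="trees", item="LM"):
        super().__init__()
        self.gen = XMLGenerator(out, encoding="utf-8")
        self.keep = keep            # 1-based number of the sentence to keep
        self.container = container  # assumed name of the sentence list
        self.item = item
        self.stack = []             # names of the currently open elements
        self.seen = 0               # sentences encountered so far
        self.skip_depth = None      # stack depth at which a dropped sentence began

    def startDocument(self):
        self.gen.startDocument()

    def startElement(self, name, attrs):
        if (self.skip_depth is None and name == self.item
                and self.stack and self.stack[-1] == self.container):
            self.seen += 1
            if self.seen != self.keep:
                self.skip_depth = len(self.stack)
        self.stack.append(name)
        if self.skip_depth is None:
            self.gen.startElement(name, attrs)

    def endElement(self, name):
        self.stack.pop()
        if self.skip_depth is None:
            self.gen.endElement(name)
        elif len(self.stack) == self.skip_depth:
            # The dropped sentence has just been closed; resume copying.
            self.skip_depth = None

    def characters(self, content):
        if self.skip_depth is None:
            self.gen.characters(content)


def extract_sentence(source, out, keep, container="trees"):
    """source: file name or binary stream; out: writable text stream."""
    xml.sax.parse(source, KeepOneSentence(out, keep, container))
```

For the t-layer this could be called as `extract_sentence("file-t.xml", open("test-t.xml", "w", encoding="utf-8"), 345)`. Comments and processing instructions are not copied by this sketch, and the other layers may use a different element for sentences, so the `container` and `item` names would need checking per layer.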