Differences
This shows you the differences between two versions of the page.
pml-haters [2007/05/02 03:07] bojar |
pml-haters [2007/06/05 06:35] (current) bojar jen formatovani |
||
---|---|---|---|
Line 3: | Line 3: | ||
(I will try to write this page in English. I think it could be useful to an international audience as well, if PML catches on.) | (I will try to write this page in English. I think it could be useful to an international audience as well, if PML catches on.) | ||
- | Inspired by [[http:// | + | Inspired by [[http:// |
Links to additional tools are at the bottom of the page. | Links to additional tools are at the bottom of the page. | ||
+ | |||
+ | I strongly recommend [[http:// | ||
===== In Spite of some Common Assumptions... ===== | ===== In Spite of some Common Assumptions... ===== | ||
Line 12: | Line 14: | ||
Please strongly prefer SAX-based tools to DOM-based tools. | Please strongly prefer SAX-based tools to DOM-based tools. | ||
+ | |||
===== Validation ===== | ===== Validation ===== | ||
- | Given a PML file, how do I validate it? I always forget... Please provide me with the one-liner to do the validation. | + | Given a PML file, how do I validate it? |
+ | |||
+ | For most purposes, a libxml2 (DOM) based validator | ||
+ | < | ||
+ | |||
+ | For huge files, use< | ||
+ | Both scripts have decent user documentation. See inside | ||
+ | |||
+ | ===== XSH Won't Work: Blame XML Namespaces ===== | ||
+ | |||
+ | Say we have a '' | ||
+ | |||
+ | < | ||
+ | $scratch/> | ||
+ | $f/> ls | ||
+ | <?xml version=' | ||
+ | <tdata xmlns=" | ||
+ | |||
+ | Found 1 node(s). | ||
+ | $f/>cd tdata | ||
+ | Trying | ||
+ | at < | ||
+ | $f/> | ||
+ | </ | ||
+ | |||
+ | The reason why '' | ||
+ | |||
+ | < | ||
+ | $f/>cd pml:tdata | ||
+ | $f/ | ||
+ | </ | ||
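The same pitfall shows up in any namespace-aware tool, not only XSH. Below is a minimal sketch in Python's standard `xml.etree.ElementTree`, assuming a toy document; the namespace URI `urn:example:pml` and the element names are placeholders for illustration, not the real PML values. It shows that an unqualified name finds nothing once the document declares a default namespace, while a prefix bound to that namespace does.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI, for illustration only (not the real PML URI).
PML_NS = "urn:example:pml"

# A toy document whose root declares a default namespace.
doc = f'<tdata xmlns="{PML_NS}"><trees/></tdata>'
root = ET.fromstring(doc)

# An unqualified name means "no namespace", so nothing matches.
unqualified = root.find("trees")

# Binding a prefix to the namespace (the analogue of regns) makes the lookup work.
qualified = root.find("pml:trees", {"pml": PML_NS})

print(unqualified)      # None
print(qualified.tag)    # the tag in {namespace}localname form
```

The prefix chosen in the query is local to the query: it need not match anything written in the file, only the URI has to agree.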
+ | |||
+ | Hint: add the regns command to your ~/ | ||
+ | |||
+ | You will have to write the '' | ||
+ | |||
+ | Most probably you'll still face problems when accessing attributes of XML elements, because namespace rules apply differently to attributes than to elements: a default namespace declaration covers unprefixed elements but not unprefixed attributes. You'll need to read the XML Namespaces specification. | ||
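The difference between elements and attributes can be seen in a short sketch, again using Python's `xml.etree.ElementTree` on a toy document. The namespace URIs and the attribute names `id` and `x:ref` are assumptions made up for this example. An unprefixed attribute stays in no namespace even when its element inherits the default one; only an explicitly prefixed attribute is namespaced.

```python
import xml.etree.ElementTree as ET

# Placeholder URIs and names, for illustration only.
doc = (
    '<tdata xmlns="urn:example:pml" xmlns:x="urn:example:other" id="t1">'
    '<node x:ref="a1"/>'
    '</tdata>'
)
root = ET.fromstring(doc)
node = root[0]

# Elements pick up the default namespace...
print(root.tag)

# ...but the unprefixed attribute does not: it is looked up by its bare name.
plain_id = root.get("id")
namespaced_id = root.get("{urn:example:pml}id")

# A prefixed attribute lives in the namespace bound to its prefix.
ref = node.get("{urn:example:other}ref")

print(plain_id, namespaced_id, ref)
```

In XPath terms this means the element steps need the registered prefix while plain attributes are addressed without one.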
===== Number of Sentences ===== | ===== Number of Sentences ===== | ||
Given a PML file (say t-layer), how do I count the number of sentences in the file? | Given a PML file (say t-layer), how do I count the number of sentences in the file? | ||
+ | |||
+ | Some design decisions (and I would call them bad decisions) in PML make this simple question challenging. Each sentence is stored in a ''< | ||
+ | |||
+ | This XPath would quickly give you the number of sentences: | ||
+ | |||
+ | < | ||
+ | </ | ||
+ | |||
+ | ...if only there were an interpreter that would not need to load and parse the whole file. | ||
+ | |||
+ | This is how to use XSH on the command line to evaluate the query: | ||
+ | |||
+ | < | ||
+ | | xsh -I - -C "regns pml http:// | ||
+ | </ | ||
+ | |||
+ | or just the following, if you have the regns command in your ~/.xsh2rc: | ||
+ | |||
+ | < | ||
+ | | xsh -I - -C " | ||
+ | </ | ||
+ | |||
+ | LT XML's sggrep allows a shorter notation: | ||
+ | |||
+ | < | ||
+ | </ | ||
+ | |||
+ | The performance is comparable. On about 60k sentences (in about 8 gzipped files), the tools needed: | ||
+ | |||
+ | < | ||
+ | real 0m50.541s | ||
+ | user 0m55.828s | ||
+ | sys | ||
+ | </ | ||
+ | |||
+ | Here is a one-liner in Perl that does not load the whole file into memory: | ||
+ | |||
+ | < | ||
+ | </ | ||
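For comparison, a streaming count can also be sketched in Python with `xml.etree.ElementTree.iterparse`. This is not the Perl one-liner above, only an illustration of the same idea. The namespace URI and the element names `trees` and `LM` are assumptions chosen for the toy input; substitute whatever your files actually use. The sketch counts only direct children of the container element, so an element name that is reused deeper in the tree is not counted twice.

```python
import io
import xml.etree.ElementTree as ET

def count_direct_children(source, parent_tag, child_tag):
    """Count child_tag elements directly under parent_tag while streaming."""
    count = 0
    stack = []  # tags of the currently open elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            stack.pop()  # after the pop, stack[-1] is the parent's tag
            if elem.tag == child_tag and stack and stack[-1] == parent_tag:
                count += 1
            # Discard the finished element's content; only an empty shell
            # stays attached to its parent.
            elem.clear()
    return count

# Toy input with placeholder namespace and element names (assumptions).
NS = "{urn:example:pml}"
sample = (
    b'<tdata xmlns="urn:example:pml"><trees>'
    b'<LM id="s1"><children><LM id="n1"/></children></LM>'
    b'<LM id="s2"/>'
    b'</trees></tdata>'
)
n_sentences = count_direct_children(io.BytesIO(sample), NS + "trees", NS + "LM")
print(n_sentences)  # 2
```

For gzipped input, the file object returned by `gzip.open(path, "rb")` can be passed as `source` in place of the `BytesIO` object.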
===== Restricting a Suite of PML Files to Contain only a Specific Sentence ===== | ===== Restricting a Suite of PML Files to Contain only a Specific Sentence ===== | ||
Line 25: | Line 103: | ||
Let's assume there is a bug in a script (a bug? impossible!) that handles a suite of files (file-w.xml, | Let's assume there is a bug in a script (a bug? impossible!) that handles a suite of files (file-w.xml, | ||
- | How do I create a suite of files with just the problematic sentence 345, i.e. files test-w.xml, test-m.xml, test-a.xml and test-t.xml, all properly referenced? | + | How do I create a suite of files with just the problematic sentence 345, i.e. files test-w.xml, test-m.xml, test-a.xml and test-t.xml, all properly referenced? |
+ | < | ||
+ | </ | ||
+ | |||
+ | Creating such a suite is problematic because sentence 345 may contain links to previous sentences (from the t-layer to the a-layer for elided words, and within the t-layer for coreference). The above-mentioned script does not take this issue into account, so such links may end up pointing at content that is no longer in the restricted files. | ||
===== Links to Useful Tools ===== | ===== Links to Useful Tools ===== | ||
Line 33: | Line 115: | ||
[[http:// | [[http:// | ||
- |