Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

pml-haters [2007/05/25 01:49]
bojar link to a nice article with a broader perspective
pml-haters [2007/05/31 14:43]
pajas
Line 14: Line 14:
  
 Please strongly prefer SAX-based tools to DOM-based tools.
- 
- 
  
 ===== Validation =====
Line 21: Line 19:
 Given a PML file, how do I validate it? I always forget... Please provide me with the one-liner to do the validation.
 
-See [[user:ptacek:tectomt|this snippet]] for some vague hints.
+For most purposes, a libxml2 (DOM) based validator
+<code>/f/common/exec/validate_pml --pml-dir /f/common/share/pml --path /f/common/share/tred file_to_validate</code>should work fine and fast. For huge files, use<code>/f/common/exec/validate_pml_stream --path /f/common/share/tred file_to_validate</code>which is based on Jing (SAX); Jing has no Zlib or stdin support, so some space in /tmp will be needed for temporary files.
+Both scripts have decent user documentation. See inside the scripts if interested in the implementation details.
  
 ===== XSH Won't Work: Blame XML Namespaces =====
Line 42: Line 42:
 The reason why ''cd tdata'' won't work is a badly specified XML namespace. This works:
 
-<code>$f/>regns pml http://ufal.mff.cuni.cz/pdt/pml/
+<code>$f/>regns pml http://ufal.mff.cuni.cz/pdt/pml/;
 $f/>cd pml:tdata
 $f/pml:tdata>
 </code>
+
+Hint: add the regns command to your ~/.xsh2rc.
 
 You will have to write the ''pml:'' prefix before every tag name in every XPath!
 
 Most probably you'll still face problems when accessing attributes of XML elements, because namespacing rules apply differently to attributes and elements. You'll need to read XML (Namespaces) specification.
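
The difference between elements and attributes mentioned above can be illustrated outside XSH. The following is only a sketch in Python's standard library, not part of the wiki page or of the PML tooling; the tiny `SAMPLE` document and its `id` attributes are made up for illustration. It shows that a default namespace declaration applies to element names but not to unprefixed attribute names.

```python
import xml.etree.ElementTree as ET

PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"

# Hypothetical miniature document (an assumption for this sketch);
# real PML files carry far more structure.
SAMPLE = (
    '<tdata xmlns="' + PML_NS + '">'
    '<trees><LM id="t1"/><LM id="t2"/></trees>'
    '</tdata>'
)

root = ET.fromstring(SAMPLE)   # root is the <tdata> element
ns = {"pml": PML_NS}           # prefix mapping, analogous to "regns pml ..."

# Element names live in the default namespace, so unprefixed paths match nothing.
unprefixed = root.findall("trees/LM")
prefixed = root.findall("pml:trees/pml:LM", ns)

# Unprefixed attributes are in NO namespace: look them up by their bare name.
ids = [lm.get("id") for lm in prefixed]
# Qualifying the attribute with the PML namespace finds nothing.
qualified_ids = [lm.get("{" + PML_NS + "}id") for lm in prefixed]

print(len(unprefixed), len(prefixed), ids, qualified_ids)
```

In XPath terms this corresponds to writing ''pml:LM'' for the element but plain ''@id'' for the attribute.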
- 
- 
- 
- 
- 
- 
- 
  
 ===== Number of Sentences =====
Line 66: Line 61:
 This XPath would quickly give you the number of sentences:
 
-<code>count(tdata/trees/LM)
+<code>count(pml:tdata/pml:trees/pml:LM)
 </code>
  
Line 75: Line 70:
 <code>cat file.t.xml \
 | xsh -I - -C "regns pml http://ufal.mff.cuni.cz/pdt/pml/; count(pml:tdata/pml:trees/pml:LM)" 2>/dev/null
+</code>
+
+or just the following, if you have the regns command in your ~/.xsh2rc:
+
+<code>cat file.t.xml \
+| xsh -I - -C "count(pml:tdata/pml:trees/pml:LM)" 2>/dev/null
 </code>
  
Line 90: Line 91:
 </code>
 
+Here is a one-liner in Perl that does not load the whole file into memory:
 
+<code>perl -MXML::LibXML::Reader -e 'my $r=XML::LibXML::Reader->new(location=>shift); $r->nextElement("trees"); $d=$r->depth; $r->read; while ($d<$r->depth) {$i++; $r->nextSibling} print $i,"\n"' file.t.xml
+</code>
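
For readers without XML::LibXML, the same streaming idea can be sketched in Python. This is not the author's script, only an illustration under stated assumptions: the function name `count_trees` and the `SAMPLE` document are hypothetical, and the sketch assumes the sentences are the direct child elements of the first `trees` element in the PML namespace.

```python
import io
import xml.etree.ElementTree as ET

PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"


def count_trees(source):
    """Count the direct child elements of the first <trees> element.

    `source` is a file name or a binary file object. The document is read
    incrementally; parsing stops as soon as </trees> is reached.
    """
    trees_tag = "{" + PML_NS + "}trees"
    depth = 0            # nesting depth of the element currently open
    trees_depth = None   # depth at which <trees> was opened, once seen
    count = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if trees_depth is None:
                if elem.tag == trees_tag:
                    trees_depth = depth
            elif depth == trees_depth + 1:
                count += 1      # a direct child of <trees> was opened
        else:
            if trees_depth is not None and depth == trees_depth:
                break           # </trees> reached, nothing more to count
            depth -= 1
            elem.clear()        # drop the closed element's content
    return count


# Hypothetical sample (an assumption): two top-level trees, one nested LM.
SAMPLE = (
    b'<tdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">'
    b'<meta><schema/></meta>'
    b'<trees>'
    b'<LM id="a"><children><LM id="b"/></children></LM>'
    b'<LM id="c"/>'
    b'</trees>'
    b'</tdata>'
)

print(count_trees(io.BytesIO(SAMPLE)))
```

Note that `elem.clear()` empties each finished element but leaves an empty shell attached to its parent, so memory use still grows slowly with the number of sentences.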
  
 ===== Restricting a Suite of PML Files to Contain only a Specific Sentence =====
Line 99: Line 102:
 How do I create a suite of files with just the problematic sentence 345, i.e. files test-w.xml, test-m.xml, test-a.xml and test-t.xml, all properly referenced? A XML-Reader based script by Petr Pajas demonstrates that:
 
-<code>~pajas/projects/pml/separate_t_tree.pl file-t.xml 345
+<code>~pajas/projects/pml/tools/separate_t_tree.pl file-t.xml 345
 </code>
  
Line 109: Line 112:
  
 [[http://www.ltg.ed.ac.uk/software/xml/|LT XML]] - tools like sggrep, sgcount, knit... for handling SGML files on command-line, **SAX-based**, i.e. can handle big files
- 
