padt:start [2011/05/31 11:21] smrz |
padt:start [2013/05/30 12:34] zeman Sentence segmentation. |
| |
http://ufal.mff.cuni.cz/padt/online/ | http://ufal.mff.cuni.cz/padt/online/ |
| |
===== Overview ===== | |
| |
===== Setup ===== | ===== Setup ===== |
Install [[http://ufal.mff.cuni.cz/~pajas/tred/|TrEd]] including the [[http://ufal.mff.cuni.cz/~pajas/tred/extensions/padt/documentation/|padt]] and [[http://ufal.mff.cuni.cz/~pajas/tred/extensions/elixir/documentation/|elixir]] extensions from the default TrEd repository http://ufal.mff.cuni.cz/~pajas/tred/extensions/. | Install [[http://ufal.mff.cuni.cz/~pajas/tred/|TrEd]] including the [[http://ufal.mff.cuni.cz/~pajas/tred/extensions/padt/documentation/|padt]] and [[http://ufal.mff.cuni.cz/~pajas/tred/extensions/elixir/documentation/|elixir]] extensions from the default TrEd repository http://ufal.mff.cuni.cz/~pajas/tred/extensions/. |
| |
The SVN repository of the PADT project is https://svn.ms.mff.cuni.cz/svn/padt/. A working copy is accessible at /net/projects/ace/data/arabic/PADT/ on the UFAL network. | The SVN repository of the PADT project is https://svn.ms.mff.cuni.cz/svn/padt/ (see also [[https://svn.ms.mff.cuni.cz/trac/padt|Trac]]). A working copy is accessible at ''/net/projects/padt'' on the ÚFAL network. |
| |
The project's data are stored in the main subdirectory ''data'', which is split further into ''Prague'', ''Penn'', and ''ElixirFM'', explained below. | The project's data are stored in the main subdirectory ''data'', which is split further into ''Prague'', ''Penn'', and ''ElixirFM'', explained below. |
tred /net/projects/ace/data/arabic/PADT/data/Prague/AEP/UMH_ARB_20040407.0001.{morpho,syntax}.pml | tred /net/projects/ace/data/arabic/PADT/data/Prague/AEP/UMH_ARB_20040407.0001.{morpho,syntax}.pml |
</code> | </code> |
| |
| To improve the display of the various scripts and tree types, you can use the following setup, or a similar one, in TrEd's config file: |
| |
| <file> |
| Font = "family:DejaVu Sans Condensed, size:14, weight:normal" |
| |
| NodeXSkip = 30; |
| NodeYSkip = 10; |
| </file> |
| |
===== Locations ===== | ===== Locations ===== |
data/Prague/XIN/ | data/Prague/XIN/ |
| |
The project's contributors are ''smrz'', ''bielicky'', and ''zabokrtsky''; the rest of ''ufal'' have read-only access. | The project's contributors are ''smrz'', ''bielicky'', ''zabokrtsky'', and ''zeman''; the rest of ''ufal'' have read-only access. |
| |
There is also the ''tools'' directory, which contains some useful scripts. | There is also the ''tools'' directory, which contains some useful scripts. |
| |
===== Agenda ===== | ===== Agenda ===== |
| |
| * Write a block to read the PADT 2.0 data in Treex. An XML schema is needed. |
| * What is the current status of sentence segmentation? Will the Unit element be used in some way? The current trees still correspond to paragraphs, with an average of 38 tokens per tree. The treebank contains 874 files (documents), 7664 trees (paragraphs), and 289910 tokens (non-root nodes). A token is a smaller unit than a word, and it is possible to trace which tokens together formed one word (this concerns the second tokenization, performed as part of morphological analysis; detaching punctuation from words is a different matter). |
| |
Focus on paragraphs/sentences that lack PADT-Morpho annotation, esp. non-annotated headlines: | Focus on paragraphs/sentences that lack PADT-Morpho annotation, esp. non-annotated headlines: |
| |
<code bash> | <code bash> |
btred -QTe '' {Penn,Prague}/???/*.morpho*.pml | btred -QTe '@w = $this->children(); @n = grep { $_->children() } @w; print ThisAddress() . "\n" if @n < 0.9 * @w' Penn/???/*.morpho*.pml |
</code> | </code> |
| |
</code> | </code> |
| |
===== References ===== | |
| |
| There are some other tasks that have been partially solved in PADT, but need to be refreshed and completed: |
| |
| * Retrain the CRF++ model for tagging selected morphological categories and apply it to prune remaining morphological ambiguities. |
| * Refresh and improve the code and rules for converting PATB phrase syntax trees into dependency trees a la PADT. |
| * Update PADT::Syntax annotation context (level synchronization, non-conflicting bindings). |
| * Update PADT::Deeper annotation context (level synchronization, working schemas, modern stylesheets, non-conflicting bindings). |
| * Improve documentation. |
| |
| ===== References ===== |