padt:start [2011/06/30 23:59] smrz |
padt:start [2013/05/30 12:34] zeman Sentence segmentation. |
| |
http://ufal.mff.cuni.cz/padt/online/ | http://ufal.mff.cuni.cz/padt/online/ |
| |
===== Overview ===== | |
| |
===== Setup ===== | ===== Setup ===== |
Install [[http://ufal.mff.cuni.cz/~pajas/tred/|TrEd]] including the [[http://ufal.mff.cuni.cz/~pajas/tred/extensions/padt/documentation/|padt]] and [[http://ufal.mff.cuni.cz/~pajas/tred/extensions/elixir/documentation/|elixir]] extensions from the default TrEd repository http://ufal.mff.cuni.cz/~pajas/tred/extensions/. | Install [[http://ufal.mff.cuni.cz/~pajas/tred/|TrEd]] including the [[http://ufal.mff.cuni.cz/~pajas/tred/extensions/padt/documentation/|padt]] and [[http://ufal.mff.cuni.cz/~pajas/tred/extensions/elixir/documentation/|elixir]] extensions from the default TrEd repository http://ufal.mff.cuni.cz/~pajas/tred/extensions/. |
| |
The SVN repository of the PADT project is https://svn.ms.mff.cuni.cz/svn/padt/. A working copy is accessible at /net/projects/ace/data/arabic/PADT/ on the UFAL network. | The SVN repository of the PADT project is https://svn.ms.mff.cuni.cz/svn/padt/ (see also [[https://svn.ms.mff.cuni.cz/trac/padt|Trac]]). A working copy is accessible at ''/net/projects/padt'' on the ÚFAL network. |
| |
The project's data are stored in the main subdirectory ''data'', which is split further into ''Prague'', ''Penn'', and ''ElixirFM'', explained below. | The project's data are stored in the main subdirectory ''data'', which is split further into ''Prague'', ''Penn'', and ''ElixirFM'', explained below. |
data/Prague/XIN/ | data/Prague/XIN/ |
| |
The project's contributors are ''smrz'', ''bielicky'', and ''zabokrtsky''; the rest of ''ufal'' have read-only access. | The project's contributors are ''smrz'', ''bielicky'', ''zabokrtsky'', and ''zeman''; the rest of ''ufal'' have read-only access. |
| |
There is also the 'tools' directory which contains some useful scripts. | There is also the 'tools' directory which contains some useful scripts. |
| |
The code base for the PADT project, i.e. for annotation, display, and processing of the data, is TrEd's ''padt'' extension together with the ''elixir'' extension, which ''padt'' depends on. | The code base for the PADT project, i.e. for annotation, display, and processing of the data, is TrEd's ''padt'' extension together with the ''elixir'' extension, which ''padt'' depends on. |
| |
===== Agenda ===== | ===== Agenda ===== |
| |
| * Write a block to read the PADT 2.0 data in Treex. An XML schema is needed. |
| * What is the current state of sentence segmentation? Will the Unit element be used in some way? The current trees still correspond to paragraphs, with an average of 38 tokens per tree. The treebank contains 874 files (documents), 7664 trees (paragraphs), and 289910 tokens (non-root nodes). A token is a smaller unit than a word, and it is possible to trace which tokens together formed one word (this concerns the second tokenization, done as part of morphological analysis; detaching punctuation from words is a separate matter). |
| |
Focus on paragraphs/sentences that lack PADT-Morpho annotation, especially non-annotated headlines: | Focus on paragraphs/sentences that lack PADT-Morpho annotation, especially non-annotated headlines: |
| |
| |
There are some other task that have been partially solved, but need to be refreshed and completed: | There are some other tasks that have been partially solved in PADT, but need to be refreshed and completed: |
| |
* Retrain the CRF++ model for tagging selected morphological categories and apply it to prune remaining morphological ambiguities. | * Retrain the CRF++ model for tagging selected morphological categories and apply it to prune remaining morphological ambiguities. |
* Refresh and improve the code and rules for converting PATB phrase syntax trees into dependency trees a la PADT. | * Refresh and improve the code and rules for converting PATB phrase syntax trees into dependency trees a la PADT. |
* Update PADT::Syntax annotation context (level synchronization, non-conflicting bindings). | * Update PADT::Syntax annotation context (level synchronization, non-conflicting bindings). |
* Update PADT::Deeper annotation context (level synchronization, working schemas, modern stylesheets, non-conflicting bindings). | * Update PADT::Deeper annotation context (level synchronization, working schemas, modern stylesheets, non-conflicting bindings). |
* Improve documentation. | * Improve documentation. |
| |
===== References ===== | ===== References ===== |