padt:start [2013/05/27 17:25] zeman |
padt:start [2013/05/30 12:34] zeman Sentence segmentation. |
Install [[http://ufal.mff.cuni.cz/~pajas/tred/|TrEd]] including the [[http://ufal.mff.cuni.cz/~pajas/tred/extensions/padt/documentation/|padt]] and [[http://ufal.mff.cuni.cz/~pajas/tred/extensions/elixir/documentation/|elixir]] extensions from the default TrEd repository http://ufal.mff.cuni.cz/~pajas/tred/extensions/. | Install [[http://ufal.mff.cuni.cz/~pajas/tred/|TrEd]] including the [[http://ufal.mff.cuni.cz/~pajas/tred/extensions/padt/documentation/|padt]] and [[http://ufal.mff.cuni.cz/~pajas/tred/extensions/elixir/documentation/|elixir]] extensions from the default TrEd repository http://ufal.mff.cuni.cz/~pajas/tred/extensions/. |
| |
The SVN repository of the PADT project is https://svn.ms.mff.cuni.cz/svn/padt/. A working copy is accessible at ''/net/projects/padt'' on the ÚFAL network. | The SVN repository of the PADT project is https://svn.ms.mff.cuni.cz/svn/padt/ (see also [[https://svn.ms.mff.cuni.cz/trac/padt|Trac]]). A working copy is accessible at ''/net/projects/padt'' on the ÚFAL network. |
| |
The project's data are stored in the main subdirectory ''data'', which is split further into ''Prague'', ''Penn'', and ''ElixirFM'', explained below. | The project's data are stored in the main subdirectory ''data'', which is split further into ''Prague'', ''Penn'', and ''ElixirFM'', explained below. |
data/Prague/XIN/ | data/Prague/XIN/ |
| |
The project's contributors are ''smrz'', ''bielicky'', and ''zabokrtsky''; the rest of ''ufal'' have read-only access. | The project's contributors are ''smrz'', ''bielicky'', ''zabokrtsky'', and ''zeman''; the rest of ''ufal'' have read-only access. |
| |
There is also the 'tools' directory which contains some useful scripts. | There is also the 'tools' directory which contains some useful scripts. |
| |
The code base for the PADT project, i.e. for annotation, display, and processing of the data, is the TrEd's ''padt'' extension, and its ''elixir'' extension that is a dependency for ''padt''. | The code base for the PADT project, i.e. for annotation, display, and processing of the data, is the TrEd's ''padt'' extension, and its ''elixir'' extension that is a dependency for ''padt''. |
| |
===== Agenda ===== | ===== Agenda ===== |
| |
| * Write a block to read the PADT 2.0 data in Treex. An XML schema is needed. |
| * What is the current state of sentence segmentation? Will the Unit element be used in some way? The current trees still correspond to paragraphs, with an average of 38 tokens per tree. The treebank contains 874 files (documents), 7664 trees (paragraphs), and 289910 tokens (non-root nodes). A token is a smaller unit than a word, and it is possible to trace which tokens together formed one word (this concerns the second tokenization performed as part of morphological analysis; detaching punctuation from words is a different matter). |
| |
Focus on paragraphs/sentences that miss PADT-Morpho annotation, esp. non-annotated headlines: | Focus on paragraphs/sentences that miss PADT-Morpho annotation, esp. non-annotated headlines: |