Prague Arabic Dependency Treebank

http://ufal.mff.cuni.cz/padt/online/

Setup

Install TrEd together with the padt and elixir extensions, which are available from the default TrEd extension repository, http://ufal.mff.cuni.cz/~pajas/tred/extensions/.

The SVN repository of the PADT project is https://svn.ms.mff.cuni.cz/svn/padt/ (see also Trac). A working copy is accessible at /net/projects/padt on the ÚFAL network.

The project's data are stored in the main subdirectory data, which is split further into Prague, Penn, and ElixirFM, explained below.

Try opening a PADT file to check if your setup is complete. Run TrEd and open the following files. They should automatically set their editing contexts and stylesheets to PADT::Morpho and PADT::Syntax, respectively:

tred /net/projects/ace/data/arabic/PADT/data/Prague/AEP/UMH_ARB_20040407.0001.{morpho,syntax}.pml

To improve the display of the various scripts and tree types, you can use settings such as the following in TrEd's config file:

Font = "family:DejaVu Sans Condensed, size:14, weight:normal"

NodeXSkip = 30;
NodeYSkip = 10;

Locations

The main subdirectory data of the SVN repository (https://svn.ms.mff.cuni.cz/svn/padt/) is split into ElixirFM, Prague, and Penn, which are organized further as follows:

data/ElixirFM/

data/Penn/1v3/
data/Penn/2v2/
data/Penn/3v2/
data/Penn/4v1/

data/Prague/AEP/
data/Prague/ASB/
data/Prague/EAT/
data/Prague/HYT/
data/Prague/NHR/
data/Prague/XIN/
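
The layout above can be checked on a working copy with a short script. The following is only a minimal sketch in Python: the function name count_pml_files and its layer parameter are hypothetical, and the file pattern *.<layer>*.pml is an assumption borrowed from the btred commands in the Agenda section.

```python
from pathlib import Path

# Subcorpus directory names copied from the listing above.
PRAGUE = ["AEP", "ASB", "EAT", "HYT", "NHR", "XIN"]
PENN = ["1v3", "2v2", "3v2", "4v1"]


def count_pml_files(data_root, layer="syntax"):
    """Map each subcorpus (relative to data_root) to its number of
    *.<layer>*.pml files; missing directories count as 0.

    The name and the file pattern are illustrative assumptions,
    not part of the PADT tools.
    """
    root = Path(data_root)
    counts = {}
    for group, names in (("Prague", PRAGUE), ("Penn", PENN)):
        for name in names:
            directory = root / group / name
            if directory.is_dir():
                found = list(directory.glob(f"*.{layer}*.pml"))
            else:
                found = []
            counts[f"{group}/{name}"] = len(found)
    return counts
```

Calling count_pml_files("data") from the root of a working copy would report, per subcorpus, how many syntax files are present; passing layer="morpho" does the same for the morphological layer.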

The project's contributors are smrz, bielicky, zabokrtsky, and zeman; everyone else at ufal has read-only access.

There is also a tools directory, which contains some useful scripts.

The code base of the PADT project, i.e. the code for annotating, displaying, and processing the data, is TrEd's padt extension together with the elixir extension, which padt depends on.

Agenda

Write a block to read the PADT 2.0 data in Treex. An XML schema is needed.
What is the current state of sentence segmentation? Will the Unit element be used in some way? The current trees still correspond to paragraphs, with an average of 38 tokens per tree. The treebank contains 874 files (documents), 7664 trees (paragraphs), and 289910 tokens (non-root nodes). A token is a smaller unit than a word, and it is possible to trace which tokens together formed one word (this concerns the second tokenization, carried out as part of morphological analysis; separating punctuation from words is a different matter).
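
As a quick sanity check, the average tree size follows directly from the three counts quoted above. The variable names in this small Python sketch are illustrative only:

```python
# Counts quoted in the text above.
FILES = 874
TREES = 7664
TOKENS = 289910

tokens_per_tree = TOKENS / TREES
trees_per_file = TREES / FILES

print(f"tokens per tree: {tokens_per_tree:.1f}")  # 37.8, i.e. about 38
print(f"trees per file:  {trees_per_file:.1f}")   # 8.8
```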

Focus on paragraphs/sentences that lack PADT-Morpho annotation, esp. non-annotated headlines:

btred -QTe '@w = $this->children(); @n = grep { $_->children() } @w; print ThisAddress() . "\n" if @n < 0.9 * @w' Penn/???/*.morpho*.pml
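
The test in the one-liner above reports a tree when fewer than 90% of the root's children have children of their own. The same condition can be sketched in Python on a toy structure, assuming that the root's children are word nodes and that their children carry the morphological annotation; the function name is_under_annotated and the list-of-lists representation are hypothetical:

```python
def is_under_annotated(word_nodes, threshold=0.9):
    """Return True when fewer than `threshold` of the words have children.

    `word_nodes` holds one list per word under the tree root, containing
    that word's child nodes (an empty list for a word without annotation).
    This mirrors `@n < 0.9 * @w` from the btred command.
    """
    annotated = [children for children in word_nodes if children]
    return len(annotated) < threshold * len(word_nodes)


# Toy trees: a headline with one word left without annotation, and a full one.
headline = [["tok"], [], ["tok", "tok"]]
paragraph = [["tok"]] * 10

print(is_under_annotated(headline))   # True: 2 of 3 words are annotated
print(is_under_annotated(paragraph))  # False: all 10 words are annotated
```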

Focus on nodes in PADT-Syntax that do not have a valid afun annotation:

btred -QTNe 'print ThisAddress() . "\n" if exists $this->{"afun"} and $this->{"afun"} eq "???"' Prague/???/*.syntax*.pml
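
The command above visits every node and reports those whose afun attribute exists and still holds the placeholder "???". A stand-alone Python sketch of that filter follows; the nested-dictionary tree and the name nodes_with_unset_afun are assumptions made for illustration and do not reflect the actual PML data model:

```python
def nodes_with_unset_afun(node, placeholder="???"):
    """Yield every node in the subtree whose 'afun' equals the placeholder.

    A node is a dict with an optional 'afun' key and an optional
    'children' list; nodes without 'afun' are skipped, as in the
    btred command.
    """
    if node.get("afun") == placeholder:
        yield node
    for child in node.get("children", []):
        yield from nodes_with_unset_afun(child, placeholder)


# Toy tree: a root without afun, one annotated node, and two unset nodes.
tree = {
    "id": "root",
    "children": [
        {"id": "n1", "afun": "Pred", "children": [
            {"id": "n2", "afun": "???"},
        ]},
        {"id": "n3", "afun": "???"},
    ],
}

print([n["id"] for n in nodes_with_unset_afun(tree)])  # ['n2', 'n3']
```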

There are some other tasks that have been partially solved in PADT, but need to be refreshed and completed:

Retrain the CRF++ model for tagging selected morphological categories and apply it to prune remaining morphological ambiguities.
Refresh and improve the code and rules for converting PATB phrase-structure trees into PADT-style dependency trees.
Update PADT::Syntax annotation context (level synchronization, non-conflicting bindings).
Update PADT::Deeper annotation context (level synchronization, working schemas, modern stylesheets, non-conflicting bindings).
Improve documentation.
