Displaying Giza Alignments in Tred
Note: Tred is a dependency-tree editor. It was not designed to work with bilingual alignment. However, since Tred is highly customizable, displaying anything where one needs to connect words (nodes) can in theory be accomplished with Tred. (And note the emphasis on displaying, as opposed to editing.)
Tred can be forced to visualise bilingual alignment, which is useful for error analysis in machine translation. If you can easily see which alignments you get wrong, you may be able to develop a better alignment system, and hence a better translation system.
This page provides a simple Perl script called align2fs.pl (/home/zeman/projekty/hiero/svncode/visualization-tools/align2fs.pl). It reads Giza alignments from the standard input and writes an FS file (trees for Tred) to the standard output. The program does not need any arguments. If arguments are given, they are interpreted as paths to the input files, which are then read instead of the standard input stream.
The input has three lines per aligned segment: source language words, target language words, and the alignment description. Each alignment link has the form i-j, where i and j are the zero-based indices of the source and the target word, respectively. Example:
上海 浦东 开发 与 法制 建设 同步
the development of shanghai 's pudong is in step with the establishment of its legal system
0-3 1-5 2-1 3-9 3-10 4-14 4-15 5-13 6-8
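To make the three-line format concrete, here is a minimal sketch of how one segment could be split into words and index pairs. The helper name parse_segment is hypothetical and is not part of align2fs.pl. The range check is an extra assumption about what counts as valid input, not something the script itself enforces.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ":utf8");

# Hypothetical helper (not part of align2fs.pl): split one three-line segment
# into source words, target words and [source, target] index pairs.
sub parse_segment {
    my ($src_line, $tgt_line, $align_line) = @_;
    # split(' ', ...) splits on whitespace and ignores leading whitespace.
    my @src = split(' ', $src_line);
    my @tgt = split(' ', $tgt_line);
    my @pairs;
    foreach my $link (split(' ', $align_line)) {
        # Each link is assumed to have the form "i-j" with zero-based indices.
        my ($i, $j) = $link =~ m/^(\d+)-(\d+)$/
            or die("Malformed alignment link '$link'\n");
        # Assumption: a link pointing outside the segment is an input error.
        die("Link '$link' points outside the segment\n")
            if $i > $#src || $j > $#tgt;
        push(@pairs, [$i, $j]);
    }
    return { src => \@src, tgt => \@tgt, pairs => \@pairs };
}

# Demonstration on the example segment from this page.
my $seg = parse_segment(
    "上海 浦东 开发 与 法制 建设 同步",
    "the development of shanghai 's pudong is in step with the establishment of its legal system",
    "0-3 1-5 2-1 3-9 3-10 4-14 4-15 5-13 6-8",
);
foreach my $pair (@{$seg->{pairs}}) {
    my ($i, $j) = @{$pair};
    # Print each aligned word pair, one link per line.
    print($seg->{src}[$i], " -> ", $seg->{tgt}[$j], "\n");
}
```

Running such a check before the conversion may help to catch truncated or shifted input early, because a missing line would otherwise shift all following segments.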
Usage:
align2fs.pl < alignment.giza > alignment.fs
tred alignment.fs
And this is what the output looks like in Tred:
The Script
And here is the source code of the script:
#!/usr/bin/perl
# Converts GIZA alignments into Tred FS format, to visualise them in Tred.
# (Note: Tred is primarily a dependency tree editor but fortunately it can be forced to display even alignments.)
# (c) 2007 Dan Zeman <zeman@ufal.mff.cuni.cz>
# License: GNU GPL

use utf8;
use open ":utf8";
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

# Write the FS header.
print <<EOF
\@P word
\@O word
\@P index
\@O index
\@P align
\@O align
\@P gloss
\@O gloss
\@VA caption
\@P caption
\@O caption
\@N ord
EOF
;
# Read input GIZA alignments, write FS trees.
while(<>)
{
    # We just read the source language segment.
    my $src = $_;
    # We have to read two more lines: the target language segment and the alignment.
    my $tgt = <>;
    my $align = <>;
    # Escape any characters special to FS format.
    $src =~ s/([,\[\]\|\\])/\\$1/g;
    $tgt =~ s/([,\[\]\|\\])/\\$1/g;
    # Convert source segment to array of words (nodes).
    $src =~ s/^\s+//;
    $src =~ s/\s+$//;
    my @src = split(/\s+/, $src);
    # Convert target segment to array of words (nodes).
    $tgt =~ s/^\s+//;
    $tgt =~ s/\s+$//;
    my @tgt = split(/\s+/, $tgt);
    # Convert alignments to array.
    $align =~ s/^\s+//;
    $align =~ s/\s+$//;
    my @align = split(/\s+/, $align);
    # Prepare a tree for Tred. All source nodes should be on level one, all target words on level two.
    # Thus we make all source nodes dependents of root, all target nodes dependents of the last source node.
    # Also, the ordering of the nodes indicated in the ord attribute should mix source and target nodes
    # until one or the other are exhausted. Otherwise, all target words would be displayed to the right
    # of the rightmost source word.
    my $current = -1;
    my @srcindex;
    my @tgtindex;
    # Index of the last word of the longer of the two segments.
    my $max = $#tgt>$#src ? $#tgt : $#src;
    for(my $i = 0; $i<=$max; $i++)
    {
        if($i<=$#src)
        {
            $current++;
            # Remember mix index of ith source word.
            $srcindex[$i] = $current;
        }
        if($i<=$#tgt)
        {
            $current++;
            # Remember mix index of ith target word.
            $tgtindex[$i] = $current;
        }
    }
    # Print tree.
    # The "caption" attribute will be displayed in the tree caption window but not at the nodes.
    # We use it to store the original sentence.
    my $caption = join(" ", (@src, "\\|", @tgt));
    print("[root,caption=$caption](");
    # Print all source nodes (level one).
    my @srcout;
    for(my $i = 0; $i<=$#src; $i++)
    {
        # Find alignments originating here.
        my @algrp = grep {m/^$i-/} @align;
        my $align = join("\\,", @algrp);
        my $gloss = join(" ", map {m/-(\d+)/; $tgt[$1]} @algrp);
        my %rec =
        (
            word => $src[$i],
            index => $i,
            align => $align,
            gloss => $gloss,
            ord => $srcindex[$i],
        );
        push(@srcout, \%rec);
    }
    print(join(",", map {"[$_->{word},$_->{index},$_->{align},$_->{gloss},ord=$_->{ord}]"} @srcout), "(");
    # Print all target nodes (level two) and close both levels.
    my @tgtout;
    for(my $i = 0; $i<=$#tgt; $i++)
    {
        # Find alignments ending here.
        my @algrp = grep {m/-$i$/} @align;
        my $align = join("\\,", @algrp);
        my $gloss = join(" ", map {m/(\d+)-/; $src[$1]} @algrp);
        my %rec =
        (
            word => $tgt[$i],
            index => $i,
            align => $align,
            gloss => $gloss,
            ord => $tgtindex[$i],
        );
        push(@tgtout, \%rec);
    }
    print(join(",", map {"[$_->{word},$_->{index},$_->{align},$_->{gloss},ord=$_->{ord}]"} @tgtout), "))\n");
}
# Print footer of the FS file with Tred formatting instructions.
print("\n");
# By default, the word attribute is displayed.
print("//Tred:Custom-Attribute:node: \${word}\n");
# By default, the index attribute is displayed next.
print("//Tred:Custom-Attribute:node: \${index}\n");
# By default, the align attribute is displayed next. Comment the print statement if you want it hidden by default.
print("//Tred:Custom-Attribute:node: \${align}\n");
# By default, the gloss attribute is displayed next in blue.
print("//Tred:Custom-Attribute:node: #{blue}\${gloss}\n");
# We have our own alignment links and do not want to see the artificial tree edges.
# Let's try to draw our own links.
# xn can be computed: x[attribute_name = value]
# the whole thing can be returned by something like <?if(\$\${root}){"#{Line-fill:white}"}?>
# perl code:
# my $coor = join("&", map {s/-.*//; "xn,yn,x[index=$_],y[index=$_]"} split(/,/, $${align}));
# style instruction with code evaluation:
# <?if($${lang} ne "src") {"#{Line-coords:".join("&", map {s/-.*//; "xn,yn,x[index=$_],y[index=$_]"} split(/,/, $${align}))."}"}?>
print("//Tred:Custom-Attribute:style: <?if(\$\${align} ne \"\"){\"#{Line-coords:\".join(\"&\", map {s/-.*//; \"xn,yn,x[index=\$_],y[index=\$_]\"} split(/,/, \$\${align})).\"}\"} else {\"#{Line-fill:white}\"}?>\n");
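The script's comments describe mixing source and target nodes in the ord attribute so that target words are not all drawn to the right of the last source word. That ordering can be tried in isolation with the sketch below. The helper name interleave_ord is hypothetical, and it takes word counts rather than word arrays purely to keep the example short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of align2fs.pl): compute interleaved "ord"
# values for $n_src source words and $n_tgt target words.
sub interleave_ord {
    my ($n_src, $n_tgt) = @_;
    my (@srcindex, @tgtindex);
    my $current = -1;
    # Iterate up to the length of the longer segment.
    my $max = $n_src > $n_tgt ? $n_src : $n_tgt;
    for (my $i = 0; $i < $max; $i++) {
        # Source word i comes first, then target word i, while both exist.
        $srcindex[$i] = ++$current if $i < $n_src;
        $tgtindex[$i] = ++$current if $i < $n_tgt;
    }
    return (\@srcindex, \@tgtindex);
}

# Two source words and four target words.
my ($src_ord, $tgt_ord) = interleave_ord(2, 4);
print("source ord: @{$src_ord}\n");   # prints "source ord: 0 2"
print("target ord: @{$tgt_ord}\n");   # prints "target ord: 1 3 4 5"
```

Once the shorter segment is exhausted, the remaining words of the longer one simply receive consecutive positions.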