
Institute of Formal and Applied Linguistics Wiki


Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

written by Dipanjan Das (Carnegie Mellon University) and Slav Petrov (Google Research)

presented by Michal Novák

reported by Marie Konárová

Introduction

On Monday, October 31st, 2011, Michal spoke about a paper which received the Best Paper award at the ACL 2011 conference.
The paper shows how to build an unsupervised part-of-speech tagger for a language that has no labeled training data, provided that a bilingual (parallel) corpus is available in which the other language is labeled.

Notes

The presented paper focuses on building a graph from a combination of a parallel corpus and a larger monolingual corpus of the unlabeled language. We discussed the possibility of using solely the parallel corpus and agreed that it may be possible. The steps of building the graph and the label propagation problem were clarified.

The authors use a monolingual similarity function (for the weights of edges between vertices of the same language) and a bilingual similarity function (for the weights of edges between vertices of the two languages) to set the edge weights used in label propagation. We then discussed the paper's conclusion that label propagation brings more information about the language than the approach without propagation.
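To make the idea of label propagation concrete, here is a minimal sketch under simplifying assumptions. It is not the objective optimized in the paper: seed vertices are simply clamped to their projected tag distributions, and every other vertex repeatedly takes the similarity-weighted average of its neighbours. The function name `propagate_labels`, the toy graph, and the weights are all hypothetical.

```python
def propagate_labels(weights, seeds, tags, iterations=10):
    """Simplified label propagation (an illustration, not the paper's method).

    weights: dict vertex -> dict neighbour -> similarity weight
    seeds:   dict vertex -> dict tag -> probability (e.g. projected labels)
    tags:    list of all tag names
    """
    uniform = {t: 1.0 / len(tags) for t in tags}
    # Unlabeled vertices start from a uniform distribution over tags.
    dist = {v: dict(seeds.get(v, uniform)) for v in weights}
    for _ in range(iterations):
        new_dist = {}
        for v, neighbours in weights.items():
            if v in seeds:
                # Assumption: seed distributions are kept fixed (clamped).
                new_dist[v] = dict(seeds[v])
                continue
            total = sum(neighbours.values())
            if total == 0:
                new_dist[v] = dict(uniform)
                continue
            # Weighted average of the neighbours' current distributions.
            new_dist[v] = {
                t: sum(w * dist[u].get(t, 0.0) for u, w in neighbours.items()) / total
                for t in tags
            }
        dist = new_dist
    return dist


# Toy chain a - b - c: "a" and "c" carry projected labels, "b" has none.
toy_weights = {"a": {"b": 3.0}, "b": {"a": 3.0, "c": 1.0}, "c": {"b": 1.0}}
toy_seeds = {"a": {"NOUN": 1.0}, "c": {"VERB": 1.0}}
result = propagate_labels(toy_weights, toy_seeds, ["NOUN", "VERB"])
```

In this toy graph the unlabeled vertex "b" ends up closer to the tag of "a", because the edge to "a" has the larger weight.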

Several issues related to the paper were also discussed. For example, we debated the various models that the paper evaluates in order to analyze the graph-based approach. We could not decide whether the EM-HMM is just a special case of the Feature-HMM or not.
Another issue discussed was the difference between a constraint feature and given tags.

Only 12 tags were used for tagging in the paper. Hence the authors have to map the original tagset of the language being tagged to these 12 tags before evaluation. We wondered how this mapping would work with the 4000 tags of the PDT (Prague Dependency Treebank).
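One way such a mapping might look for the PDT is sketched below. It assumes that the first position of a PDT positional tag encodes the part of speech and maps it to the 12 coarse categories (NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ".", X). The table `PDT_FIRST_POSITION_TO_COARSE` and the function `to_coarse_tag` are hypothetical illustrations, not the mapping used by the authors.

```python
# Hypothetical mapping from the first position of a PDT positional tag
# to the 12 coarse tags; an illustration only, not the paper's mapping.
PDT_FIRST_POSITION_TO_COARSE = {
    "N": "NOUN",  # noun
    "A": "ADJ",   # adjective
    "P": "PRON",  # pronoun
    "C": "NUM",   # numeral
    "V": "VERB",  # verb
    "D": "ADV",   # adverb
    "R": "ADP",   # preposition
    "J": "CONJ",  # conjunction
    "T": "PRT",   # particle
    "Z": ".",     # punctuation
    "I": "X",     # interjection (assumed to fall into the catch-all class)
    "X": "X",     # unknown
}


def to_coarse_tag(pdt_tag):
    """Map a positional tag to a coarse tag; unknown or empty tags become X."""
    if not pdt_tag:
        return "X"
    return PDT_FIRST_POSITION_TO_COARSE.get(pdt_tag[0], "X")


coarse = [to_coarse_tag(t) for t in ["NNFS1-----A----", "VB-S---3P-AA---", "Z:-------------"]]
```

Note that in this sketch no PDT category maps to DET, and all the morphological detail in the remaining positions is discarded, so a mapping of this kind is lossy.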

At the end of the session we talked about combining the presented method with a small amount of labeled data for the language that initially had no labels at all. Manually labeling, for example, 100 sentences is relatively easy and inexpensive, and it could bring a substantial improvement.

Conclusion

The paper was well presented and interesting to read. It raised several interesting topics for later discussion.

