Institute of Formal and Applied Linguistics Wiki


Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

TODO: report copied from last week's session, needs rewriting

written by Dipanjan Das (Carnegie Mellon University) and Slav Petrov (Google Research)

spoken by Marie Konárová

reported by Martin Majlis

Introduction

On Monday, November 14th 2011, Marie spoke about a paper on unsupervised part-of-speech tagging using bilingual graph-based projections.

Notes

The presented paper focuses on building a graph from a combination of a parallel corpus and a larger monolingual corpus of the unlabeled language. We discussed the possibility of using only the parallel corpus and agreed that it may be possible. The steps of building the graph and the label propagation problem were clarified.

The authors of the presented paper use a monolingual similarity function (the weights between vertices of one language) and a bilingual similarity function (the weights between vertices of two different languages) to set the edge weight between two vertices for the label propagation. We then discussed the paper's conclusion that label propagation brings more information about the language than the approach without propagation.
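The two kinds of edge weights and the propagation step can be illustrated with a small sketch. It is a simplified stand-in, not the objective or the update rule from the paper: the vertices are plain words rather than the units used by the authors, the tag set is cut down to three tags, and all names (`project`, `propagate`, `uniform_weight`) and all weights are made up for illustration.

```python
# Simplified sketch of projection + label propagation (NOT the paper's exact method).
# All names, vertices and weights below are hypothetical illustration values.

TAGS = ["NOUN", "VERB", "DET"]


def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    return {tag: value / total for tag, value in scores.items()}


def project(bilingual_edges, source_tags):
    """Stage 1: give each aligned target vertex the weighted average
    of the tag distributions of its source-language neighbours."""
    seeds = {}
    for target, edges in bilingual_edges.items():
        scores = {tag: 0.0 for tag in TAGS}
        for source, weight in edges.items():
            for tag, prob in source_tags[source].items():
                scores[tag] += weight * prob
        seeds[target] = normalize(scores)
    return seeds


def propagate(mono_edges, seeds, iterations=20, uniform_weight=0.01):
    """Stage 2: spread the projected distributions over monolingual edges.
    Seeded vertices stay fixed; every other vertex takes the weighted
    average of its neighbours plus a small pull towards uniform."""
    vertices = set(mono_edges) | set(seeds)
    uniform = {tag: 1.0 / len(TAGS) for tag in TAGS}
    q = {v: dict(seeds.get(v, uniform)) for v in vertices}
    for _ in range(iterations):
        new_q = {}
        for v in vertices:
            if v in seeds:
                new_q[v] = dict(seeds[v])
                continue
            scores = {tag: uniform_weight * uniform[tag] for tag in TAGS}
            for u, weight in mono_edges.get(v, {}).items():
                for tag in TAGS:
                    scores[tag] += weight * q[u][tag]
            new_q[v] = normalize(scores)
        q = new_q
    return q


# Tag distributions of the labeled (source) language.
source_tags = {
    "the": {"DET": 1.0},
    "dog": {"NOUN": 1.0},
    "runs": {"VERB": 0.8, "NOUN": 0.2},
}
# Bilingual weights, e.g. derived from word-alignment counts.
bilingual_edges = {"der": {"the": 5.0}, "Hund": {"dog": 3.0}, "rennt": {"runs": 2.0}}
# Symmetric monolingual similarities; "Katze" has no alignment of its own.
mono_edges = {
    "Katze": {"Hund": 0.9, "rennt": 0.1},
    "Hund": {"Katze": 0.9},
    "rennt": {"Katze": 0.1},
}

tag_distributions = propagate(mono_edges, project(bilingual_edges, source_tags))
print(tag_distributions["Katze"])
```

In this toy graph the unaligned vertex "Katze" receives most of its probability mass from its strongly similar neighbour "Hund", which is the kind of extra information the propagation step is meant to provide.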

Several issues related to the paper were also discussed. For example, we debated the various models evaluated in the paper to analyse the graph-based approach. We could not decide whether the EM-HMM is just a special case of the Feature-HMM or not.
Another issue discussed was the difference between a constraint feature and given tags.

Only 12 tags were used for tagging in the paper. Hence the original tags of the tagged language have to be mapped to these 12 tags before evaluation. We wondered how this mapping would work with the roughly 4000 tags of the PDT.
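One conceivable answer to the PDT question is sketched below. It assumes that a PDT positional tag can be collapsed by looking only at its first position (the main part of speech). The table `PDT_POS_TO_UNIVERSAL` and the function `to_universal` are hypothetical and are not the mapping used in the paper; the example tags are only illustrative.

```python
# Hypothetical sketch: collapse PDT positional tags to a 12-tag coarse set
# using only the first position. This is NOT the mapping used in the paper.

UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}

# Assumed correspondence between the first PDT position and a coarse tag.
PDT_POS_TO_UNIVERSAL = {
    "N": "NOUN",  # noun
    "A": "ADJ",   # adjective
    "P": "PRON",  # pronoun
    "C": "NUM",   # numeral
    "V": "VERB",  # verb
    "D": "ADV",   # adverb
    "R": "ADP",   # preposition
    "J": "CONJ",  # conjunction
    "T": "PRT",   # particle
    "I": "X",     # interjection, no closer coarse tag
    "Z": ".",     # punctuation
    "X": "X",     # unknown
}


def to_universal(pdt_tag):
    """Map a positional tag to a coarse tag; unknown or empty tags become X."""
    return PDT_POS_TO_UNIVERSAL.get(pdt_tag[:1], "X")


for tag in ["NNFS1-----A----", "VB-S---3P-AA---", "Z:-------------"]:
    print(tag, "->", to_universal(tag))
```

Under this assumption the size of the fine-grained tag set does not matter much, because thousands of positional tags share the same first position. The sketch also shows a limitation: nothing maps to DET here, so a real mapping would need more than the first position.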

At the end of the session we spoke about combining the presented method with a small amount of labeled data for the language that was initially not labeled at all. Manually labeling, for example, 100 sentences is relatively easy and inexpensive, and it could bring a substantial improvement.

Conclusion

The paper was well presented and interesting to read. It raised several interesting topics for the subsequent discussion.
