====== Using a Wikipedia-based Semantic Relatedness Measure for Document Clustering ======

written by Majid Yazdani (Idiap Research Institute and EPFL) and Andrei Popescu-Belis (Idiap Research Institute)

===== Notes =====
The presented paper focused on document clustering using a random walk model. In the first phase, the authors built a document network from Wikipedia (the WEX dataset; Metaweb Technologies, 2010), excluding articles from the following categories: Talk, File, Image, Template, Category, Portal, and List. Moreover, short articles (fewer than 100 non-stop words) and disambiguation pages were removed, so 1,264,611 articles were kept, forming the nodes of their network.

They used two types of links between nodes: hyperlinks (each document is connected to the documents it refers to) and links based on document similarity (each document is connected to the 10 articles that are most similar to it).
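
As a rough, small-scale illustration of this construction step, here is a minimal sketch in Python; the function name, the use of scikit-learn TF-IDF vectors, and the input format are our own assumptions, not the authors' implementation (which must avoid a dense similarity matrix at Wikipedia scale):

<code python>
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_network(texts, hyperlinks, k=10):
    """Hypothetical sketch: build a document graph with two edge types,
    hyperlink edges and top-k content-similarity edges."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sim = cosine_similarity(tfidf)            # pairwise content similarity
    np.fill_diagonal(sim, 0.0)                # exclude self-similarity
    edges = set()
    for i, targets in hyperlinks.items():     # hyperlink edges
        edges.update((i, j) for j in targets)
    for i in range(sim.shape[0]):             # link each node to its
        for j in np.argsort(sim[i])[-k:]:     # k most similar articles
            edges.add((i, int(j)))
    return edges
</code>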
For computing relatedness in the network they used the visiting probability of a random walk model. Since the exact model is not computationally tractable, they introduced an approximation algorithm and computed (and proved) an upper bound on its truncation error.
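
The visiting probability can be sketched as a truncated walk over a row-stochastic transition matrix. The sketch below is a plain truncated computation under that assumption; it does not reproduce the authors' exact approximation algorithm or its error bound:

<code python>
import numpy as np

def visiting_probability(P, source, target, T=10):
    """Hypothetical sketch: probability that a walk from `source`
    visits `target` within T steps; P is a row-stochastic matrix."""
    dist = np.zeros(P.shape[0])
    dist[source] = 1.0
    visited = 0.0
    for _ in range(T):
        dist = dist @ P           # advance the walk one step
        visited += dist[target]   # mass reaching the target is absorbed
        dist[target] = 0.0        # so revisits are not double-counted
    return visited
</code>

The probability mass still circulating after T steps bounds the truncation error, which is why cutting the walk off early yields a controlled approximation.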
They used this model to solve the document clustering problem on the 20 Newsgroups dataset, testing different parameter setups to find the best combination. Their optimal setup outperforms previously published results. The proposed approximation algorithms made the computation efficient for large graphs (more than one million nodes and 40 million links) and large text clustering datasets (20,000 documents in 20 Newsgroups).
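
As a hypothetical usage example, a precomputed relatedness matrix of this kind could be fed to an off-the-shelf clusterer; the paper's own clustering procedure may differ, so this is only an illustration:

<code python>
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_documents(relatedness, n_clusters=20):
    """Hypothetical sketch: cluster documents from a precomputed
    relatedness matrix, e.g. pairwise visiting probabilities."""
    affinity = (relatedness + relatedness.T) / 2.0   # enforce symmetry
    model = SpectralClustering(n_clusters=n_clusters,
                               affinity="precomputed")
    return model.fit_predict(affinity)
</code>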

===== Conclusion =====

The paper was well presented and interesting to read. It brought up several interesting topics for the later discussion.