Institute of Formal and Applied Linguistics Wiki

This shows you the differences between two versions of the page.

courses:rg:using-a-wikipedia-based-semantic-relatedness-measure-for-document-clustering [2011/11/23 11:29]
majlis created
courses:rg:using-a-wikipedia-based-semantic-relatedness-measure-for-document-clustering [2012/01/01 20:27] (current)
Line 1: Line 1:
-====== Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections ======
+====== Using a Wikipedia-based Semantic Relatedness Measure for Document Clustering ======
-
-TODO: report stolen from last week - rewrite
 written by Majid Yazdani (Idiap Research Institute and EPFL) and Andrei Popescu-Belis (Idiap Research Institute)
Line 15: Line 13:
 ===== Notes =====
-Presented paper focused on building a graph from a combination of a parallel corpus and a bigger monolingual corpus of non-labeled language. We discussed the possibility of using solely the parallel corpus and we agreed that it may be possible. The steps of building the graph and the label propagation problem were clarified.
+The presented paper focuses on document clustering using a random walk model. In the first phase, the authors built a document network from Wikipedia (WEX dataset, Metaweb Technologies, 2010), excluding articles from the following categories: Talk, File, Image, Template, Category, Portal, and List. Moreover, short articles (fewer than 100 non-stop words) and disambiguation pages were removed, so 1,264,611 articles were kept, forming the nodes of the network.
-
+They used two types of links between nodes: hyperlinks (each document is connected to the documents it refers to) and links based on document similarity (each document is connected to the 10 articles that are most similar to it).
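The similarity-based links can be illustrated with a minimal Python sketch. It assumes that documents are already represented as numeric vectors and that similarity is measured by cosine similarity; neither detail is stated in these notes, and the function name `top_k_similarity_links` is hypothetical.

```python
import numpy as np

def top_k_similarity_links(doc_vectors, k=10):
    """Link every document to its k most similar documents.

    Assumption: cosine similarity over precomputed document vectors.
    The notes only say "most similar", so the measure is illustrative.
    """
    X = np.asarray(doc_vectors, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty documents
    X = X / norms
    sim = X @ X.T
    # A document must not be linked to itself.
    np.fill_diagonal(sim, -np.inf)
    k = min(k, len(X) - 1)
    # Indices of the k highest similarities in each row.
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    return {i: neighbours[i].tolist() for i in range(len(X))}
```

The dense similarity matrix keeps the sketch short, but it would not scale to the 1,264,611 articles mentioned above; a sparse or approximate nearest-neighbour search would be needed there.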
-The authors of the presented paper use a monolingual similarity (the weights between the vertices of one language) and a bilingual similarity function (the weights between two different languages) to set the weight between two vertices for the label propagation. Then we discussed the conclusion of the paper, that label propagation brings more information about the language compared to the approach without propagation.
-
-Several issues related to the paper were also discussed. For example, we debated the various models evaluated in the paper for analysis of the graph-based approach. We couldn't decide whether the EM-HMM is just a special case of the Feature-HMM or not.
-Another issue discussed was the difference between a constraint feature and given tags.
-Only 12 tags were used for tagging in the paper. Hence they have to transform the original tagging of the tagged language to these 12 tags before evaluation. We wondered how the transformation process would work with the 4000 tags of the PDT.
+For computing relatedness in the network, they used visiting probability based on a random walk model. The suggested model is not computationally tractable, so they introduced an approximation algorithm and computed (and proved) an upper bound on its truncation error.
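The idea of truncating the random walk can be shown with a minimal Python sketch. It assumes that visiting probability means the probability that a walk started at one node reaches a target node at least once, here limited to a fixed number of steps. The function name, the dense row-stochastic matrix `P` and the `max_steps` parameter are illustrative assumptions, not the authors' algorithm, and the paper's error bound is not reproduced.

```python
import numpy as np

def truncated_visiting_probability(P, source, target, max_steps):
    """Probability that a walk from `source` hits `target` within max_steps.

    P is assumed to be a row-stochastic transition matrix (rows sum to 1).
    """
    P = np.asarray(P, dtype=float)
    if source == target:
        return 1.0
    mass = np.zeros(P.shape[0])
    mass[source] = 1.0  # the whole walk starts at the source node
    visited = 0.0
    for _ in range(max_steps):
        mass = mass @ P            # advance the walk by one step
        visited += mass[target]    # walks arriving at the target for the first time
        mass[target] = 0.0         # stop those walks so they are not counted twice
    return visited
```

Raising `max_steps` can only increase the returned value, so the truncated result is an underestimate, and the probability mass still walking after the last step limits how large the error can be.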
-At the end of the lesson we spoke about combining the presented method with a small amount of labeled data for the language that wasn't initially labeled at all. Manual labeling of, for example, 100 sentences is relatively easy and inexpensive, and it could bring a substantial improvement.
+They used that model to solve the problem of document clustering over the 20 Newsgroups dataset. They tested different parameter setups to find the best combination. Their optimal setup outperforms previously published results. The proposed approximation algorithms made computation efficient for large graphs (more than one million nodes and 40 million links) and large text clustering datasets (20,000 documents in 20 Newsgroups).
 ===== Conclusion =====
-The paper was well presented and interesting for reading.It brought several interesting topics for the later discussion.
+The paper was well presented and interesting to read. It raised several interesting topics for later discussion.
