
Institute of Formal and Applied Linguistics Wiki



Using a Wikipedia-based Semantic Relatedness Measure for Document Clustering

written by Majid Yazdani (Idiap Research Institute and EPFL) and Andrei Popescu-Belis (Idiap Research Institute)

spoken by Marie Konárová

reported by Martin Majlis


On Monday, November 14th 2011, Marie spoke about a paper on using information from Wikipedia to improve document clustering.


The presented paper focuses on document clustering using a random walk model. In the first phase, the authors built a document network from Wikipedia (the WEX dataset, Metaweb Technologies, 2010), excluding articles from the following categories: Talk, File, Image, Template, Category, Portal, and List. Short articles (fewer than 100 non-stop words) and disambiguation pages were also removed, leaving 1,264,611 articles, which form the nodes of the network.
They used two types of links between nodes: hyperlinks (each document is connected to the documents it refers to) and content-similarity links (each document is connected to the 10 articles most similar to it).
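The filtering and linking steps above can be sketched roughly as follows; the function names, data structures, and stop-word handling are illustrative assumptions, not the authors' implementation:

```python
from collections import defaultdict

EXCLUDED_PREFIXES = {"Talk", "File", "Image", "Template", "Category", "Portal", "List"}
MIN_CONTENT_WORDS = 100  # articles with fewer non-stop words are dropped

def filter_articles(articles, stopwords):
    """Keep only articles outside the excluded categories with enough content words."""
    kept = {}
    for title, text in articles.items():
        prefix = title.split(":", 1)[0] if ":" in title else ""
        if prefix in EXCLUDED_PREFIXES:
            continue
        content_words = [w for w in text.split() if w.lower() not in stopwords]
        if len(content_words) < MIN_CONTENT_WORDS:
            continue
        kept[title] = text
    return kept

def build_network(kept, hyperlinks, top_similar):
    """Merge hyperlink edges and content-similarity edges into one adjacency map."""
    graph = defaultdict(set)
    for src, targets in hyperlinks.items():
        if src in kept:
            # hyperlink edges: connect a document to the documents it refers to
            graph[src].update(t for t in targets if t in kept)
    for doc, similar in top_similar.items():
        if doc in kept:
            # similarity edges: connect a document to its 10 most similar articles
            graph[doc].update(s for s in similar[:10] if s in kept)
    return graph
```

In a real run, `top_similar` would come from a content-similarity search over the whole collection (e.g. TF-IDF cosine similarity); here it is simply passed in.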

To compute relatedness in the network, they used the visiting probability of a random walk. The exact computation is not tractable at this scale, so they introduced an approximation algorithm and proved an upper bound on its truncation error.
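A much-simplified truncated computation of visiting probability on a small transition matrix might look like the sketch below; the paper's actual approximation algorithm and its error bound are not reproduced here:

```python
import numpy as np

def visiting_probability(P, start, target, max_steps=10):
    """Truncated visiting probability: the chance that a random walk starting
    at `start` visits `target` at least once within `max_steps` steps.
    P is a row-stochastic transition matrix."""
    n = P.shape[0]
    Q = P.copy()
    Q[target, :] = 0.0  # make the target absorbing: a walk stops on arrival
    dist = np.zeros(n)
    dist[start] = 1.0
    prob = dist[target]  # nonzero only if start == target
    for _ in range(max_steps):
        dist = dist @ Q
        prob += dist[target]  # mass arriving at the target for the first time
    return prob
```

Because the walk is truncated at `max_steps`, the result underestimates the true visiting probability; the probability mass still circulating in `dist` gives a handle on the truncation error, which is the kind of bound the authors prove for their algorithm.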

They applied the model to document clustering on the 20 Newsgroups dataset, testing different parameter settings to find the best combination. Their optimal setup outperforms previously published results. The proposed approximation algorithm makes the computation efficient for large graphs (more than one million nodes and 40 million links) and large text-clustering datasets (20,000 documents in 20 Newsgroups).
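Once pairwise relatedness scores are available, the clustering step can be sketched with, for example, a toy k-medoids over the relatedness matrix; this is an assumed stand-in for illustration, not the clustering method evaluated in the paper:

```python
import numpy as np

def k_medoids(relatedness, k, iters=20):
    """Toy k-medoids over a symmetric relatedness matrix (higher = more related)."""
    # farthest-first seeding: start at doc 0, then repeatedly add the doc
    # least related to the medoids chosen so far
    medoids = [0]
    for _ in range(1, k):
        rel = relatedness[:, medoids].max(axis=1)
        rel[medoids] = np.inf
        medoids.append(int(np.argmin(rel)))
    medoids = np.array(medoids)
    for _ in range(iters):
        # assign each document to its most related medoid
        labels = np.argmax(relatedness[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size:
                # new medoid: the member most related to the rest of its cluster
                scores = relatedness[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmax(scores)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmax(relatedness[:, medoids], axis=1)
```

For the 20 Newsgroups setting described above, the matrix would hold the (approximated) visiting-probability-based relatedness between the 20,000 documents.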


The paper was well presented and interesting to read, and it raised several topics for the subsequent discussion.
