Table of Contents
Untangling the Cross-Lingual Link Structure of Wikipedia
Gerard de Melo, Gerhard Weikum
Untangling the Cross-Lingual Link Structure of Wikipedia
in proceedings of ACL 2010
Report by Nathan Green and Lasha Abzianidze
Introduction
The paper has two main goals:
- to find inaccurate connections in Wikipedia's cross-lingual link structure and use their algorithm that attempts to repair these connections;
- to extract some cross language named entity and concept pairs.
For demonstrating the problem, authors give several examples of imprecise or wrong cross-lingual links between Wikipedia articles. In order to detect the inaccurate links, the union of cross-lingual links are modeled as an undirected graph with weighted edges. Authors introduce the notion of Distinctness Assertions - a set of pairwise disjoint sets of the nodes (entities) in the graph. The problem of fixing the wrong links is reduced to the Weighted Distinctness-Based Graph Separation (WDGS).
WDGS problem aims to eliminate inconsistency (distinct nodes should not be connected) in the graph by removing a minimal cost set of edges (form the graph) and nodes (from a distinctness assertion). Unfortunately WDGS problem is NP-hard and the authors use polynomial-time Approximation Algorithm to find the solution.
The approximation algorithm consists of the following procedures:
▫ Finds a solution of a linear program (LP) relaxation of the original problem;
▫ With the help of the solution, constructs the extended graph;
▫ While there is an unsatisfied assertions:
▫ Chooses a node from each unsatisfied assertions and grows the regions around these nodes with the same radius;
▫ Finds the optimal radius and removes the regions from the extended graph.
The algorithm is used on two datasets:
▫ Dataset A - captures cross-lingual links between pages.
▫ Dataset B - Dataset A with redirect-based links.
Before applying the approximation algorithm to large graphs which are sparsely connected and consisting of many smaller, densely connected subgraphs, they use graph partitioning heuristics to decompose larger graphs into smaller parts. In this way they overcome the time complexity.
In the end, output of the algorithm is a clean set of connected components, where each Wikipedia article or category belongs to a connected component consisting of pages about the same entity or concept. These connected components are regarded as equivalence classes. They represent a large-scale multilingual database of named entities and their translations.
Comments
- We discussed the general algorithm and its approach to using distinct sets of categories and its approximation algorithm for weighted distinctness graph separation.
- We discussed the use of bots in wikipedia to populate the links. It was unclear how many of the poor links are bot generated. It was suggested that it would be useful to augment the algorithm if wikipedia indicates in the change log which links came from bots (it was not clear if this was possible).
What do we like about the paper
- The article contributes to Wikipedia - a useful encyclopedia and a valuable source of cross-lingual information.
- The tasks are formalized in neat way, giving formal definitions, theorems and their proofs.
What do we dislike about the paper
- It was unclear in Figure 1 what was simplified. The figure is also directed but does not show it graphically which made it difficult to figure out if we could have a term going to multiple languages.
- The software uses CPLEX which is commercial. This makes it hard to verify the results of the paper.
- The algorithm also seemed to not work on large datasets which would be a problem for most of its intended uses.