Table of Contents

Untangling the Cross-Lingual Link Structure of Wikipedia

Gerard de Melo, Gerhard Weikum
Untangling the Cross-Lingual Link Structure of Wikipedia
in proceedings of ACL 2010

Report by Nathan Green and Lasha Abzianidze

Introduction

The paper has two main goals:

For demonstrating the problem, authors give several examples of imprecise or wrong cross-lingual links between Wikipedia articles. In order to detect the inaccurate links, the union of cross-lingual links are modeled as an undirected graph with weighted edges. Authors introduce the notion of Distinctness Assertions - a set of pairwise disjoint sets of the nodes (entities) in the graph. The problem of fixing the wrong links is reduced to the Weighted Distinctness-Based Graph Separation (WDGS).

WDGS problem aims to eliminate inconsistency (distinct nodes should not be connected) in the graph by removing a minimal cost set of edges (form the graph) and nodes (from a distinctness assertion). Unfortunately WDGS problem is NP-hard and the authors use polynomial-time Approximation Algorithm to find the solution.

The approximation algorithm consists of the following procedures:
  ▫ Finds a solution of a linear program (LP) relaxation of the original problem;
  ▫ With the help of the solution, constructs the extended graph;
  ▫ While there is an unsatisfied assertions:
      ▫ Chooses a node from each unsatisfied assertions and grows the regions around these nodes with the same radius;
      ▫ Finds the optimal radius and removes the regions from the extended graph.

The algorithm is used on two datasets:
  ▫ Dataset A - captures cross-lingual links between pages.
  ▫ Dataset B - Dataset A with redirect-based links.

Before applying the approximation algorithm to large graphs which are sparsely connected and consisting of many smaller, densely connected subgraphs, they use graph partitioning heuristics to decompose larger graphs into smaller parts. In this way they overcome the time complexity.

In the end, output of the algorithm is a clean set of connected components, where each Wikipedia article or category belongs to a connected component consisting of pages about the same entity or concept. These connected components are regarded as equivalence classes. They represent a large-scale multilingual database of named entities and their translations.

Comments

What do we like about the paper

What do we dislike about the paper