====== Untangling the Cross-Lingual Link Structure of Wikipedia ======
**Gerard de Melo, Gerhard Weikum**
{{http://
In Proceedings of ACL 2010

Report by **Nathan Green** and **Lasha Abzianidze**

===== Introduction =====

The paper has two main goals:
* to find inaccurate connections in Wikipedia's cross-lingual link structure;
* to extract cross-language named entity translations.

To demonstrate this, the authors introduce the Weighted Distinctness Graph Separation (WDGS) problem. The WDGS problem aims to eliminate inconsistency in the graph (distinct nodes should not be connected) by removing a minimum-cost set of edges (from the graph) and nodes (from a distinctness assertion). Unfortunately, WDGS is NP-hard, so the authors use a polynomial-time approximation algorithm to find a solution.
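
As a toy illustration of the problem statement (not the authors' actual method), the following brute-force sketch finds a minimum-cost edge cut for a single distinctness assertion; the page names, costs, and graph are invented for the example, and node removals from assertions are omitted for brevity.

<code python>
from itertools import combinations

# Toy WDGS instance: undirected edges with removal costs, plus one
# distinctness assertion saying two pages denote different entities
# and must not end up in the same connected component.
# (All names and costs here are invented for illustration.)
edges = {
    ("en:Berlin", "de:Berlin"): 1.0,
    ("de:Berlin", "fr:Berlin"): 1.0,
    ("fr:Berlin", "en:Berlin_(band)"): 0.5,   # a suspicious cheap link
    ("en:Berlin_(band)", "de:Berlin_(Band)"): 1.0,
}
assertion = ("en:Berlin", "en:Berlin_(band)")

def connected(u, v, kept):
    """True if v is reachable from u over the kept edges (BFS)."""
    adj = {}
    for a, b in kept:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    stack, seen = [u], {u}
    while stack:
        n = stack.pop()
        if n == v:
            return True
        for m in adj.get(n, []):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return False

# Enumerate all edge subsets and keep the cheapest one whose removal
# separates the asserted pair (exponential -- only for toy instances).
best_cost, best_cut = float("inf"), None
for r in range(len(edges) + 1):
    for cut in combinations(edges, r):
        cost = sum(edges[e] for e in cut)
        kept = [e for e in edges if e not in cut]
        if cost < best_cost and not connected(*assertion, kept):
            best_cost, best_cut = cost, cut

print(best_cost, best_cut)   # 0.5 -- cutting the suspicious link suffices
</code>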

The approximation algorithm consists of the following steps (a simplified sketch of the region-growing step follows the list):
* find a solution of a linear program (LP) relaxation of the original problem;
* using this solution, construct the extended graph;
* while there are unsatisfied assertions:
  * choose a node from each unsatisfied assertion and grow regions around these nodes, all with the same radius;
  * find the optimal radius and remove the grown regions from the extended graph.
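
To make the region growing concrete, here is a hedged sketch: in the paper the distances and the radius are derived from the LP solution, whereas below the edge weights act directly as lengths and the radius is a hypothetical fixed value.

<code python>
import heapq

def grow_region(adj, source, radius):
    """Dijkstra ball of the given radius around source; edge weights
    act as lengths (in the paper they come from the LP solution)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd <= radius and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return set(dist)

def cut_region(adj, region):
    """Delete the region from the graph and return the boundary edges
    that were cut; these edges form part of the separator."""
    cut = [(u, v) for u in region for v, _ in adj.get(u, []) if v not in region]
    for u in region:
        adj.pop(u, None)
    for u in adj:
        adj[u] = [(v, w) for v, w in adj[u] if v not in region]
    return cut

# Toy usage: separate 'a' and 'd' on the weighted path a-b-c-d.
adj = {"a": [("b", 1.0)],
       "b": [("a", 1.0), ("c", 1.0)],
       "c": [("b", 1.0), ("d", 0.5)],
       "d": [("c", 0.5)]}
region = grow_region(adj, "a", radius=1.5)   # {'a', 'b'} for this radius
print(cut_region(adj, region))               # [('b', 'c')] is removed
</code>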

The algorithm is applied to two datasets:
* Dataset A captures cross-lingual links between pages.
* Dataset B extends Dataset A with redirect-based links.

The large input graphs are sparsely connected and consist of many smaller, densely connected subgraphs. Before applying the approximation algorithm, the authors therefore use graph-partitioning heuristics to decompose the larger graphs into smaller parts, which keeps the running time manageable.
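
The paper's partitioning heuristics are more elaborate, but the basic idea can be sketched as the trivial first step of splitting the sparse graph into its connected components and solving each part independently.

<code python>
def connected_components(adj):
    """Split an undirected graph (adjacency dict: node -> neighbour list)
    into its connected components, each solvable independently."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.add(u)
            stack.extend(adj.get(u, []))
        components.append(comp)
    return components

# Two independent subgraphs can be processed separately:
print(connected_components({"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]}))
</code>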

In the end, the output of the algorithm is a clean set of connected components, where each Wikipedia article or category belongs to a connected component consisting of pages about the same entity or concept. These connected components are regarded as equivalence classes. They represent a large-scale multilingual database of named entities and their translations.
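
To illustrate this last step, here is a minimal union-find sketch (the links and page names are invented): once the inconsistent links are removed, the remaining links induce the equivalence classes.

<code python>
from collections import defaultdict

class UnionFind:
    """Minimal union-find merging pages connected by the cleaned links."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Hypothetical cleaned cross-lingual links.
links = [("en:Prague", "de:Prag"), ("de:Prag", "cs:Praha")]
uf = UnionFind()
for a, b in links:
    uf.union(a, b)

classes = defaultdict(set)
for page in list(uf.parent):
    classes[uf.find(page)].add(page)
print(list(classes.values()))   # one multilingual class per entity
</code>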

===== Comments =====
* We discussed the general algorithm, its approach to using distinct sets of categories, and its approximation algorithm for weighted distinctness graph separation.
* We discussed the use of bots in Wikipedia to populate the links. It was unclear how many of the poor links are bot-generated. It was suggested that the algorithm could be usefully augmented if Wikipedia indicated in the change log which links came from bots (it was not clear whether this is possible).

===== What do we like about the paper =====
* The article contributes to Wikipedia, a useful encyclopedia and a valuable source of cross-lingual information.
* The tasks are formalized in a neat way, giving formal definitions.

===== What do we dislike about the paper =====
* It was unclear in Figure 1 what was simplified. The figure is also directed but does not show this graphically, which made it difficult to figure out whether a term could point to multiple languages.
* The software uses CPLEX, which is commercial. This makes it hard to verify the results of the paper.
* The algorithm also did not seem to work on large datasets, which would be a problem for most of its intended uses.