courses:rg:2013:visiterms [ufal wiki]

Having described an input image with SIFT descriptors, why do the authors then cluster the descriptors (using the K-means clustering algorithm, top of p. 834)?
What kind of things, in Feng and Lapata's work, have topics? What do these topics predict?
They define $W_I^\ast$ (eqs. 2, 3, 4, 5, 7, 8) as “the subset of keywords which appropriately describe image I (…)”. However, at the end of this subsection (Image Annotation), they say they take an n-best list of candidates ordered by probability. What is the difference between this n-best list and $W_I^\ast$ as defined in the equations above?
In the last paragraph of section 5, they argue against using (7) for the text illustration task, as opposed to image annotation. How, then, did they obtain their image annotation results?
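
As background for the first question, the mechanics of turning local descriptors into discrete visual terms can be sketched as follows. This is only a toy illustration, not the authors' implementation: 2-D vectors stand in for 128-dimensional SIFT descriptors, the function names (`kmeans`, `visual_term_counts`) are made up for this page, and the centroids are initialised from the first k points purely to keep the run reproducible.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; initial centroids are the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # leave a centroid unchanged if its cluster is empty
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


def visual_term_counts(descriptors, centroids):
    """Map each descriptor to its nearest centroid id and count occurrences."""
    return Counter(nearest(d, centroids) for d in descriptors)


# Toy "descriptors" pooled from a training collection (hypothetical data).
training = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
centroids = kmeans(training, k=2)

# Descriptors of one new image, expressed as counts over the centroid ids.
image = [(0.05, 0.05), (5.1, 5.0), (0.15, 0.1)]
counts = visual_term_counts(image, centroids)
```

After this step every image is a bag of counts over a finite vocabulary of cluster ids, which has the same form as a bag of words over a text vocabulary.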
