Language Modeling Overview

The boom of language modeling (LM) approaches to information retrieval started in 1998 with Ponte and Croft’s SIGIR’98 paper (which, by the way, is close to reaching the milestone of 1000 citations according to Google Scholar). At about the same time, and apparently independently of Ponte and Croft’s work, Hiemstra and Kraaij, as well as Miller et al., proposed the same idea of scoring documents by query likelihood.
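
For readers new to the idea, here is a minimal sketch of query-likelihood scoring: each document is ranked by the probability that its unigram language model generates the query. The smoothing method (Jelinek-Mercer interpolation with the collection model), the weight `lam`, the function name, and the toy documents are all my own assumptions for illustration; none of them is taken from the papers mentioned above.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Return log P(query | doc) under a smoothed unigram model.

    Assumption: Jelinek-Mercer smoothing, i.e. a linear mix of the
    document model and the collection model with weight `lam`.
    """
    doc_tf = Counter(doc)
    col_tf = Counter(collection)
    doc_len = len(doc)
    col_len = len(collection)

    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_col = col_tf[term] / col_len if col_len else 0.0
        p = (1.0 - lam) * p_doc + lam * p_col
        if p == 0.0:
            # Term never occurs in the collection: the query is impossible.
            return float("-inf")
        score += math.log(p)
    return score


# Hypothetical toy collection, purely for illustration.
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "expert finding in enterprise corpora".split(),
}
collection = [term for doc in docs.values() for term in doc]
query = ["language", "retrieval"]

# Rank document ids by descending query log-likelihood.
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], collection),
    reverse=True,
)
```

Smoothing is what keeps a document from getting a zero score just because it misses one query term; the collection model fills in for unseen words.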

The last decade has witnessed tremendous progress in the use and development of LM techniques. Language models are attractive because of their strong foundations in statistical theory and their superior empirical performance. Further, they provide a principled way of modeling various special retrieval tasks; expert finding is a prominent example.

The latest issue of Foundations and Trends in Information Retrieval features an excellent article, “Statistical Language Models for Information Retrieval: A Critical Review,” by ChengXiang Zhai. It is a great survey that covers a wide spectrum of the work on LMs, with many useful references for further reading. In summary, this paper is highly recommended both for experts in language modeling and for newcomers to the field.

A Late “Happy New Year!”

Never too late for a happy new year…
I was pretending to be on vacation (while, in fact, working on an interesting proposal), but now I’m officially back in business.

I wanted my first 2009 post to be on “looking back on 2008”, but I had to face reality and realize that writing that summary might be too hard and definitely too time-consuming.

Nevertheless, I still wanted to summarize my scientific output somehow, and then I came across a great website called QuadSearch. It ranks your publications by citation count and calculates statistics and research impact indexes, such as the H-index and G-index. The coverage is not perfect, but it is pretty decent, as far as I can tell.

And the numbers are…

H-INDEX (Hirsch Number): 8
Egghe’s G-INDEX: 13
Maximum Cites: 74
Total Cites: 214, Total Articles: 34
Cites/Paper: 6.2941
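
For the curious, these indexes are easy to compute yourself. Below is a small sketch using the standard definitions: the H-index is the largest h such that h papers have at least h citations each, and the G-index is the largest g such that the g most-cited papers together have at least g² citations. I don’t know how QuadSearch computes them internally, so the function names and the sample list are my own assumptions; only the totals 214 and 34 come from the numbers above.

```python
def h_index(cites):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(cites, reverse=True)
    # With counts sorted in descending order, the condition holds for
    # ranks 1..h and fails afterwards, so counting the hits gives h.
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)


def g_index(cites):
    """Largest g such that the top g papers total at least g*g citations.

    Assumption: g is capped at the number of papers, which is the
    most common convention.
    """
    ranked = sorted(cites, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g


# Hypothetical citation counts, not my actual publication list.
sample = [10, 8, 5, 4, 3]
sample_h = h_index(sample)
sample_g = g_index(sample)

# Cites/Paper from the totals reported above.
cites_per_paper = round(214 / 34, 4)
```

Note that the G-index can never be smaller than the H-index, since it gives extra credit to a few highly cited papers.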


The top 5 papers from this chart are:

  1. Formal models for expert finding in enterprise corpora; SIGIR 2006 (Cited by 74)
  2. Finding experts and their details in e-mail corpora; WWW 2006 (Cited by 27)
  3. Language Modeling Approaches for Enterprise Tasks; TREC 2005 (Cited by 16)
  4. Why are they excited? identifying and explaining spikes in blog mood level; EACL 2006 (Cited by 13)
  5. Broad expertise retrieval in sparse data environments; SIGIR 2007 (Cited by 13)

Let’s see how much these numbers improve in 2009 :)