Language Modeling Overview

The boom of language modeling (LM) approaches to information retrieval started in 1998, with Ponte and Croft’s SIGIR’98 paper (which, btw, is near to reaching a milestone of 1000 citations according to Google scholar). At about the same time, and apparently independent of Ponte and Croft’s work, Hiemstra and Kraaij and Miller et. al. proposed the same idea of scoring documents by query-likelihood.

The last decade has witnessed tremendous progress in the use and development of LM techniques. Language models are attractive because of their strong foundations in statistical theory and their superior empirical performance. Further, they provide a principled way of modeling various special retrieval tasks—expert finding is a prominent example of that.

The latest issue of Foundations and Trends in Information Retrieval is featuring an excellent article Statistical Language Models for Information Retrieval: A Critical Review, by ChengXiang Zhai. It is a great survey that covers a wide spectrum of the work on LMs, with many useful references for further reading. In summary, this paper is highly recommended both for experts in language modeling and for newcomers to the field.

Leave a Comment