Language Modeling Overview

The boom of language modeling (LM) approaches to information retrieval started in 1998, with Ponte and Croft’s SIGIR’98 paper (which, by the way, is close to reaching the milestone of 1,000 citations according to Google Scholar). At about the same time, and apparently independently of Ponte and Croft’s work, Hiemstra and Kraaij, and Miller et al., proposed the same idea of scoring documents by query likelihood.
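
For readers new to the idea, query-likelihood scoring can be sketched roughly as follows: each document is treated as a unigram language model, and documents are ranked by the (log) probability that their model generates the query. This is only a minimal illustration, not the exact estimator of any of the papers mentioned above. The linear interpolation with the collection model, the weight `lam`, and the function names are all assumptions made for the sake of the example.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, coll_tf, coll_len, lam=0.5):
    """Return log P(query | doc) under a smoothed unigram model.

    Assumption: the document model is linearly interpolated with the
    collection model, using a hypothetical smoothing weight `lam`.
    """
    doc_tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_coll = coll_tf[term] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


def rank_documents(query, docs, lam=0.5):
    """Rank document ids by query likelihood, best first.

    `docs` maps a document id to its list of tokens.
    """
    coll_tf = Counter(t for tokens in docs.values() for t in tokens)
    coll_len = sum(coll_tf.values())
    scores = {
        doc_id: query_log_likelihood(query, tokens, coll_tf, coll_len, lam)
        for doc_id, tokens in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Smoothing is what keeps a document from getting zero probability just because it misses one query term; the particular smoothing method and its parameter are choices that would need tuning on a real collection.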

The last decade has witnessed tremendous progress in the use and development of LM techniques. Language models are attractive because of their strong foundations in statistical theory and their superior empirical performance. Further, they provide a principled way of modeling various special retrieval tasks—expert finding is a prominent example of that.

The latest issue of Foundations and Trends in Information Retrieval features an excellent article, Statistical Language Models for Information Retrieval: A Critical Review, by ChengXiang Zhai. It is a great survey that covers a wide spectrum of the work on LMs, with many useful references for further reading. In short, the article is highly recommended both for experts in language modeling and for newcomers to the field.

Thesis resources #1: CSIRO candidates and associations

As promised before, it’s now time to start sharing some resources that I obtained during my thesis work. This first release contains two CSIRO-related items: the list of CSIRO candidates (e-mail addresses) and a list of document-candidate associations.
I was keen to make these available before the submission deadline for the Expert Search runs at the TREC 2008 Enterprise track. These lists are, of course, far from perfect, but they worked quite well for me. If you have comments, suggestions, improved versions, etc., feel free to contact me!
The files are available at the same place as the CERC collection (so you’ll need the same username and password): http://es.csiro.au/cerc/data/balog. Thanks to Paul Thomas for arranging the hosting!

Thesis approved

I am happy to announce that my PhD thesis titled People Search in the Enterprise has been approved by the committee. The public PhD defense will take place on the 30th of September, 2008.
The final version of the thesis is planned to be made available online in early July 2008.

Thesis completed

I am happy to announce that my thesis titled People Search in the Enterprise has been completed and submitted to the committee.

The thesis focuses on two main expertise retrieval tasks: (1) expert finding, i.e., identifying a list of people who are knowledgeable about a given topic (“Who are the experts on topic X?”), and (2) expert profiling, i.e., returning a list of topics that a person is knowledgeable about (“What topics does person Y know about?”). In the thesis, expertise retrieval is approached as an association finding task between people and topics.

The main contribution of the thesis is a generative probabilistic modeling framework for capturing the expert finding and profiling tasks in a uniform way. On top of this general framework two main families of models are introduced, by adapting generative language modeling techniques for document retrieval in a transparent and theoretically sound way.
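
To give a feel for what "association finding" with language models can look like, here is a rough sketch of one possible document-centric formulation: a candidate is scored by summing, over the documents associated with that person, the query likelihood of each document weighted by the strength of the document-candidate association. This is an illustration under stated assumptions and not necessarily the formulation used in the thesis. The function names, the smoothing weight `lam`, and the normalization of association weights are all hypothetical.

```python
from collections import Counter


def p_query_given_doc(query, doc, coll_tf, coll_len, lam=0.5):
    """P(query | doc) under a unigram model smoothed with the collection.

    Assumption: linear interpolation with a hypothetical weight `lam`.
    """
    doc_tf = Counter(doc)
    doc_len = len(doc)
    p = 1.0
    for term in query:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_coll = coll_tf[term] / coll_len if coll_len else 0.0
        p *= (1 - lam) * p_doc + lam * p_coll
    return p


def rank_candidates(query, docs, associations, lam=0.5):
    """Rank candidates by sum_d P(query | d) * P(d | candidate), best first.

    `docs` maps a document id to its tokens. `associations` maps a
    candidate to {document id: association weight}; the weights are
    normalized per candidate here (an assumption of this sketch).
    """
    coll_tf = Counter(t for tokens in docs.values() for t in tokens)
    coll_len = sum(coll_tf.values())
    scores = {}
    for candidate, weights in associations.items():
        total = sum(weights.values())
        if total == 0:
            scores[candidate] = 0.0
            continue
        scores[candidate] = sum(
            p_query_given_doc(query, docs[doc_id], coll_tf, coll_len, lam)
            * (weight / total)
            for doc_id, weight in weights.items()
        )
    return sorted(scores, key=scores.get, reverse=True)
```

Expert profiling can be read off the same kind of score by fixing the person and ranking topics instead of fixing the topic and ranking people.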

Throughout the thesis we evaluate and compare these baseline models across different organizational settings, and perform a systematic exploration and analysis of the experimental results obtained. We show that our baseline models are robust yet deliver very competitive performance.

Through a series of examples we demonstrate that our generic models are able to incorporate and exploit special characteristics and features of test collections and/or the organizational settings that they represent. Additionally, we address a number of related tasks, including finding similar experts, mining contact details of people, and enterprise document search.

Finally, we provide further examples that illustrate the generic nature of our baseline models and apply them to find associations between topics and entities other than people.

Assuming that the committee’s answer is affirmative, the thesis is going to be printed in early June 2008.

Happy new year & welcome back

I took a little break from work so I could celebrate Christmas, spend time with the family, etc. I am back online now, and ready to commit myself to full-time thesis writing for the upcoming several weeks.

As to expert search material, here is a quick update.

  • Our recent paper (with Maarten de Rijke) titled Associating People and Documents has been accepted to ECIR 2008. Common to most expertise search approaches is a component that estimates the strength of the association between a document and a person. In this paper we perform a careful analysis and investigation of how different association methods contribute to performance. The camera-ready version of the paper will be available from the Publications page after January 11.
  • We (Maarten, Leif Azzopardi, and I) submitted a paper titled A Language Modeling Framework for Expertise Search to the Information Processing and Management (IPM) journal. In this paper we introduce our language modeling approaches to expertise search in detail, and integrate these into a generative probabilistic framework. Since it is a journal submission rather than a conference paper, it may take some time until it is published.

There is some reading material from CIKM 2007:

Looks like the topic of expertise retrieval is gaining more and more popularity at IR conferences. While browsing the list of accepted papers for ECIR 2008, I found 3 full papers (out of 33) and 1 short paper (out of 19) about expert search, which gives the topic a solid presence.

  • (Serdyukov and Hiemstra)
    Modeling documents as mixtures of persons for expert finding [full]
  • (Balog and de Rijke)
    Associating People and Documents [full]
  • (Macdonald et al.)
    High Quality Expertise Evidence for Expert Search [full]
  • (Macdonald and Ounis)
    Expert Search Evaluation by Supporting Documents [short]