Back on TREC

Yes, things have been quiet lately on the TREC Entity homepage. Now that training topics have been made available, I sincerely hope that this is about to change. We are in the process of developing test topics and finalizing the guidelines, so make sure your voice is heard if you want something different…

TREC Entity – draft guidelines available

The first year of the track will investigate the problem of related entity finding. Given the name and homepage of an entity, as well as the type of the target entity, find related entities that are of the target type. Entity types are limited to people, organizations, and products. Participating systems need to return the homepages of related entities and, optionally, a string answer that represents the entity concisely (i.e., the name of the entity).
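To make the input and output of the task concrete, here is a minimal sketch in Python. It is not the official topic or run format (see the draft guidelines for that); the class names, field names, and example.org URLs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# The three entity types named in the task description.
# The exact string labels are an assumption made for this sketch.
ALLOWED_TYPES = {"person", "organization", "product"}


@dataclass(frozen=True)
class Topic:
    """Input: a source entity plus the type of the entities to find."""
    entity_name: str
    entity_homepage: str
    target_type: str

    def __post_init__(self):
        # Reject target types outside the three allowed ones.
        if self.target_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported target type: {self.target_type}")


@dataclass(frozen=True)
class RelatedEntity:
    """Output: the homepage is required, the name string is optional."""
    homepage: str
    name: Optional[str] = None


# Hypothetical topic and answers, not taken from the training topics.
topic = Topic("Example Airline", "http://www.example.org/", "organization")
results = [
    RelatedEntity("http://partner.example.org/", "Example Partner"),
    RelatedEntity("http://other.example.org/"),  # no name supplied
]
```

The only point of the sketch is the shape of the data: one typed input entity goes in, and a ranked list of homepages, each with an optional name, comes out.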

The draft guidelines are available at this page.

Dataset of 1 billion web pages

The ambitious goal set out for TREC 2009 was to have a collection of 1 billion web pages: one dataset that can be shared by several tracks (specifically, the Entity, Million Query, Relevance Feedback, and Web tracks).
In November 2008, when this was discussed at the TREC 2008 conference, people were concerned with two main questions: (1) Is it possible to create such a crawl (given the serious time constraints)? (2) Are we going to be able to handle (at least, index) this amount of data?
Jamie Callan was confident that they (the Language Technologies Institute at Carnegie Mellon University) could build this crawl by March 2009. His confidence was not unfounded, since they had managed to create a crawl of a few hundred million web pages earlier. Yet, the counter for the one billion document collection was to be started from 0 again…
Against this background, let us fast forward to the present. The crawl has recently completed and the dataset, referred to as ClueWeb09, is now available. It is 25 terabytes uncompressed (5 terabytes compressed), which brings me back to the troubling question: are we going to be able to handle that? We (being ILPS) will certainly do our best to step up to the challenge. I shall post about our attempts in detail later on.
However, doing retrieval on 1 billion documents is too big a bite for many research groups, as it calls for nontrivial software and hardware architecture (note that it has 40 times as many documents as the Gov2 corpus, which, with its 25 million documents, I believe was the largest web crawl available to the research community so far). Therefore, a “Category B” subset of the collection is also available, consisting of “only” 50 million English pages. Some of the tracks (the Entity track for sure) will use only the Category B subset in 2009.

500+ thesis downloads

My thesis hit a significant milestone last week as it crossed the 500 download mark. It took less than 8 months to get there since it was made available online in July 2008.

The first release of the implementation of the models introduced in the thesis, alias EARS (Entity and Association Retrieval System), is expected to arrive before the end of this month.

Language Modeling Overview

The boom of language modeling (LM) approaches to information retrieval started in 1998, with Ponte and Croft’s SIGIR’98 paper (which, btw, is close to reaching a milestone of 1000 citations according to Google Scholar). At about the same time, and apparently independently of Ponte and Croft’s work, Hiemstra and Kraaij and Miller et al. proposed the same idea of scoring documents by query likelihood.
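For readers new to the idea, query-likelihood scoring can be sketched in a few lines of Python. This is only an illustration, not the formulation of any of the papers mentioned above: it assumes a unigram model with Jelinek-Mercer (linear interpolation) smoothing, and the function name, the `lam` parameter, and the toy documents are all made up for this example.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Return log P(query | doc) under a smoothed unigram document model.

    query, doc, and collection are lists of tokens. The smoothing method
    (Jelinek-Mercer with weight lam on the collection model) is an
    assumption of this sketch.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0
        p_coll = coll_tf[term] / len(collection) if collection else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection of two documents, invented for illustration.
docs = {
    "d1": "entity retrieval finds related entities on the web".split(),
    "d2": "language models score documents by query likelihood".split(),
}
collection = [term for doc in docs.values() for term in doc]
query = "query likelihood".split()

# Rank documents by decreasing query log-likelihood.
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], collection),
    reverse=True,
)
```

Here `d2` ranks first because it contains both query terms, while `d1` gets probability mass for them only through the collection model.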

The last decade has witnessed tremendous progress in the use and development of LM techniques. Language models are attractive because of their strong foundations in statistical theory and their superior empirical performance. Further, they provide a principled way of modeling various special retrieval tasks—expert finding is a prominent example of that.

The latest issue of Foundations and Trends in Information Retrieval features an excellent article, Statistical Language Models for Information Retrieval: A Critical Review, by ChengXiang Zhai. It is a great survey that covers a wide spectrum of the work on LMs, with many useful references for further reading. In summary, this paper is highly recommended both for experts in language modeling and for newcomers to the field.