The ambitious goal set out for TREC 2009 was to create a collection of 1 billion web pages: a single dataset that can be shared by several tracks (specifically, the Entity, Million Query, Relevance Feedback, and Web tracks).
In November 2008, when this was discussed at the TREC 2008 conference, people were concerned with two main questions: (1) Is it possible to create such a crawl, given the serious time constraints? (2) Are we going to be able to handle (at the very least, index) this amount of data?
Jamie Callan was confident that they (the Language Technologies Institute at Carnegie Mellon University) could build this crawl by March 2009. His confidence was not unfounded, since they had already managed to create a crawl of a few hundred million web pages earlier. Yet, the counter for the one-billion-document collection had to start from zero again…
Against this background, let us fast-forward to the present. The crawl has recently been completed, and the dataset, referred to as ClueWeb09, is now available. It is 25 terabytes uncompressed (5 terabytes compressed), which brings me back to the troubling question: are we going to be able to handle that? We (being ILPS) will certainly do our best to step up to the challenge. I shall post about our attempts in detail later on.
But it is a fact that doing retrieval on 1 billion documents is too big a bite for many research groups, as it calls for nontrivial software and hardware architecture (note that it is 40 times more data than the Gov2 corpus, which, with its 25 million documents, I believe was the largest web crawl available to the research community until now). Therefore, a “Category B” subset of the collection is also available, consisting of “only” 50 million English pages. Some of the tracks (the Entity track for sure) will use only the Category B subset in 2009.
500+ thesis downloads
My thesis hit a significant milestone last week as it crossed the 500-download mark. It took less than 8 months since it was made available online in July 2008 to reach this.
The first release of the implementation of the models introduced in the thesis, alias EARS (Entity and Association Retrieval System), is expected to arrive before the end of this month.
Language Modeling Overview
The boom of language modeling (LM) approaches to information retrieval started in 1998, with Ponte and Croft’s SIGIR’98 paper (which, btw, is close to reaching a milestone of 1000 citations according to Google Scholar). At about the same time, and apparently independently of Ponte and Croft’s work, Hiemstra and Kraaij, and Miller et al. proposed the same idea of scoring documents by query likelihood.
The last decade has witnessed tremendous progress in the use and development of LM techniques. Language models are attractive because of their strong foundations in statistical theory and their superior empirical performance. Further, they provide a principled way of modeling various special retrieval tasks—expert finding is a prominent example of that.
The latest issue of Foundations and Trends in Information Retrieval features an excellent article, Statistical Language Models for Information Retrieval: A Critical Review, by ChengXiang Zhai. It is a great survey that covers a wide spectrum of the work on LMs, with many useful references for further reading. In summary, this paper is highly recommended both for experts in language modeling and for newcomers to the field.
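To make the query-likelihood idea concrete, here is a minimal sketch (my own illustration, not code from any of the papers mentioned above): each document is scored by the probability its language model assigns to the query, smoothed against a collection-wide model. I use Jelinek-Mercer smoothing for simplicity; the function name and the smoothing weight `lam` are illustrative choices.

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """Score a document by smoothed query likelihood (Jelinek-Mercer).

    log P(q|d) = sum over query terms t of
        log( lam * P(t|d) + (1 - lam) * P(t|collection) )
    """
    doc_tf = Counter(doc_terms)
    coll_tf = Counter(collection_terms)
    doc_len = len(doc_terms)
    coll_len = len(collection_terms)
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf[t] / doc_len if doc_len else 0.0
        p_coll = coll_tf[t] / coll_len if coll_len else 0.0
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # Term unseen in both document and collection: zero probability.
            return float("-inf")
        score += math.log(p)
    return score
```

In practice, Dirichlet-prior smoothing is a popular alternative to Jelinek-Mercer, and the Zhai article discusses the trade-offs between the two in detail.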
SAW 2009 deadline extension
The submission deadline for the 3rd Workshop on Social Aspects of the Web (SAW 2009) has been extended to Feb 17, 2009.
The workshop accepts submissions of long papers (max. 12 pages), work-in-progress reports (max. 6 pages), and demo papers (max. 4 pages).
A Late “Happy New Year!”
Never too late for a happy new year…
I was pretending to be on vacation (while, in fact, working on some interesting proposal), but now I’m officially back in business.
I wanted my first 2009 post to be on “looking back on 2008”, but I had to face reality and realize that writing that summary might be too hard and definitely too time-consuming.
Nevertheless, I still wanted to summarize my scientific output somehow, and then I came across a great website called QuadSearch. It ranks your publications by citation count and calculates statistics and research impact indices, such as the h-index and g-index. The coverage is not perfect, but it is pretty decent, as far as I can tell.
And the numbers are…
H-INDEX (Hirsch Number): 8
Egghe’s G-INDEX: 13
Maximum Cites: 74
Total Cites: 214, Total Articles: 34
Cites/Paper: 6.2941
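For the curious, both indices are easy to compute yourself from a list of per-paper citation counts; here is a quick sketch (function names are my own):

```python
def h_index(citations):
    """Hirsch's h-index: the largest h such that h papers each have >= h citations."""
    cites = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(cites, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

def g_index(citations):
    """Egghe's g-index: the largest g such that the top g papers together have >= g^2 citations."""
    cites = sorted(citations, reverse=True)
    running_total, g = 0, 0
    for rank, c in enumerate(cites, start=1):
        running_total += c
        if running_total >= rank * rank:
            g = rank
    return g
```

Since the g-index counts cumulative citations of the top papers, it is always at least as large as the h-index for the same list, which is consistent with the 8 versus 13 above.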

The top 5 papers from this chart are:
- Formal models for expert finding in enterprise corpora; SIGIR 2006 (Cited by 74)
- Finding experts and their details in e-mail corpora; WWW 2006 (Cited by 27)
- Language Modeling Approaches for Enterprise Tasks; TREC 2005 (Cited by 16)
- Why are they excited? Identifying and explaining spikes in blog mood level; EACL 2006 (Cited by 13)
- Broad expertise retrieval in sparse data environments; SIGIR 2007 (Cited by 13)
Let’s see how much these numbers improve in 2009 :)