Future research directions in IR

Wondering what your next IR conference paper should be about? This is the billion-dollar question (well, at least for IR researchers), and I surely won’t answer it for you. But here is a hint.
(I’ve just come across this on Facebook, thanks to Arjen P. De Vries and Claudia Hauff. This is evidence that, if you cut through all the clutter, FB can indeed be a great tool for finding serendipitous information. Maybe this is also something to think about…)
The list contains nominated papers from prominent IR researchers “that, in their opinion, represent important new directions, research areas, or results in the IR field.”
I must say I thoroughly enjoyed reading it. And yes, it does make me feel good to see our ECIR paper from last year, written with Elena Smirnova, on the list :)

EARS released

After a period of development I am ready to release EARS to the world. EARS is an open source toolkit for entity-oriented search and discovery in large text collections. The association finding framework and models implemented in EARS were originally developed for expertise retrieval in an organizational setting, during my PhD studies. These models are robust and generic, and can be applied to finding associations between topics and entities other than people.

At present, EARS supports two main tasks: finding entities (“Which entities are associated with topic X?”) and profiling entities (“What topics is an entity associated with?”). It implements two baseline search strategies for accomplishing these tasks; these became popularly known as “Model 1” and “Model 2”.
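To make the two strategies a bit more concrete, here is a minimal Python sketch of the entity-finding task, following the way Model 1 and Model 2 are usually described in the expert finding literature: Model 1 first builds a term distribution for the entity from its associated documents and then scores the query against it, while Model 2 scores each document against the query and then sums over the documents associated with the entity. This is not EARS code. The toy documents, the association weights, the smoothing value `LAMBDA`, the normalization of association weights, and all function names are assumptions made for illustration.

```python
from collections import Counter

# Toy collection and document-entity association weights (hypothetical data).
docs = {
    "d1": "language models for expert finding".split(),
    "d2": "entity search in large text collections".split(),
    "d3": "web crawl statistics".split(),
}
assoc = {
    "alice": {"d1": 1.0, "d2": 0.5},
    "bob": {"d3": 1.0},
}
LAMBDA = 0.5  # smoothing weight on the collection model (assumed value)

doc_tf = {d: Counter(tokens) for d, tokens in docs.items()}
coll_tf = Counter(t for tokens in docs.values() for t in tokens)
coll_len = sum(coll_tf.values())


def p_term_doc(term, doc_id):
    """Maximum-likelihood estimate of p(term | document)."""
    return doc_tf[doc_id][term] / len(docs[doc_id])


def p_term_coll(term):
    """Maximum-likelihood estimate of p(term | collection)."""
    return coll_tf[term] / coll_len


def model1(query, entity):
    """Entity-centric: build a smoothed term distribution for the entity, then score."""
    weights = assoc[entity]
    total = sum(weights.values())  # normalizing the weights is an assumption
    score = 1.0
    for term in query:
        p_entity = sum(p_term_doc(term, d) * w / total for d, w in weights.items())
        score *= (1 - LAMBDA) * p_entity + LAMBDA * p_term_coll(term)
    return score


def model2(query, entity):
    """Document-centric: score each associated document, then aggregate."""
    weights = assoc[entity]
    total = sum(weights.values())
    score = 0.0
    for d, w in weights.items():
        p_query_doc = 1.0
        for term in query:
            p_query_doc *= (1 - LAMBDA) * p_term_doc(term, d) + LAMBDA * p_term_coll(term)
        score += p_query_doc * w / total
    return score


def find_entities(query, model):
    """Rank all known entities for a query under the given model."""
    return sorted(assoc, key=lambda e: model(query, e), reverse=True)


if __name__ == "__main__":
    q = "expert finding".split()
    print(find_entities(q, model1))
    print(find_entities(q, model2))
```

On this toy data both models put "alice" ahead of "bob" for the query "expert finding", since only her associated documents mention the query terms. Entity profiling could be sketched the same way by fixing the entity and ranking a set of candidate topics instead.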

A software system will never be finished; EARS is no exception to that rule. It is, however, an active research project with ongoing development and enhancements. A number of new models and features will be included in upcoming releases. Feedback, comments, and suggestions are always welcome.

The toolkit is available at http://code.google.com/p/ears/.

Dataset of 1 billion web pages

The ambitious goal set out for TREC 2009 was to have a collection of 1 billion web pages: one dataset that can be shared by several tracks (specifically, the Entity, Million Query, Relevance Feedback, and Web tracks).
In November 2008, when this was discussed at the TREC 2008 conference, people were concerned with two main questions: (1) Is it possible to create such a crawl (given the serious time constraints)? (2) Are we going to be able to handle (at least, index) this amount of data?
Jamie Callan was confident that they (the Language Technologies Institute at Carnegie Mellon University) could build this crawl by March 2009. His confidence was not unfounded, since they had managed to create a crawl of a few hundred million web pages earlier. Yet, the counter for the one billion documents collection was to be started from 0 again…
Against this background, let us fast forward to the present. The crawl was recently completed and the dataset, referred to as ClueWeb09, is now available. It is 25 terabytes uncompressed (5 terabytes compressed), which brings me back to the troubling question: are we going to be able to handle that? We (being ILPS) will certainly do our best to step up to the challenge. I shall post about our attempts in detail later on.
But it is a fact that doing retrieval on 1 billion documents is too big a bite for many research groups, as it calls for a nontrivial software and hardware architecture (note that it has 40 times as many documents as the Gov2 corpus, which I believe was the largest web crawl available to the research community so far, with its 25 million documents). Therefore, a “Category B” subset of the collection is also available, consisting of “only” 50 million English pages. Some of the tracks (the Entity track for sure) will use only the Category B subset in 2009.

Language Modeling Overview

The boom of language modeling (LM) approaches to information retrieval started in 1998, with Ponte and Croft’s SIGIR’98 paper (which, btw, is close to reaching a milestone of 1000 citations according to Google Scholar). At about the same time, and apparently independently of Ponte and Croft’s work, Hiemstra and Kraaij and Miller et al. proposed the same idea of scoring documents by query likelihood.
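For readers new to the idea, query-likelihood scoring can be sketched in a few lines of Python: each document is ranked by the probability that its (smoothed) language model would generate the query. This is only an illustrative sketch, not the exact formulation of any of the papers mentioned above. The use of Jelinek-Mercer (linear interpolation) smoothing, the value of `lam`, the toy documents, and the function names are all assumptions.

```python
import math
from collections import Counter

# Toy collection (hypothetical data).
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "web crawl of one billion pages".split(),
}
collection_tf = Counter(t for tokens in docs.values() for t in tokens)
collection_len = sum(collection_tf.values())


def query_likelihood(query, doc_tokens, lam=0.1):
    """Return log p(query | document) under a smoothed unigram model.

    lam is the weight of the collection model (assumed value).
    """
    doc_tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    log_p = 0.0
    for term in query:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_coll = collection_tf[term] / collection_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        log_p += math.log(p)
    return log_p


def rank(query):
    """Rank document ids by decreasing query likelihood."""
    return sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)


if __name__ == "__main__":
    print(rank("language models".split()))
```

Smoothing with the collection model is what keeps a document from getting zero probability just because it misses one query term; without it, "d2" above could not be scored at all for the query "language models".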

The last decade has witnessed tremendous progress in the use and development of LM techniques. Language models are attractive because of their strong foundations in statistical theory and their superior empirical performance. Further, they provide a principled way of modeling various special retrieval tasks—expert finding is a prominent example of that.

The latest issue of Foundations and Trends in Information Retrieval features an excellent article, “Statistical Language Models for Information Retrieval: A Critical Review”, by ChengXiang Zhai. It is a great survey that covers a wide spectrum of the work on LMs, with many useful references for further reading. In summary, this paper is highly recommended both for experts in language modeling and for newcomers to the field.

Thesis resources #1: CSIRO candidates and associations

As promised before, it’s now time to start sharing some resources that I obtained during my thesis work. This first release contains two CSIRO-related items: the list of CSIRO candidates (e-mail addresses) and a list of document-candidate associations.
I was actually keen to make these available before the submission deadline for the Expert Search runs at the TREC 2008 Enterprise track. These lists are, of course, far from perfect, but they worked quite well for me. If you have comments, suggestions, improved versions, etc., feel free to contact me!
The files are available at the same place as the CERC collection (so you’ll need the same username and password): http://es.csiro.au/cerc/data/balog. Thanks to Paul Thomas for arranging the hosting!