Future research directions in IR

Wondering what your next IR conference paper should be about? This is the billion dollar question (well, at least for IR researchers) that I surely won’t answer for you. But, here is some hint.
(I’ve just come across this on Facebook (thnx to Arjen P. De Vries and Claudia Hauff); this is evidence, that if you cut through all the clutter, FB can indeed be a great tool sometimes for finding serendipitous information. Maybe this is also something to think about…)
The list contains nominated papers from prominent IR researchers “that, in their opinion, represent important new directions, research areas, or results in the IR field.”
I must say I thoroughly enjoyed reading it. And yes, it does make me feel good that I see our last year’s ECIR paper with Elena Smirnova on the list :)

TREC Entity related developments

There has been a lot of silence on this blog since May. This is not because I have too little to say, but I have too much to do :)

A lot of effort has gone into organizing the TREC Entity track; those who are interested could follow developments on the track’s mailing list and blog. Topics are available for both the main (Related Entity Finding) and for the pilot (Entity List Completion) tasks. Developing topics for the latter involved some engineering work that I think might be worth sharing; I’m planning to do so, but don’t take it as a promise.

Another Entity track related development is that Marc Bron, Maarten de Rijke and myself have a paper accepted at CIKM 2010. In this paper, we propose a generative modeling framework for addressing the related entity finding (REF) task and perform a detailed analysis of four core components; co-occurrence models, type filtering, context modeling and homepage finding. Check out the abstract or the full paper. We made a number of resources used in the paper available to help others to repeat and improve upon our experiments.

Two evaluation campaigns related to entity/expert search

The CLEF 2010 labs will feature two evaluation campaigns that are potentially of interest to people working in the area of entity/people/expert search.

The third WePS Evaluation Workshop (WePS3) focuses on two tasks related to web entity search:

The Cross-lingual Expert Search (CriES) workshop addresses the problem of multi-lingual expert search in social media environments. The workshop also includes a pilot challenge, which is very much like the expert finding task at the TREC Enterprise track: given a document collection and a query topic, return a ranked list of experts, who are likely to be experts on the topic. However, the document collection is a multilingual social environment (Yahoo! Answers) and topics come in 4 different languages (English, German, French, Spanish).

Dataset of 1 billion web pages

The ambitious goal set out for TREC 2009 was to have a collection of 1 billion web pages. One dataset that can be shared by several tracks (specifically, the Entity, Million query, Relevance feedback, and Web tracks).
In November 2008, when this was discussed at the TREC 2008 conference, people were concerned with two main questions: (1) Is it possible to create such crawl (given the serious time constraints)? (2)  Are we going to be able to handle (at least, index) this amount of data?
Jamie Callan was confident that they (the Language Technologies Institute at Carnegie Mellon University) could build this crawl by March 2009. His confidence was not unfounded, since they had managed to create a crawl of a few hundreds of millions of web pages earlier. Yet, the counter for the one billion documents collection was to be started from 0 again…
Against this background, let us fast forward to the present. The crawl has recently completed and the dataset, referred to as ClueWeb09, is now available. It is 25 terabytes uncompressed (5 terabytes compressed), which brings me back to the troubling question: are we going to be able to handle that? We (being ILPS) will certainly do our best to step up to the challenge. I shall post about our attempts in detail later on.
But, it is a fact that doing retrieval on 1 billion documents is too big of a bite for many research groups, as it calls for nontrivial software and hardware architecture (note that it is 40 times more data than the Gov2 corpus, which I believe was the largest web crawl available to the research community so far with its 25 million documents). Therefore, a “Category B” subset of the collection is also available, consisting of “only” 50 million English pages. Some of the tracks (the Entity track for sure) will use only the Category B subset in 2009.

Next Page →