TREC 2010 summary

The 19th Text REtrieval Conference (TREC) took place at the “usual” time and place: Gaithersburg, MD, in the second half of November. Seven tracks ran in 2010: Blog, Chemical IR, Entity, Legal, Relevance Feedback, Session, and Web.
The Entity track was very popular, both in the number of participants and in the number of posters presented. The proposed approaches were highly diverse, which made the presentations very interesting. I don’t want to repeat myself, so I refer readers to the posts on the Entity website for the conference summary and plans for 2011.
As for TREC 2011, the Chemical IR, Entity, Session, Legal, and Web tracks will continue. The Blog track will migrate to a new Microblog track, which will investigate social search, especially search over Twitter data. Two more new tracks will be added: Crowdsourcing (as a means of evaluation) and Medical Records (content-based access to the free-text fields of medical records, e.g., find patients with disease X treated with Y). Finally, CMU is planning another Web crawl, a successor to ClueWeb09; one idea is to have a smaller set of pages, but crawled regularly over a period of time.

Hadoop Hackathon @SARA

SARA is organizing a kick-off meeting for its Proof-of-Concept Hadoop service on Dec 7, 2010 at the Science Park, Amsterdam. A major part of the event will be a “hackathon”, a hands-on introduction to Hadoop, with the support of two Hadoop experts: Edgar Meij and Djoerd Hiemstra. It’s a good opportunity to learn about Hadoop and play with it on existing datasets (for example Wikipedia, Enron, or the White House access records), or on a case of your own choice.

TREC Entity related developments

There has been a lot of silence on this blog since May. This is not because I have too little to say, but because I have too much to do :)

A lot of effort has gone into organizing the TREC Entity track; those who are interested can follow developments on the track’s mailing list and blog. Topics are available for both the main (Related Entity Finding) and the pilot (Entity List Completion) tasks. Developing topics for the latter involved some engineering work that I think might be worth sharing; I’m planning to do so, but don’t take it as a promise.

Another Entity track-related development is that Marc Bron, Maarten de Rijke, and I have had a paper accepted at CIKM 2010. In this paper, we propose a generative modeling framework for addressing the related entity finding (REF) task and perform a detailed analysis of four core components: co-occurrence models, type filtering, context modeling, and homepage finding. Check out the abstract or the full paper. We have made a number of resources used in the paper available to help others repeat and improve upon our experiments.