Entity-oriented evaluation efforts in 2012

I’ve received a couple of emails asking about TREC Entity 2012. For those who don’t know yet: the track won’t run in 2012.

In a nutshell, the level of participation in 2011 was much lower than we would have wished, especially for the REF task; as a consequence, the resulting pools are probably not of great quality. The ELC task was more successful in terms of the number of submissions, but I don’t know about the quality; the relevance assessments are yet to be done there (this has unfortunately been long delayed, mostly because of my lack of time for finishing up the assessment interface). Apart from the ELC results, last year’s efforts have been documented in the 2011 track overview paper.

Why not continue in 2012? We did not see a point in repeating the related entity finding task; over the three years of the track we managed to build a healthy-sized topic set for those who want to work on this. And we simply didn’t have a great idea for a “next big thing.” The track is not necessarily over; I’d prefer to say it’s on hold.

There are, however, a number of entity-related evaluation campaigns running in 2012. I compiled a list of these (and will try to keep it updated).

  • TREC Knowledge Base Acceleration (KBA) This is a new TREC track. The first edition will feature a special filtering task: given an incoming text stream (news and social media content) and a target entity from a knowledge base (for now: people, specified by their Freebase and Wikipedia entries), generate a score for each item (“document”) based on how “pertinent” it is to the target KB node. The first month of the incoming stream will come with human-generated labels and can be used as training data; the later months are for evaluation.
  • INEX Data Centric Track (Not sure it’ll run in 2012, as the call is not out yet.) Last year’s track used the IMDB data collection and defined two tasks. The ad hoc search task has informational requests to be answered by a ranked list of IMDB entities (specifically, persons or movies). The faceted search task asks for a restricted list of facets and facet-values to help the user refine the query through a multi-step search session.
  • TAC Knowledge Base Population (KBP) The track investigates tasks related to extracting information about entities with reference to an external knowledge source (Wikipedia infoboxes). KBP 2011 had three tasks: entity-linking: given an entity name (person, organization, or geopolitical entity) and a document containing that name, determine the KB node for that entity or add a new node for the entity if it is not already in the KB; slot-filling: given a named entity and a pre-defined set of attributes (“slots”) for the entity type, augment a KB node for that entity by extracting all new learnable slot values from a large corpus of documents; temporal slot-filling: similar to the regular slot-filling task, but also requests time intervals to be specified for each extracted slot value.
  • CLEF RepLab This new CLEF Lab sets out to study the problem of online reputation management (ORM); in a sense this effort continues and takes the WePS3 ORM task to the next level by defining a longer-term research agenda and by setting up various tasks within the problem domain. The website is not up yet, but according to the CLEF Labs flyer two tasks will be evaluated on Twitter data: a monitoring task, where the goal is to thematically cluster tweets including a company’s name (this seems to be exactly the same as the WePS3 ORM task); and a profiling task, where the goal is to annotate tweets according to their polarity (i.e., whether they have positive or negative implications for the company’s reputation).

Feel free to send me a message about anything that might be added here.

TREC 2010 summary

The 19th Text REtrieval Conference (TREC) took place at the “usual” time and place: Gaithersburg, MD, in the second half of November. Seven tracks ran in 2010: Blog, Chemical IR, Entity, Legal, Relevance Feedback, Session, and Web.
The Entity track was very popular both in terms of the number of participants and the number of posters presented. The proposed approaches displayed a great degree of diversity and made the presentations very interesting. I don’t want to repeat myself, so I refer to the posts on the Entity website for the conference summary and plans for 2011.
As to TREC 2011, the Chemical IR, Entity, Session, Legal, and Web tracks will continue. The Blog track will migrate to a new Microblog track and will investigate social search, especially search over Twitter data. Two more new tracks will be added: Crowdsourcing (as a means of evaluation) and Medical records (content-based access to the free text fields of medical records, e.g., find patients with disease X treated with Y). Finally, CMU is planning another Web crawl, successor to ClueWeb09; one idea is to have a smaller set of pages, but crawled regularly over a period of time.

Dataset of 1 billion web pages

The ambitious goal set out for TREC 2009 was to have a collection of 1 billion web pages: a single dataset that can be shared by several tracks (specifically, the Entity, Million Query, Relevance Feedback, and Web tracks).
In November 2008, when this was discussed at the TREC 2008 conference, people were concerned with two main questions: (1) Is it possible to create such a crawl (given the serious time constraints)? (2) Are we going to be able to handle (at least, index) this amount of data?
Jamie Callan was confident that they (the Language Technologies Institute at Carnegie Mellon University) could build this crawl by March 2009. His confidence was not unfounded, since they had managed to create a crawl of a few hundred million web pages earlier. Yet, the counter for the one billion documents collection was to be started from 0 again…
Against this background, let us fast forward to the present. The crawl has recently been completed, and the dataset, referred to as ClueWeb09, is now available. It is 25 terabytes uncompressed (5 terabytes compressed), which brings me back to the troubling question: are we going to be able to handle that? We (being ILPS) will certainly do our best to step up to the challenge. I shall post about our attempts in detail later on.
But it is a fact that doing retrieval on 1 billion documents is too big a bite for many research groups, as it calls for nontrivial software and hardware architecture (note that it has 40 times as many documents as the Gov2 corpus, which I believe was the largest web crawl available to the research community so far, with its 25 million documents). Therefore, a “Category B” subset of the collection is also available, consisting of “only” 50 million English pages. Some of the tracks (the Entity track for sure) will use only the Category B subset in 2009.
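To get a feel for the scale, the figures quoted above can be turned into a few back-of-the-envelope ratios. The following is only a rough sketch in Python: it uses nothing but the numbers mentioned in this post, assumes decimal units (1 TB = 10^12 bytes), and the function and constant names are made up for illustration.

```python
# Figures quoted in the post. Decimal units are an assumption: 1 TB = 10**12 bytes.
CLUEWEB09_PAGES = 1_000_000_000      # full ClueWeb09 collection
CLUEWEB09_UNCOMPRESSED_TB = 25
CLUEWEB09_COMPRESSED_TB = 5
GOV2_DOCS = 25_000_000               # Gov2 corpus
CATEGORY_B_PAGES = 50_000_000        # "Category B" English subset


def collection_stats():
    """Return rough scale ratios derived from the figures above (hypothetical helper)."""
    total_bytes = CLUEWEB09_UNCOMPRESSED_TB * 10**12
    return {
        # Average uncompressed size of a page, in kilobytes (1 KB = 1000 bytes).
        "avg_page_kb": total_bytes / CLUEWEB09_PAGES / 1000,
        # How much smaller the compressed collection is.
        "compression_ratio": CLUEWEB09_UNCOMPRESSED_TB / CLUEWEB09_COMPRESSED_TB,
        # Number of documents relative to Gov2.
        "docs_vs_gov2": CLUEWEB09_PAGES / GOV2_DOCS,
        # Fraction of the full collection covered by Category B.
        "category_b_share": CATEGORY_B_PAGES / CLUEWEB09_PAGES,
    }


if __name__ == "__main__":
    for name, value in collection_stats().items():
        print(f"{name}: {value}")
```

Under these assumptions, an average page is about 25 KB uncompressed, and Category B covers 5% of the full collection, which is still twice the number of documents in Gov2.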