TREC Entity 2010 draft guidelines

The draft guidelines for the 2010 edition of the track have been posted on the track’s website.

In 2010, Related Entity Finding (REF) runs as the main task of the track. A number of changes have been made compared to the previous edition, and we have also attempted to clarify issues such as what is and what is not an entity homepage.
In addition, the track introduces a second challenge, entity list completion (ELC), which will run as a pilot task.

Your feedback is not only welcome, but encouraged! Post it as a comment on the guidelines page or send it to the mailing list.

TREC Entity: overview of 2009 and plans for 2010

The Entity track overview paper has been added to the TREC 2009 online Proceedings [direct link to the pdf].
The track continues in 2010. An overview of what happened at the 2009 TREC conference (entity-wise), along with plans for the 2010 edition, has been published on the track’s website. There is some discussion on the mailing list too.

Looking back on 2009 and forward to 2010

This year has started quite intensely, with a research grant proposal deadline already in week 1. But I’d like to take a moment and look back on 2009 before rushing on to the next deadline (SIGIR 2010, less than a week away).

First, I’d like to give an honorable mention to four of my colleagues/co-authors who defended their PhDs and became doctors in 2009. They are (in temporal order):

They all did a great job, congrats!

A significant milestone of 2009 was the launch of the TREC Entity track. The overall aim of this new track is to perform entity-related search on Web data. The track defines entities as “typed search results” or “things”, represented by their homepages on the web. In other words, our working definition of an entity is “something with a homepage”, and searching for entities thus corresponds to ranking these homepages. (As a side note: I am well aware that this definition of an entity is far from perfect, yet the URL of a homepage is the best entity identifier we have come up with so far.)
The first year of the track investigated the problem of related entity finding:

Given an input entity (identified by its name and homepage), the type of the target entity, and the nature of their relation (described in free text), find related entities that are of the target type and stand in the required relation to the input entity.

This task can be seen as a mixture of Question Answering (specifically, the QA list task) and homepage finding. In the first year, we limited the track’s scope to searches for instances of the organization, person, and product entity types.
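To make the task a little more concrete, here is a minimal sketch, in Python, of how a REF topic and an answer could be represented. The field names, function name, and the example topic are hypothetical illustrations of the task description above; they do not reproduce the official topic format.

```python
# A hypothetical, simplified representation of a related entity finding (REF)
# topic and of one returned answer. Field names are illustrative only and do
# not reproduce the official TREC Entity topic format.

from dataclasses import dataclass
from typing import List

@dataclass
class REFTopic:
    entity_name: str      # name of the input entity
    entity_homepage: str  # homepage URL that identifies the input entity
    target_type: str      # e.g. "organization", "person", or "product"
    narrative: str        # free-text description of the required relation

@dataclass
class REFAnswer:
    homepage: str         # homepage URL that identifies a related entity
    score: float          # system confidence, used to rank the answers

# A made-up example topic, in the spirit of the task definition quoted above:
example = REFTopic(
    entity_name="Example University",
    entity_homepage="http://www.example.edu",
    target_type="person",
    narrative="Professors at Example University working on entity retrieval.",
)

def find_related_entities(topic: REFTopic) -> List[REFAnswer]:
    """Placeholder for an actual REF system: return homepages of entities of
    topic.target_type that stand in the relation described by topic.narrative
    to the input entity."""
    raise NotImplementedError
```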
Thirteen groups participated and submitted a total of 41 runs; this demonstrates clear interest and, I think, is quite decent for the first edition of the track (beating well-established tracks, such as the Blog track, in terms of the number of participating teams).
The track continues in 2010, where the main task will (again) be related entity finding, with moderate changes and more topics. We are also planning to feature another subtask (more details will follow on the track’s mailing list).

Another important development was the release of the EARS toolkit. EARS stands for Entity and Association Retrieval System; it is an open-source implementation of entity-topic association finding models, used so far mostly in the context of expertise retrieval, but also for other tasks, for example blog distillation. While the functionality of EARS is currently limited to two baseline models (“Model 1” and “Model 2”), a number of additions are coming in future releases throughout 2010, most notably proximity-based variations of the existing models and methods for finding entity-entity relations (i.e., addressing the related entity finding task defined above).
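For readers unfamiliar with these baselines, the following is a minimal sketch of the document-centric “Model 2” idea: score documents against the query with a language model, then propagate each document’s score to the entities associated with it. This is an illustration under my own naming assumptions, not the EARS API, and the smoothing and association weights are left to the caller.

```python
# A minimal sketch of a "Model 2"-style (document-centric) entity-topic
# association scorer. All names here (score_entities, doc_term_probs,
# doc_entity_assoc) are hypothetical; this is not the EARS interface.

import math
from collections import defaultdict

def score_entities(query_terms, doc_term_probs, doc_entity_assoc):
    """Rank entities for a query.

    query_terms      -- list of query terms, e.g. ["distributed", "computing"]
    doc_term_probs   -- {doc_id: {term: p(term | smoothed document model)}}
    doc_entity_assoc -- {doc_id: {entity_id: association weight p(d | e)}}
    """
    entity_scores = defaultdict(float)
    for doc_id, term_probs in doc_term_probs.items():
        # Query likelihood of the document, computed in log space.
        log_p_q_d = 0.0
        for term in query_terms:
            p = term_probs.get(term, 0.0)
            if p <= 0.0:
                log_p_q_d = float("-inf")  # document cannot generate the query
                break
            log_p_q_d += math.log(p)
        if log_p_q_d == float("-inf"):
            continue
        p_q_d = math.exp(log_p_q_d)
        # Aggregate: score(e) += p(q | d) * p(d | e) over associated entities.
        for entity_id, p_d_e in doc_entity_assoc.get(doc_id, {}).items():
            entity_scores[entity_id] += p_q_d * p_d_e
    return sorted(entity_scores.items(), key=lambda kv: kv[1], reverse=True)
```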

A yearly evaluation would not be complete without mentioning citation counts. Some say citation is to publication as price is to stock, so this time I do it NASDAQ-style. My total citation count has more than doubled (from 214 to 433) and my H-index has also increased, from 8 to 11. The top-performing papers are shown below.

Citation counts (as of Jan 1) | 2009 | 2010 | +/-
1. Formal models for expert finding in enterprise corpora (SIGIR 2006) | 74 | 136 |
2. Finding experts and their details in e-mail corpora (WWW 2006) | 27 | 40 |
3. Broad expertise retrieval in sparse data environments (SIGIR 2007) | 13 | 36 | Up
4. Determining Expert Profiles (With an Application to Expert Finding) (IJCAI 2007) | 10 | 25 | Up
5. Why are they excited? Identifying and explaining spikes in blog mood level (EACL 2006) | 13 | 21 |
6. Language Modeling Approaches for Enterprise Tasks (TREC 2005) | 16 | 20 | Down

+/- denotes the position change in the relative ordering of my papers, according to the number of citations.

According to these numbers, our SIGIR 2006 paper is (still) a massive leader and keeps following the rich-get-richer trend. A more interesting observation is that our expert profiling work seems to have gained impact and attention in the past year, as citation counts for the two profiling papers have almost tripled in 2009. This is good news, especially in light of some ongoing work we are doing in this area.

That’s it for now (longest post ever). I wish a successful 2010 to everybody (and good luck to those with a SIGIR deadline)!

Update on the TREC Entity track

The main development that I am pleased to report is the release of the final test topics. The test set comprises 20 topics, which is fewer than we originally aimed for, but this is what could be achieved within the time limits. We certainly wanted to avoid extending the deadlines even further.

Since the number of queries is probably too low to support generalizable conclusions, evaluation will primarily focus on per-topic analysis of the results, rather than on average measures.
It is also worth noting that many of the “primary” entity homepages may not be included in the Category B subset of the collection. In such cases the “descriptive” pages (including the entity’s Wikipedia page) are the best available.

The test topics can be downloaded from the TREC site (you need to be a registered participant for TREC 2009 to be able to access them).

The track’s guidelines have been updated and can be considered final, although minor changes or additions are possible, should anything need clarification.

The submission deadline is Sept 21, so there is still plenty of time. In fact, this might attract some more teams to participate, given that submissions for all other TREC tracks are due by the end of August, and many of these tracks use the same collection.

The good and the bad news

A quick update on the TREC Entity track, which reminds me of the classic good-news-bad-news situation. The good news is that we have just reached 100 members on the TREC Entity mailing list. The bad news is that almost all of them are silent.
On a more serious note, the track guidelines need to be finalized very soon. One way of interpreting the silence is that people are happy with the proposed task and all the details are clear. There may be other (less positive) interpretations. Whichever the case may be, in the absence of discussion, the organizers will simply dictate what is to be done.