Last bundle of updates for 2009

I haven’t had time to post entries on my blog over the past few weeks (or even months; has it really been that long?). Anyway, here are a couple of things worth mentioning before 2009 is officially over.

A newer version of the EARS toolkit has been released. Major changes concern document-entity associations and faster computation of candidate models, as well as support for MS Visual Studio. See the changelog for details.

Our paper entitled Category-based Query Modeling for Entity Search, with Krisztian Balog, Marc Bron, and Maarten de Rijke as authors, has been accepted to ECIR 2010 and is available online now.

Abstract. Users often search for entities instead of documents and in this setting are willing to provide extra input, in addition to a query, such as category information and example entities. We propose a general probabilistic framework for entity search to evaluate and provide insight in the many ways of using these types of input for query modeling. We focus on the use of category information and show the advantage of a category-based representation over a term-based representation, and also demonstrate the effectiveness of category-based expansion using example entities. Our best performing model shows very competitive performance on the INEX-XER entity ranking and list completion tasks.

See also: ECIR 2010 accepted papers, posters, and demos.

The TREC Enterprise 2008 overview paper has finally been posted to the proceedings.

Happy 2010!

TREC Enterprise 2008 overview

The overview paper of the TREC 2008 Enterprise track is -finally- available. While I was not an organizer of the track, I helped out with finishing the paper; the track organizers generously rewarded my contribution with first authorship. The document still needs to undergo the NIST approval process, but I am allowed to distribute it as a “draft”.
[Download PDF|BibTex].

Despite having my name on the overview paper, I am still wearing a participant’s hat. So the first question that comes to mind is: How did we do? (We is team ISLA, consisting of Maarten de Rijke and me.) To cut a long story short: we won! Of course, TREC (according to some people) is not a competition. I am not going to take a side on that matter (at least not in this post), so let me translate the simple “we won” statement from ordinary to scientific language: our run showed the best performance among all submissions for the expert finding task of the TREC 2008 Enterprise track. Actually, we achieved both first and second place for all metrics and for all three versions of the official qrels (they differ in how assessor agreement was handled). Our best run employed a combination of three models: a proximity-based candidate model, a document-based model, and a Web-based variation of the candidate model; our second best run is the same, but without the Web-based component. See the details in our paper [Download PDF|BibTex].
Needless to say, I am very pleased with these results. Seeing that my investment in research on expert finding has resulted in state-of-the-art performance feels just great.

Awarded the Victorine van Schaickprijs 2009

On the 9th of October 2009, I received the Victorine van Schaickprijs 2009 award for my PhD dissertation entitled “People Search in the Enterprise”. This award is given out yearly by the Victorine van Schaick Funds to one selected publication (journal article, book, or report) in the area of library and information sciences; it comes with a cash prize of €1500 and a bronze medal.

This year the Board of the Foundation chose my thesis as the winner because of “its impact on the discipline and because it is of interest to a wide circle of colleagues”. Also, “The jury appreciates especially his willingness to undertake research in less explored areas of the field.” (from the Jury report).

I would like to use this opportunity to express my gratitude to my thesis supervisor, Prof. Maarten de Rijke. I would like to thank the selection committee again for this award: I am extremely pleased with this recognition of my work.

Official report of the Award Ceremony (in Dutch).

EARS released

After a period of development I am ready to release EARS to the world. EARS is an open source toolkit for entity-oriented search and discovery in large text collections. The association finding framework and models implemented in EARS were originally developed for expertise retrieval in an organizational setting, during my PhD studies. These models are robust and generic, and can be applied to finding associations between topics and entities other than people.

At present, EARS supports two main tasks: finding entities (“Which entities are associated with topic X?”) and profiling entities (“What topics is an entity associated with?”), and implements two baseline search strategies for accomplishing these tasks; these became popularly known as “Model 1” and “Model 2”.
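To give a rough idea of how the two strategies differ, here is a minimal sketch in Python. It is an illustrative assumption that follows the usual description of these models in the expert finding literature, and it is not the EARS code: Model 1 first builds a single language model per entity from its associated documents and then scores the query against it, while Model 2 scores the query against each document and then sums the scores weighted by the document-entity association. All names (`term_probs`, `score_model1`, `score_model2`, `lam`) and the toy data are hypothetical.

```python
from collections import Counter


def term_probs(tokens):
    """Maximum-likelihood term distribution p(t|d) for a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def score_model1(query, docs, assoc, background, lam=0.5):
    """Candidate-based sketch: one smoothed language model per entity.

    assoc maps document id -> association weight p(d|entity) (hypothetical input).
    """
    doc_lms = {d: term_probs(docs[d]) for d in assoc}
    score = 1.0
    for t in query:
        # p(t|entity) aggregates term evidence over the associated documents
        p_t_entity = sum(doc_lms[d].get(t, 0.0) * w for d, w in assoc.items())
        score *= (1 - lam) * p_t_entity + lam * background.get(t, 0.0)
    return score


def score_model2(query, docs, assoc, background, lam=0.5):
    """Document-based sketch: score each document, then weight by association."""
    total = 0.0
    for d, w in assoc.items():
        lm = term_probs(docs[d])
        p_q_d = 1.0
        for t in query:
            p_q_d *= (1 - lam) * lm.get(t, 0.0) + lam * background.get(t, 0.0)
        total += p_q_d * w
    return total


# Toy collection (made up for illustration only)
docs = {
    "d1": ["entity", "search", "toolkit"],
    "d2": ["cooking", "recipes"],
}
background = term_probs([t for toks in docs.values() for t in toks])
associations = {"alice": {"d1": 1.0}, "bob": {"d2": 1.0}}
query = ["entity", "search"]

for entity, assoc in associations.items():
    print(entity,
          score_model1(query, docs, assoc, background),
          score_model2(query, docs, assoc, background))
```

In this toy setup each entity has a single associated document, so the two sketches give the same score; they start to differ once an entity is associated with several documents.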

A software system will never be finished; EARS is no exception to that rule. It is an active research project with ongoing development and enhancements. A number of new models and features will be included in upcoming releases. Feedback, comments, and suggestions are always welcome.

The toolkit is available at http://code.google.com/p/ears/.

Update on the TREC Entity track

The main development that I am pleased to report is the release of the final test topics. The test set comprises 20 topics, which is fewer than we originally aimed for, but this is what could be achieved within the time limits. We certainly wanted to avoid extending the deadlines even further.

Since the number of queries is probably too low to support generalizable conclusions, evaluation will primarily focus on per-topic analysis of the results, rather than on average measures.
It is also worth noting that many of the “primary” entity homepages may not be included in the Category B subset of the collection. In such cases the “descriptive” pages (including the entity’s Wikipedia page) are the best available.
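
As a rough illustration of what a per-topic analysis could look like, the sketch below compares two runs topic by topic instead of comparing their averages. It is only an assumption about one possible procedure, not the track's official evaluation; the function name `per_topic_comparison` and the scores are hypothetical.

```python
def per_topic_comparison(run_a, run_b):
    """Compare two runs topic by topic.

    Each run maps a topic id to a metric value for that topic.
    Returns the per-topic differences and a (wins, losses, ties) tuple for run_a.
    """
    topics = sorted(set(run_a) & set(run_b))
    deltas = {t: run_a[t] - run_b[t] for t in topics}
    wins = sum(1 for d in deltas.values() if d > 0)
    losses = sum(1 for d in deltas.values() if d < 0)
    ties = len(deltas) - wins - losses
    return deltas, (wins, losses, ties)


# Made-up scores for three topics, for illustration only
run_a = {1: 0.5, 2: 0.2, 3: 0.4}
run_b = {1: 0.3, 2: 0.6, 3: 0.4}

deltas, summary = per_topic_comparison(run_a, run_b)
for topic, delta in deltas.items():
    print(topic, delta)
print(summary)
```

Looking at the individual differences shows which topics a run helps or hurts, which an average over only 20 topics can easily hide.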

The test topics can be downloaded from the TREC site (you need to be a registered participant for TREC 2009 to be able to access them).

The track’s guidelines have been updated and can be considered final, although minor changes or additions are possible, should anything need clarification.

The submission deadline is Sept 21, so there is still plenty of time. In fact, this might attract some more teams to participate, given that submissions for all other TREC tracks are due by the end of August, and many of these tracks use the same collection.