Two evaluation campaigns related to entity/expert search

The CLEF 2010 labs will feature two evaluation campaigns that are potentially of interest to people working in the area of entity/people/expert search.

The third WePS Evaluation Workshop (WePS3) focuses on two tasks related to web entity search:

  • Task 1: Clustering and Attribute Extraction for Web People Search.
    Given a set of web search results for a person name, the task is to cluster the pages according to the different people sharing the name and extract certain biographical attributes for each person. [details]
  • Task 2: Name ambiguity resolution for Online Reputation Management.
    Given a set of Twitter entries containing an (ambiguous) company name, and given the home page of the company, the task is to discriminate entries that do not refer to the company. Entries will be given in two languages: English and Spanish. [details] (A toy baseline sketch for this task is shown below the list.)
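For concreteness, here is a toy baseline one might try for Task 2. This is purely my own illustrative sketch, not an official baseline: flag a tweet as referring to the company if its vocabulary overlaps sufficiently with the text of the company's homepage (a real system would of course use far richer features and tune everything on the training data).

```python
import re

def tokenize(text):
    """Lowercase and split into alphabetic tokens (naive; ignores accents,
    hashtags, etc., which matters for the Spanish entries)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def refers_to_company(tweet, homepage_text, threshold=0.2):
    """Return True if enough of the tweet's terms also occur on the homepage.

    The 0.2 threshold is an arbitrary placeholder for illustration only.
    """
    tweet_terms = tokenize(tweet)
    if not tweet_terms:
        return False
    overlap = len(tweet_terms & tokenize(homepage_text))
    return overlap / len(tweet_terms) >= threshold
```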

The Cross-lingual Expert Search (CriES) workshop addresses the problem of multilingual expert search in social media environments. The workshop also includes a pilot challenge, which is very much like the expert finding task at the TREC Enterprise track: given a document collection and a query topic, return a ranked list of people who are likely to be experts on the topic. However, the document collection here is a multilingual social environment (Yahoo! Answers) and topics come in four languages (English, German, French, and Spanish).

Work hard, relax hard!

This picture gives you a bit of an idea of what my week in the run-up to the SIGIR deadline looked like (I also have a nice collection of energy drink cans at the office). But that is finally over, and I managed to get my submissions in on time. That means I’m done with the work hard part; now it’s time for some serious relaxing and recharging of batteries.

I’ll be on vacation until Feb 10, and this time, I won’t be checking my emails. If anything urgent comes up… it will have to wait; I’ll deal with it when I’m back.
I’m off to relax!

Looking back on 2009 and forward to 2010

This year has started quite intensely, with a research grant proposal deadline already in week 1. But I’d like to take a moment to look back on 2009 before rushing on to the next deadline (SIGIR 2010, less than a week away).

First, I’d like to make honorable mention of four of my colleagues/co-authors who defended their PhDs and became doctors in 2009. They are (in temporal order):

They all did a great job, congrats!

A significant milestone of 2009 was the launch of the TREC Entity track. The overall aim of this new track is to perform entity-related search on Web data. The track defines entities as “typed search results” or “things”, represented by their homepages on the web. In other words, our working definition of an entity is “something with a homepage”, and searching for entities thus corresponds to ranking these homepages. (As a side note: I am well aware that this definition of an entity is far from perfect; yet, the URL of a homepage is the best entity identifier we have come up with so far.)
The first year of the track investigated the problem of related entity finding:

Given an input entity (identified by its name and homepage), the type of the target entity, and the nature of their relation (described in free text), find related entities that are of the target type and stand in the required relation to the input entity.

This task can be seen as a mixture of Question Answering (specifically, the QA list task) and homepage finding. In the first year, we limited the track’s scope to searches for instances of the organization, person, and product entity types.
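To make the input concrete, a topic looks roughly like the mock example below (field names and values are illustrative, reconstructed from memory rather than copied from the official topic file; see the track guidelines for the exact format):

```xml
<query>
  <num>7</num>
  <entity_name>Blackberry</entity_name>
  <entity_URL>http://www.blackberry.com/</entity_URL>
  <target_entity>organization</target_entity>
  <narrative>Carriers that Blackberry makes phones for.</narrative>
</query>
```

A system then has to return homepages of entities of the target type (here: organizations) that stand in the described relation to the input entity.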
Thirteen groups participated and submitted a total of 41 runs; this demonstrates clear interest and, I think, is quite decent for the first edition of a track (it even beat well-established tracks, like the Blog track, in terms of the number of participating teams).
The track continues in 2010, where the main task will (again) be related entity finding, with moderate changes and more topics. We are also planning to feature another subtask (more details will follow on the track’s mailing list).

Another important development was the release of the EARS toolkit. EARS stands for Entity and Association Retrieval System; it is an open source implementation of entity-topic association finding models, used so far mostly in the context of expertise retrieval, but also for other tasks, for example blog distillation. While the functionality of EARS is currently limited to two baseline models (“Model 1” and “Model 2”), a number of additions are planned for future releases throughout 2010, most notably proximity-based variations of the existing models and methods for finding entity-entity relations (i.e., addressing the related entity finding task defined above).
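For those unfamiliar with the two baselines, here they are in simplified form (smoothing and other estimation details omitted; see our SIGIR 2006 paper for the full story). Model 1 builds a language model for each candidate e by aggregating term evidence over the documents associated with the candidate, while Model 2 first scores documents against the query and then propagates those scores to associated candidates:

```latex
% Model 1 ("candidate model"): aggregate term probabilities over the
% documents d associated with candidate e (smoothing omitted):
P(q \mid \theta_e) = \prod_{t \in q} \Big( \sum_{d} P(t \mid d)\, P(d \mid e) \Big)^{n(t,q)}

% Model 2 ("document model"): score documents against the query first,
% then propagate the scores to associated candidates:
P(q \mid e) = \sum_{d} \Big( \prod_{t \in q} P(t \mid \theta_d)^{n(t,q)} \Big) P(d \mid e)
```

Here n(t,q) is the frequency of term t in the query, and P(d|e) captures the strength of the association between document d and candidate e.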

A yearly evaluation would not be complete without mentioning citation counts. Some say citation is to publication as price is to stock, so this time I’ll do it NASDAQ-style. My total citation count has doubled (from 214 to 433) and my H-index has increased from 8 to 11. The top performing papers are shown below.

Citation counts (as of Jan 1)                                            2009  2010  +/-
1. Formal models for expert finding in enterprise corpora (SIGIR 2006)     74   136   -
2. Finding experts and their details in e-mail corpora (WWW 2006)          27    40   -
3. Broad expertise retrieval in sparse data environments (SIGIR 2007)      13    36   Up
4. Determining Expert Profiles (With an Application to Expert Finding)
   (IJCAI 2007)                                                            10    25   Up
5. Why are they excited? Identifying and explaining spikes in blog
   mood level (EACL 2006)                                                  13    21   -
6. Language Modeling Approaches for Enterprise Tasks (TREC 2005)           16    20   Down

The +/- column denotes the change in the relative ordering of my papers by citation count.
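As a refresher, the H-index is the largest h such that h of one’s papers have at least h citations each. A minimal computation, for anyone who wants to check the arithmetic:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(counts, start=1):
        if count >= rank:
            h = rank  # at least `rank` papers have >= `rank` citations
        else:
            break
    return h

# For example, h_index([136, 40, 36, 25, 21, 20]) == 6 on the six papers
# listed above; it is the full publication list that yields 11.
```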

According to these numbers, our SIGIR 2006 paper is (still) a massive leader and keeps following the rich-get-richer trend. A more interesting observation is that our expert profiling work seems to have gained impact and attention over the past year, as citation counts for the two profiling papers have almost tripled in 2009. This is good news, especially in light of some ongoing work we are doing in this area.

That’s it for now (longest post ever). I wish everybody a successful 2010 (and good luck to those with a SIGIR deadline)!

Last bundle of updates for 2009

I haven’t had time to post entries on my blog over the past few weeks (or even months; has it really been that long?). Anyway, here are a couple of things worth mentioning before 2009 is officially over.

A new version of the EARS toolkit has been released. Major changes concern document-entity associations and faster computation of candidate models; support for MS Visual Studio has also been added. See the changelog for details.

Our paper Category-based Query Modeling for Entity Search (by Krisztian Balog, Marc Bron, and Maarten de Rijke) has been accepted to ECIR 2010 and is now available online.

Abstract. Users often search for entities instead of documents and in this setting are willing to provide extra input, in addition to a query, such as category information and example entities. We propose a general probabilistic framework for entity search to evaluate and provide insight in the many ways of using these types of input for query modeling. We focus on the use of category information and show the advantage of a category-based representation over a term-based representation, and also demonstrate the effectiveness of category-based expansion using example entities. Our best performing model shows very competitive performance on the INEX-XER entity ranking and list completion tasks.
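Very roughly (this is my own shorthand here, not notation taken verbatim from the paper), the framework can be thought of as scoring an entity e against a query q by combining the similarity of their term-based representations with the similarity of their category-based representations, where the query side of both can be expanded using example entities:

```latex
% Simplified sketch: \theta^T are term-based models, \theta^C are
% category-based models, and \lambda balances the two types of evidence.
score(e, q) = \lambda \cdot sim\big(\theta_q^T, \theta_e^T\big)
            + (1 - \lambda) \cdot sim\big(\theta_q^C, \theta_e^C\big)
```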

See also: ECIR 2010 accepted papers, posters, and demos.

The TREC Enterprise 2008 overview paper has finally been posted to the proceedings.

Happy 2010!

TREC Enterprise 2008 overview

The overview paper of the TREC 2008 Enterprise track is finally available. While I was not an organizer of the track, I helped out with finishing the paper; the track organizers generously rewarded my contribution with first authorship. The document still needs to undergo the NIST approval process, but I am allowed to distribute it as a “draft”.
[Download PDF|BibTex].

Despite having my name on the overview paper, I am still wearing a participant’s hat. So the first question that comes to mind is: how did we do? (“We” is team ISLA, consisting of Maarten de Rijke and me.) To cut a long story short: we won! Of course, TREC (according to some people) is not a competition. I am not going to take a side on that matter (at least not in this post), so let me translate the simple “we won” statement from ordinary into scientific language: our run showed the best performance among all submissions for the expert finding task of the TREC 2008 Enterprise track. In fact, we took both first and second place on all metrics and on all three versions of the official qrels (which differ in how assessor agreement was handled). Our best run employed a combination of three models: a proximity-based candidate model, a document-based model, and a Web-based variation of the candidate model; our second best run is the same, but without the Web-based component. See the details in our paper [Download PDF|BibTex].
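The paper spells out exactly how the three models are combined; purely as an illustration of the general recipe (a weighted late fusion of per-model scores, with the normalization and weights below being my own placeholder choices, not the ones from the paper):

```python
def combine_runs(runs, weights):
    """Late fusion of expert rankings: a weighted sum of per-model scores.

    `runs` maps a model name to a {candidate: score} dict; `weights` maps
    the same model names to mixing weights. Scores are min-max normalized
    per run so that no single model dominates on scale alone.
    """
    combined = {}
    for name, scores in runs.items():
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        for cand, score in scores.items():
            normalized = (score - lo) / span
            combined[cand] = combined.get(cand, 0.0) + weights[name] * normalized
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# e.g. combine_runs({"proximity": ..., "document": ..., "web": ...},
#                   {"proximity": 0.4, "document": 0.4, "web": 0.2})
```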
Needless to say, I am very pleased with these results. Seeing that my investment in expert finding research has resulted in state-of-the-art performance feels just great.