Living Labs at TREC, CLEF, ECIR

There are a number of exciting developments around the living labs initiative that Anne Schuth and I have been working on. Our goal with this initiative is to enable researchers to perform online experiments, i.e., in-situ evaluation with actual users of a live site, as opposed to relying exclusively on paid assessors (or simulated users). We believe that this is nothing less than a paradigm shift. We define this new evaluation paradigm as follows:

The experimentation platform is an existing search engine. Researchers have the opportunity to replace components of this search engine and evaluate these components using interactions with real, unsuspecting users of this search engine.

CLEF LL4IR'16

Our first pilot campaign, LL4IR, co-organized by Liadh Kelly, ran at CLEF earlier this year with two use-cases: product search and web search. See our (extended) overview paper for details. LL4IR will run again at CLEF next year with the same use-cases. Thanks to CLEF, our API has by now been used extensively and tested thoroughly: it has successfully processed over 0.5M query issues coming from real users of the two search engines involved.
TREC OpenSearch

Based on the positive feedback we received from researchers as well as commercial partners, we decided it’s time to go big, that is, TREC. Getting into TREC is no small feat, given that the number of tracks is limited to 8 and a large number of proposals compete for the slot(s) that may get freed up each year. We are very pleased that TREC accepted our proposal; it attests to the importance of the direction we’re heading in. At TREC OpenSearch we’re focusing on academic search. It’s an interesting domain as it offers a low barrier to entry with ad-hoc document retrieval, and at the same time is a great playground for current research problems, including semantic matching (to overcome vocabulary mismatch), semantic search (retrieving not just documents but authors, institutes, conferences, etc.), and recommendation (of related literature). We are in the process of finalizing the agreements with academic search engines and plan to have our guidelines completed by March 2016.
LiLa'16

Using our API is easy (documentation and examples are available online), but it is different from the traditional TREC-style way of evaluation. Therefore, Anne and I will be giving a tutorial, LiLa, at ECIR in Padova in March 2016. The timing is ideal in that it’s well before the TREC and CLEF deadlines and allows prospective participants to familiarize themselves with both the underlying theory and the practicalities of our methodology.
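To give a rough flavor of the participant workflow, here is a minimal sketch of assembling a run for upload. Note that the endpoint path, API key, and field names below are hypothetical placeholders, not the actual API routes; consult the online documentation for the real interface.

```python
# Hypothetical sketch of a living-labs participant workflow.
# Endpoint paths, the key, and JSON field names are invented for
# illustration only; see the official API documentation for the real ones.

API = "http://example.org/api"       # placeholder base URL
KEY = "my-participant-key"           # placeholder participant key

def build_run(qid, ranked_docids, runid="baseline"):
    """Assemble the JSON body for uploading one ranking for one query."""
    return {
        "qid": qid,
        "runid": runid,
        "doclist": [{"docid": d} for d in ranked_docids],
    }

# The actual upload would then be a single HTTP request, e.g. with the
# `requests` library:
#   requests.put(f"{API}/participant/run/{KEY}/{qid}",
#                json=build_run(qid, ranked_docids))
```

The key difference from TREC-style evaluation is that such runs are uploaded ahead of time and then served to real users, with feedback (clicks) collected through the same API.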
Last but not least, we are thankful to 904Labs for hosting our API infrastructure.

Living Labs developments

There have been a number of developments over the past months around our living labs for IR evaluation efforts.

We had a very successful challenge workshop in Amsterdam in June, thanks to the support we received from ELIAS, ESF, and ILPS. The scientific report summarizing the event is available online.

There are many challenges associated with operationalizing a living labs benchmarking campaign. Chief among these are incorporating results from experimental search systems into live production systems, and obtaining sufficiently many impressions from relatively low-traffic sites. We propose that frequent (head) queries can be used to generate result lists offline, which are then interleaved with results of the production system for live evaluation. The choice of head queries is critical because (1) it removes the harsh requirement of providing rankings in real time for query requests and (2) it ensures that experimental systems receive enough impressions, on the same set of queries, for a meaningful comparison. This idea is described in detail in an upcoming CIKM’14 short paper: Head First: Living Labs for Ad-hoc Search Evaluation.
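The interleaving step can be illustrated with a toy implementation of team-draft interleaving, one common choice for this kind of online comparison (this is an illustrative sketch, not the exact algorithm used in the paper): each "team" alternately contributes its highest-ranked document not yet shown, and the team whose documents attract more clicks wins the impression.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=0):
    """Team-draft interleaving of two ranked lists of document ids.
    Returns the interleaved list and, per position, which team ('A' or 'B')
    contributed that document."""
    rng = random.Random(seed)
    interleaved, teams, seen = [], [], set()
    picks_a = picks_b = 0
    while True:
        cand_a = [d for d in ranking_a if d not in seen]
        cand_b = [d for d in ranking_b if d not in seen]
        if not cand_a and not cand_b:
            break
        # The team with fewer picks so far goes next; ties broken randomly.
        a_turn = picks_a < picks_b or (picks_a == picks_b and rng.random() < 0.5)
        if (a_turn and cand_a) or not cand_b:
            doc = cand_a[0]
            teams.append("A"); picks_a += 1
        else:
            doc = cand_b[0]
            teams.append("B"); picks_b += 1
        seen.add(doc)
        interleaved.append(doc)
    return interleaved, teams

def credited_clicks(teams, clicked_positions):
    """Count clicks credited to each team for one impression."""
    counts = {"A": 0, "B": 0}
    for pos in clicked_positions:
        counts[teams[pos]] += 1
    return counts
```

With head queries, the experimental ranking can be computed offline once per query, and only the interleaving and click logging need to happen at serving time.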

A sad but newsworthy development was that our CIKM’14 workshop got cancelled. It was our plan to organize a living labs challenge as part of the workshop; that challenge cannot be run as originally planned. But now we have something much better.

Living Labs for IR Evaluation (LL4IR) will run as a Lab at CLEF 2015 under the tagline “Give us your ranking, we’ll have it clicked!” The first edition of the lab will focus on three specific use-cases: (1) product search (on an e-commerce site), (2) local domain search (on a university’s website), (3) web search (through a commercial web search engine). See further details here.

Entity-oriented evaluation efforts in 2012

I’ve received a couple of emails asking about TREC Entity 2012. For those who don’t know it yet: the track won’t run in 2012.

In a nutshell, the level of participation in 2011 was much lower than we would have wished, especially for the REF task; as a consequence, the resulting pools are probably not of great quality. The ELC task was more successful in terms of the number of submissions, but I don’t know about the quality; the relevance assessments are yet to be done there (this has unfortunately been long delayed, mostly because of my lack of time for finishing up the assessment interface). Apart from the ELC results, last year’s efforts have been documented in the 2011 track overview paper.

Why not continue in 2012? We did not see a point in repeating the related entity finding task; over the three years of the track we managed to build a healthy-sized topic set for those who want to work on this. And we simply didn’t have a great idea for a “next big thing.” The track is not necessarily over; I’d prefer to say it’s on hold.

There are, however, a number of entity-related evaluation campaigns running in 2012. I compiled a list of these (and will try to keep it updated).

  • TREC Knowledge Base Acceleration (KBA) This is a new TREC track. The first edition will feature a special filtering task: given an incoming text stream (news and social media content) and a target entity from a knowledge base (for now: people, specified by their Freebase and Wikipedia entries), generate a score for each item (“document”) based on how “pertinent” it is to the target KB node. The first month of the incoming stream will come with human-generated labels and can be used as training data; the remaining months are for evaluation.
  • INEX Data Centric Track (Not sure it’ll run in 2012, as the call is not out yet.) Last year’s track used the IMDB data collection and defined two tasks. The ad hoc search task has informational requests to be answered by a ranked list of IMDB entities (specifically, persons or movies). The faceted search task asks for a restricted list of facets and facet-values to help the user refine the query through a multi-step search session.
  • TAC Knowledge Base Population (KBP) The track investigates tasks related to extracting information about entities with reference to an external knowledge source (Wikipedia infoboxes). KBP 2011 had three tasks: entity-linking: given an entity name (person, organization, or geopolitical entity) and a document containing that name, determine the KB node for that entity or add a new node for the entity if it is not already in the KB; slot-filling: given a named entity and a pre-defined set of attributes (“slots”) for the entity type, augment a KB node for that entity by extracting all new learnable slot values from a large corpus of documents; temporal slot-filling: similar to the regular slot-filling task, but also requests time intervals to be specified for each extracted slot value.
  • CLEF RepLab This new CLEF Lab sets out to study the problem of online reputation management (ORM); in a sense this effort continues and takes the WePS3 ORM task to the next level by defining a longer-term research agenda and by setting up various tasks within the problem domain. The website is not up yet, but according to the CLEF Labs flyer two tasks will be evaluated on Twitter data: a monitoring task, where the goal is to thematically cluster tweets including a company’s name (this seems to be the same as the WePS3 ORM task); and a profiling task, where the goal is to annotate tweets according to their polarity (i.e., whether they have positive or negative implications for the company’s reputation).
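To make the KBA-style filtering task above concrete, here is a deliberately naive baseline scorer; the function and the name-matching heuristic are my own illustration, not part of any track definition. A real system would of course exploit context, KB-derived aliases, temporal signals, and so on.

```python
def pertinence_score(doc_text, entity_names):
    """Toy KBA-style filtering baseline: score a streamed document by the
    fraction of the target entity's name variants it mentions.
    `entity_names` might hold the canonical name plus known aliases."""
    text = doc_text.lower()
    hits = sum(1 for name in entity_names if name.lower() in text)
    return hits / len(entity_names)
```

Documents scoring above some threshold would then be routed to the editor maintaining that KB node.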

Feel free to send me a message about anything that might be added here.

Yahoo! Semantic Search Challenge

The 3rd Semantic Search Workshop (SemSearch’10) organized an Entity Search Challenge last year (see my notes from the event). This competition is being organized again this year. There are two tasks: entity search (queries refer to a particular entity) and list search (complex queries with multiple possible answers). The collection is the Billion Triple Challenge 2009 (BTC-2009) data set, the same as last year. It is also the data set we used at the TREC Entity track in 2010, so I encourage all TREC Entity participants to take part, and vice versa.
There is even a cash prize of $500 offered by Yahoo! for the winner of each task; it’s more of a symbolic reward than real remuneration ;-) but anyway, it’s not the money we academics are after, is it?
The submission deadline is Mar 21. For more details see: