ECIR’16 contributions

Last Sunday, Anne Schuth and I gave a tutorial on Living Labs for Online Evaluation. The tutorial’s homepage contains all the slides and reference material.

Experimental evaluation has always been central to Information Retrieval research. The field is increasingly moving towards online evaluation, which involves experimenting with real, unsuspecting users in their natural task environments, a so-called living lab. Specifically, with the recent introduction of the Living Labs for IR Evaluation initiative at CLEF and the OpenSearch track at TREC, researchers can now have direct access to such labs. With these benchmarking platforms in place, we believe that online evaluation will be an exciting area to work on in the future. This half-day tutorial aims to provide a comprehensive overview of the underlying theory and complement it with practical guidance.

Today, Faegheh Hasibi is presenting our work on the reproducibility of the TAGME Entity Linking System. The full paper and resources for this work are available online.

Among the variety of approaches proposed for entity linking, the TAGME system has gained considerable attention and is considered a must-have baseline. In this paper, we examine the repeatability, reproducibility, and generalizability of TAGME by comparing results obtained from its public API with (re)implementations from scratch. We find that the results reported in the paper cannot be repeated due to the unavailability of data sources. Some of the results are reproducible only through the provided API, while the rest are not reproducible at all. We further show that the TAGME approach generalizes to the task of entity linking in queries. Finally, we provide insights gained during this process and formulate lessons learned to inform future reproducibility efforts.

Living Labs at TREC, CLEF, ECIR

There are a number of exciting developments around the living labs initiative that Anne Schuth and I have been working on. Our goal with this activity is to enable researchers to perform online experiments, i.e., in-situ evaluation with actual users of a live site, as opposed to relying exclusively on paid assessors (or simulated users). We believe that this is nothing less than a paradigm shift. We define this new evaluation paradigm as follows:

The experimentation platform is an existing search engine. Researchers have the opportunity to replace components of this search engine and evaluate these components using interactions with real, unsuspecting users of this search engine.

CLEF LL4IR'16

Our first pilot campaign, LL4IR, co-organized by Liadh Kelly, ran at CLEF earlier this year with two use-cases: product search and web search. See our (extended) overview paper for details. LL4IR will run again at CLEF next year with the same use-cases. Thanks to CLEF, our API has by now been used extensively and tested thoroughly: it has successfully processed over 0.5M query issues coming from real users of the two search engines involved.
TREC OpenSearch

Based on the positive feedback we received from researchers as well as commercial partners, we decided it was time to go big, that is, TREC. Getting into TREC is no small feat, given that the number of tracks is limited to eight and a large number of proposals compete for the slot(s) that may get freed up each year. We are very pleased that TREC accepted our proposal; it attests to the importance of the direction we are heading in. At TREC OpenSearch we are focusing on academic search. It is an interesting domain, as it offers a low barrier to entry with ad-hoc document retrieval, and at the same time is a great playground for current research problems, including semantic matching (to overcome vocabulary mismatch), semantic search (retrieving not just documents but authors, institutes, conferences, etc.), and recommendation (of related literature). We are in the process of finalizing the agreements with academic search engines and plan to have our guidelines completed by March 2016.
LiLa'16

Using our API is easy (documentation and examples are available online), but it is different from the traditional TREC-like way of evaluation. Therefore, Anne and I will be giving a tutorial, LiLa, at ECIR in Padova in March 2016. The timing is ideal: it is well before the TREC and CLEF deadlines and allows prospective participants to familiarize themselves with both the underlying theory and the practicalities of our methodology.
Last but not least, we are thankful to 904Labs for hosting our API infrastructure.

Living Labs developments

There have been a number of developments over the past months around our living labs for IR evaluation efforts.

We had a very successful challenge workshop in Amsterdam in June, thanks to the support we received from ELIAS, ESF, and ILPS. The scientific report summarizing the event is available online.

There are many challenges associated with operationalizing a living labs benchmarking campaign. Chief among these are incorporating results from experimental search systems into live production systems, and obtaining sufficiently many impressions from relatively low-traffic sites. We propose that frequent (head) queries can be used to generate result lists offline, which are then interleaved with the results of the production system for live evaluation. The choice of head queries is critical because (1) it removes the harsh requirement of providing rankings in real time for query requests and (2) it ensures that experimental systems receive enough impressions, on the same set of queries, for a meaningful comparison. This idea is described in detail in an upcoming CIKM’14 short paper: Head First: Living Labs for Ad-hoc Search Evaluation.
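One standard way to merge an experimental ranking with the production ranking is team-draft interleaving. The sketch below is a minimal illustration of that general technique; the function names and the simple click-credit scheme are my own, not code from the paper or the platform:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings with team-draft interleaving.

    Returns the interleaved list plus a parallel list recording which
    ranker ("A" or "B") contributed each position, so that clicks on the
    combined list can later be credited to one of the two systems.
    """
    rng = rng or random.Random()
    lists = {"A": list(ranking_a), "B": list(ranking_b)}
    pos = {"A": 0, "B": 0}      # next unread position in each list
    count = {"A": 0, "B": 0}    # documents contributed so far
    interleaved, teams, seen = [], [], set()

    def has_next(team):
        # Skip documents the other ranker already contributed.
        while pos[team] < len(lists[team]) and lists[team][pos[team]] in seen:
            pos[team] += 1
        return pos[team] < len(lists[team])

    while has_next("A") or has_next("B"):
        # The ranker with fewer contributions picks next; ties break randomly.
        if count["A"] != count["B"]:
            team = "A" if count["A"] < count["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        if not has_next(team):  # chosen list is exhausted: fall back
            team = "B" if team == "A" else "A"
        doc = lists[team][pos[team]]
        interleaved.append(doc)
        teams.append(team)
        seen.add(doc)
        count[team] += 1
    return interleaved, teams

def credit_clicks(teams, clicked_positions):
    """Count how many clicked positions belong to each ranker."""
    wins = {"A": 0, "B": 0}
    for i in clicked_positions:
        wins[teams[i]] += 1
    return wins
```

Because the head queries are fixed in advance, the experimental lists can be computed offline and interleaved at serving time, which is exactly what makes the approach workable for low-traffic sites.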

A sad but newsworthy development was that our CIKM’14 workshop got cancelled. We had planned to organize a living labs challenge as part of the workshop, and that challenge cannot be run as originally planned. But now we have something much better.

Living Labs for IR Evaluation (LL4IR) will run as a lab at CLEF 2015 under the tagline “Give us your ranking, we’ll have it clicked!” The first edition of the lab will focus on three specific use-cases: (1) product search (on an e-commerce site), (2) local domain search (on a university’s website), and (3) web search (through a commercial web search engine). See further details here.

Living Labs for Information Retrieval Evaluation

Evaluation is a central aspect of information retrieval (IR) research. In the past few years, a new evaluation methodology known as living labs has been proposed as a way for researchers to perform in-situ evaluation. This is not new, you might say; major web search engines have been doing it for several years already. While this is very true, it also means that this type of experimentation, with real users performing tasks using real-world applications, is only available to those selected few who are involved with the research labs of these organizations. There has been a lot of complaining about the “data divide” between industry and academia; living labs might be a way to bridge it.

The Living Labs for Information Retrieval Evaluation (LL’13) workshop at CIKM last year was a first attempt to bring people, both from academia and industry, together to discuss challenges and to formulate practical next steps. The workshop was successful in identifying and documenting possible further directions. See the preprint of the workshop summary.

The second edition of the Living Labs for IR workshop (LL’14) will run at CIKM this year. Our main goals are to continue our community building efforts around living labs for IR and to pursue the directions set out at LL’13. Having a community benchmarking platform with shared tasks would be a key catalyst in enabling people to make progress in this area. This is exactly what we are trying to set up for LL’14, in the form of a challenge (with the ultimate goal of turning it into a TREC, NTCIR, or CLEF track in the future).

The challenge focuses on two specific use-cases: product search and local domain search. The basic idea is that participants receive a set of 100 frequent queries along with candidate results for these queries, and some general collection statistics. They are then expected to produce rankings for each query and to upload these rankings through an API. These rankings are evaluated online, with real users, and the results of these evaluations are made available to the participants, again through the API.
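As a rough sketch of what this workflow might look like on the participant side, the snippet below packages a ranking for one query and uploads it over HTTP. The base URL, endpoint path, API key, and JSON field names here are illustrative assumptions, not the documented API:

```python
import json
import urllib.request

# All identifiers below are assumptions for illustration; the actual
# API's base URL, paths, and payload schema may differ.
API_BASE = "http://example.org/api/participant"
API_KEY = "YOUR-PARTICIPANT-KEY"  # assumed to be issued on registration

def build_run(qid, ranked_docids):
    """Package one query's ranking in the (assumed) upload format."""
    return {"qid": qid, "doclist": [{"docid": d} for d in ranked_docids]}

def upload_run(qid, ranked_docids):
    """PUT the ranking to the (assumed) run endpoint and return the reply."""
    req = urllib.request.Request(
        f"{API_BASE}/run/{API_KEY}/{qid}",
        data=json.dumps(build_run(qid, ranked_docids)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The key point of the design is that rankings are produced offline and pushed once; the platform then takes care of exposing them to users and collecting the feedback, which participants poll for later.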

In preparation for this challenge, we are organising a challenge workshop in Amsterdam on the 6th of June. The programme includes invited talks and a “hackathon.” We have a limited number of travel grants available (for those coming from outside The Netherlands and coming from academia) to cover travel and accommodation expenses. These are available on a “first come first served” basis (at most one per institute). If you would like to make use of this opportunity, please let us know as soon as possible.

More details may be found on our brand-new website:

Call for Demos | Living Labs for IR workshop

The Living Labs for Information Retrieval Evaluation (LL’13) workshop at CIKM’13 invites researchers and practitioners to present their innovative prototypes or practical developments in a dedicated demo track. Demo submissions must be based on an implemented system that pursues one or more aspects relevant to the interest areas of the workshop.

Authors are strongly encouraged to target scenarios that are rooted in real-world applications. One way to think about this is by considering the following: as a company operating a website/service/application, what methods could allow various academic groups to experiment with specific components of this website/service/application?
In particular, we seek prototypes that define specific component(s) in the context of some website/service/application and allow for the testing and evaluation of alternative methods for that component. One example is search within a specific vertical (such as a product or travel search engine), but we encourage authors to think outside the (search) box.

All accepted demos will be evaluated and considered for the Best Demo Award.
The Best Demo Award winner will receive an award of 750 EUR, offered by the ‘Evaluating Information Access Systems’ (ELIAS) ESF Research Networking Programme. The award can be used to cover travel, accommodation or other expenses in relation to attending and/or demo’ing at LL’13.

The submission deadline for demos and for all other contributions is July 22 (extended).

Further details can be found on the workshop website.