As the recipient of the 2018 Karen Spärck Jones Award, I was invited to give a keynote at the 41st European Conference on Information Retrieval (ECIR’19). Below are the slides of my presentation.
Two journal papers on online evaluation
I am a co-author of two journal papers that appeared in the special issues of the Journal of Data and Information Quality on Reproducibility in IR.
The article entitled “OpenSearch: Lessons Learned from an Online Evaluation Campaign” by Jagerman et al. reports on our experience with TREC OpenSearch, an online evaluation campaign that enabled researchers to evaluate their experimental retrieval methods using real users of a live website. TREC OpenSearch focused on the task of ad hoc document retrieval within the academic search domain. We describe our experimental platform, which is based on the living labs methodology, and report on the experimental results obtained. We also share our experiences, challenges, and the lessons learned from running this track in 2016 and 2017.
The article entitled “Evaluation-as-a-Service for the Computational Sciences: Overview and Outlook” by Hopfgartner et al. discusses the Evaluation-as-a-Service paradigm, where data sets are not provided for download but are instead accessed via application programming interfaces (APIs), virtual machines (VMs), or other means of shipping executables to the data. We summarize and compare current approaches, consolidate the experiences gained with them, and outline next steps toward sustainable research infrastructures.
Evaluating document filtering systems over time
Performance of three systems over time. Systems A and B degrade, while System C improves over time, but they all have the same average performance over the entire period. We express the change in system performance using the derivative of the fitted line (in orange) and compare performance at what we call the “estimated end-point” (the large orange dots).
Our IPM paper “Evaluating document filtering systems over time” with Tom Kenter and Maarten de Rijke as co-authors is available online. In this paper we propose a framework for measuring the performance of document filtering systems. Up to now, such systems have been evaluated in terms of traditional metrics like precision, recall, MAP, nDCG, F1, and utility. We argue that these metrics lack support for the temporal dimension of the task. We propose a time-sensitive way of measuring performance by employing trend estimation. In short, performance is calculated per batch, a trend line is fitted to the batch results, and systems are compared by their estimated performance at the end of the evaluation period. To demonstrate our proposed evaluation methodology, we analyze the runs submitted to the Cumulative Citation Recommendation task of the 2012 and 2013 editions of the TREC Knowledge Base Acceleration track, and show that important new insights emerge.
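To make the idea concrete, here is a minimal sketch of the batch, trend line, and end-point steps. It is an illustration rather than the implementation from the paper: it assumes an ordinary least-squares line fitted over the batch index, and the function names (`fit_trend`, `estimated_end_point`) and the per-batch scores are made up for this example. The three hypothetical systems mirror the figure: they share the same average score, but differ in slope and in estimated end-point.

```python
from statistics import mean


def fit_trend(scores):
    """Fit an ordinary least-squares line to per-batch scores.

    Batches are indexed 0, 1, ..., n-1. Returns (slope, intercept).
    Assumption: a straight line is an adequate trend model.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two batches to fit a trend")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(scores)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def estimated_end_point(scores):
    """Value of the fitted trend line at the last batch."""
    slope, intercept = fit_trend(scores)
    return intercept + slope * (len(scores) - 1)


# Hypothetical per-batch scores (e.g. F1 per batch); not data from the paper.
systems = {
    "A": [0.7, 0.6, 0.5, 0.4, 0.3],
    "B": [0.6, 0.55, 0.5, 0.45, 0.4],
    "C": [0.3, 0.4, 0.5, 0.6, 0.7],
}

for name, scores in systems.items():
    slope, _ = fit_trend(scores)
    print(
        f"System {name}: mean={mean(scores):.2f} "
        f"slope={slope:+.3f} end-point={estimated_end_point(scores):.2f}"
    )
```

Ranking by the mean alone cannot separate these three systems, whereas the slope and the estimated end-point clearly favor System C.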