ECIR’19 keynote

As the recipient of the 2018 Karen Spärck Jones Award, I was invited to give a keynote at the 41st European Conference on Information Retrieval (ECIR’19). Below are the slides of my presentation.

Entity-Oriented Search book

I am pleased to announce that my Entity-Oriented Search book is now available online.

This open access book covers all facets of entity-oriented search (where “search” can be interpreted in the broadest sense of information access) from a unified point of view, and provides a coherent and comprehensive overview of the state of the art. It represents the first synthesis of research in this broad and rapidly developing area. Selected topics are discussed in depth, the goal being to establish fundamental techniques and methods as a basis for future research and development. Additional topics are treated at a survey level only, with numerous pointers to the relevant literature. A roadmap for future research, based on open issues and challenges identified along the way, rounds out the book.

The DBpedia-Entity v2 Test Collection

The DBpedia-Entity collection is a standard test set for entity search. It is meant for evaluating retrieval systems that return a ranked list of entities in response to a free-text user query. The first version of the collection (DBpedia-Entity v1) was released in 2013, based on DBpedia v3.7. It was created by assembling search queries from a number of entity-oriented benchmarking campaigns (TREC, INEX, SemSearch, etc.) and mapping relevant results to DBpedia. An updated version of the collection, DBpedia-Entity v2, was released in 2017 as the result of a collaborative effort between the IAI group of the University of Stavanger, the Norwegian University of Science and Technology, Wayne State University, and Carnegie Mellon University. It was published at the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’17), where it received a Best Short Paper Honorable Mention Award.

DBpedia-Entity v2 is based on DBpedia version 2015-10 (specifically on the English subset) and comes with graded relevance assessments collected via crowdsourcing. We also report on the performance of a selection of retrieval methods using this collection.
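To illustrate how graded relevance assessments can be used to score a ranked list of entities, here is a minimal sketch of NDCG, a metric commonly paired with graded judgments. This post does not say which metric or gain function the paper uses, so the choice of NDCG, the linear gain, and the names `dcg` and `ndcg_at_k` are all assumptions made for illustration.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains, in rank order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking, qrels, k=10):
    """NDCG@k for one query.

    ranking: entity IDs in the order returned by the system.
    qrels:   dict mapping entity ID -> graded relevance (0 = non-relevant).
    Assumption: linear gain; an exponential gain (2**grade - 1) is another common choice.
    """
    gains = [qrels.get(entity, 0) for entity in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that lists entities in decreasing order of their grade scores 1.0; placing a lower-graded entity above a higher-graded one gives a score strictly between 0 and 1.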

The collection is available here.

ICTIR 2016 paper online

“Exploiting Entity Linking in Queries for Entity Retrieval,” an upcoming ICTIR 2016 paper by Faegheh Hasibi, Svein Erik Bratsberg, and myself, is now available online, along with the source code.

The premise of entity retrieval is to better answer search queries by returning specific entities instead of documents. Many queries mention particular entities; recognizing and linking them to the corresponding entry in a knowledge base is known as the task of entity linking in queries. In this paper we make a first attempt at bringing these two together, i.e., leveraging entity annotations of queries in the entity retrieval model. We introduce a new probabilistic component and show how it can be applied on top of any term-based entity retrieval model that can be emulated in the Markov Random Field framework, including language models, sequential dependence models, as well as their fielded variations. Using a standard entity retrieval test collection, we show that our extension brings consistent improvements over all baseline methods, including the current state-of-the-art. We further show that our extension is robust against parameter settings.

Update (16/09): Our paper received the Best Paper Honorable Mention Award at the conference. So it is definitely worth checking out ;)

A Test Collection for Entity Search in DBpedia

With this SIGIR ’13 short paper, we try to address some of the action points that were identified as important priorities for entity-oriented and semantic search at the JIWES workshop held at SIGIR ’12 (see the detailed workshop report). Namely: (A1) Getting more representative information needs and favoring long queries over short ones. (A2) Limiting search to a smaller, fixed set of entity types (as opposed to arbitrary types of entities). (A3) Using test collections that integrate both structured and unstructured information about entities.

An IR test collection has three main ingredients: a data collection, a set of queries, and corresponding relevance judgments. We propose to use DBpedia as the data collection; DBpedia is a community effort to extract structured information from Wikipedia. It is one of the most comprehensive knowledge bases on the web, describing 3.64M entities (in version 3.7). We took entity-oriented queries from a number of benchmarking evaluation campaigns, synthesized them into a single query set, and mapped known relevant answers to DBpedia. This mapping involved a series of not-too-exciting yet necessary data cleansing steps, such as normalizing URIs, replacing redirects, removing duplicates, and filtering out non-entity results. In the end, we have 485 queries with an average of 27 relevant entities per query.
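The cleansing steps mentioned above can be sketched roughly as follows. This is only an illustration, not the scripts we actually used: the function names, the `redirects` dictionary, and the `entity_uris` set are hypothetical stand-ins for data that would be derived from the DBpedia dumps.

```python
from urllib.parse import unquote


def normalize_uri(uri):
    """Strip whitespace and angle brackets, decode percent-escapes, use underscores."""
    uri = uri.strip().strip("<>")
    return unquote(uri).replace(" ", "_")


def clean_results(uris, redirects, entity_uris):
    """Apply the cleansing steps to the known relevant results of one query.

    uris:        raw result URIs taken from the original benchmark.
    redirects:   dict mapping a redirect URI to its target URI (hypothetical input).
    entity_uris: set of URIs that denote proper entities (hypothetical input).
    """
    seen = set()
    cleaned = []
    for uri in uris:
        uri = normalize_uri(uri)
        uri = redirects.get(uri, uri)   # replace a redirect by its target
        if uri not in entity_uris:      # filter out non-entity results
            continue
        if uri in seen:                 # remove duplicates, keeping the first one
            continue
        seen.add(uri)
        cleaned.append(uri)
    return cleaned
```

Resolving redirects before deduplicating matters here: two different source URIs may point to the same entity, and they should count as a single relevant result.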

Now, let’s see how this relates to the action points outlined above. (A1) We consider a broad range of information needs, ranging from short keyword queries to natural language questions. The average query length, computed over the whole query set, is 5.3 terms, which is more than double the length of typical web search queries (around 2.4 terms). (A2) DBpedia has a consistent ontology comprising 320 classes, organized into a six-level hierarchy; this allows for the incorporation of type information at different granularities. (A3) As DBpedia is extracted from Wikipedia, there is more textual content available for those who wish to combine structured and unstructured information about entities.

The paper also includes a set of baseline results using variants of two popular retrieval models: language models and BM25. We found that the various query subsets (originating from different benchmarking campaigns) exhibit different levels of difficulty, which was expected. What was rather surprising, however, is that none of the more advanced multi-field variants could really improve over the simplest possible single-field approach. We observed that a large number of topics were affected, but the numbers of topics helped and hurt were about the same. The breakdowns by query subset also suggest that there is no one-size-fits-all way to effectively address all types of information needs represented in this collection. This phenomenon could give rise to novel approaches in the future; for example, one could first identify the type of the query and then choose the retrieval model accordingly.
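For readers who want to run this kind of per-topic comparison on their own runs, a minimal sketch is given below. It assumes you already have a per-query effectiveness score (for example, average precision) for both the baseline and the variant; the function name `helped_hurt` and the tolerance `eps` are illustrative choices, not something taken from the paper.

```python
def helped_hurt(baseline, variant, eps=1e-9):
    """Count topics where the variant scores higher, lower, or the same as the baseline.

    baseline, variant: dicts mapping query ID -> per-topic score.
    Only queries present in both runs are compared.
    eps: differences smaller than this are treated as ties (assumed tolerance).
    """
    helped = hurt = same = 0
    for qid in baseline.keys() & variant.keys():
        diff = variant[qid] - baseline[qid]
        if diff > eps:
            helped += 1
        elif diff < -eps:
            hurt += 1
        else:
            same += 1
    return helped, hurt, same
```

If the helped and hurt counts are roughly equal, the average score can stay flat even though many individual topics change, which is the pattern described above.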

The resources developed as part of this study are made available here. You are also welcome to check out the poster I presented at SIGIR ’13.
If you have (or are planning to have) a paper that uses this collection, I would be happy to hear about it!