ICTIR 2016 paper online

“Exploiting Entity Linking in Queries for Entity Retrieval,” an upcoming ICTIR 2016 paper by Faegheh Hasibi, Svein Erik Bratsberg, and me, is available online now, along with the source code.

The premise of entity retrieval is to better answer search queries by returning specific entities instead of documents. Many queries mention particular entities; recognizing and linking them to the corresponding entry in a knowledge base is known as the task of entity linking in queries. In this paper we make a first attempt at bringing these two together, i.e., leveraging entity annotations of queries in the entity retrieval model. We introduce a new probabilistic component and show how it can be applied on top of any term-based entity retrieval model that can be emulated in the Markov Random Field framework, including language models, sequential dependence models, as well as their fielded variations. Using a standard entity retrieval test collection, we show that our extension brings consistent improvements over all baseline methods, including the current state-of-the-art. We further show that our extension is robust against parameter settings.

Update (16/09): Our paper received the Best Paper Honorable Mention Award at the conference. So it is definitely worth checking out ;)

A Test Collection for Entity Search in DBpedia

With this SIGIR ’13 short paper, we try to address some of the action points that were identified as important priorities for entity-oriented and semantic search at the JIWES workshop held at SIGIR ’12 (see the detailed workshop report). Namely: (A1) Getting more representative information needs and favoring long queries over short ones. (A2) Limiting search to a smaller, fixed set of entity types (as opposed to arbitrary types of entities). (A3) Using test collections that integrate both structured and unstructured information about entities.

An IR test collection has three main ingredients: a data collection, a set of queries, and corresponding relevance judgments. We propose to use DBpedia as the data collection; DBpedia is a community effort to extract structured information from Wikipedia. It is one of the most comprehensive knowledge bases on the web, describing 3.64M entities (in version 3.7). We took entity-oriented queries from a number of benchmarking evaluation campaigns, synthesized them into a single query set, and mapped known relevant answers to DBpedia. This mapping involved a series of not-too-exciting yet necessary data cleansing steps, such as normalizing URIs, replacing redirects, removing duplicates, and filtering out non-entity results. In the end, we have 485 queries with an average of 27 relevant entities per query.
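The cleansing steps above can be sketched roughly as follows. This is a minimal Python illustration, not the code used to build the collection: the function names, the redirects dictionary, and the is_entity filter (here it simply drops Category: pages) are all assumptions made for the example, and redirects are resolved one hop only.

```python
from urllib.parse import unquote

DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def normalize_uri(uri):
    """Trim whitespace and angle brackets, decode percent-escapes, use underscores."""
    uri = uri.strip().strip("<>")
    uri = unquote(uri)
    return uri.replace(" ", "_")


def is_entity(uri):
    """Hypothetical filter: keep DBpedia resources that are not category pages."""
    if not uri.startswith(DBPEDIA_PREFIX):
        return False
    return not uri[len(DBPEDIA_PREFIX):].startswith("Category:")


def clean_relevant_entities(uris, redirects, entity_filter=is_entity):
    """Normalize URIs, replace redirects, filter non-entities, remove duplicates.

    The order of first occurrence is preserved.
    """
    seen = set()
    cleaned = []
    for raw in uris:
        uri = normalize_uri(raw)
        uri = redirects.get(uri, uri)  # single-hop redirect lookup (assumption)
        if not entity_filter(uri) or uri in seen:
            continue
        seen.add(uri)
        cleaned.append(uri)
    return cleaned


# Illustrative input only; these URIs are not taken from the actual collection.
example_redirects = {DBPEDIA_PREFIX + "NYC": DBPEDIA_PREFIX + "New_York_City"}
example_uris = [
    "<http://dbpedia.org/resource/NYC>",
    "http://dbpedia.org/resource/New%20York%20City",
    "http://dbpedia.org/resource/Category:Cities",
    " http://dbpedia.org/resource/Oslo ",
]
example_cleaned = clean_relevant_entities(example_uris, example_redirects)
```

In this toy input, the redirect and the percent-encoded variant collapse into a single entry and the category page is dropped, leaving two entities.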

Now, let’s see how this relates to the action points outlined above. (A1) We consider a broad range of information needs, ranging from short keyword queries to natural language questions. The average query length, computed over the whole query set, is 5.3 terms, more than double the length of typical web search queries (which is around 2.4 terms). (A2) DBpedia has a consistent ontology comprising 320 classes, organized into a six-level-deep hierarchy; this allows for the incorporation of type information at different granularities. (A3) As DBpedia is extracted from Wikipedia, there is more textual content available for those who wish to combine structured and unstructured information about entities.

The paper also includes a set of baseline results using variants of two popular retrieval models: language models and BM25. We found that the various query subsets (originating from different benchmarking campaigns) exhibit different levels of difficulty, which was expected. What was rather surprising, however, is that none of the more advanced multi-field variants could really improve over the simplest possible single-field approach. We observed that a large number of topics were affected, but the number of topics helped/hurt was about the same. The breakdowns by various query subsets also suggest that there is no one-size-fits-all way to effectively address all types of information needs represented in this collection. This phenomenon could give rise to novel approaches in the future; for example, one could first identify the type of the query and then choose the retrieval model accordingly.
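To make the helped/hurt observation concrete, here is a small Python sketch of how such a per-topic comparison can be computed. The function name, the tolerance eps, and the scores below are made up for illustration; they are not results from the paper.

```python
def helped_hurt(baseline, variant, eps=1e-9):
    """Count topics where the variant scores higher, lower, or the same as the baseline.

    Both arguments map a topic ID to a per-topic effectiveness score
    (e.g., average precision). Every baseline topic must appear in variant.
    """
    helped = hurt = unchanged = 0
    for topic, base_score in baseline.items():
        delta = variant[topic] - base_score
        if delta > eps:
            helped += 1
        elif delta < -eps:
            hurt += 1
        else:
            unchanged += 1
    return helped, hurt, unchanged


# Made-up per-topic scores: the multi-field variant changes two topics,
# helping one and hurting the other.
single_field = {"q1": 0.30, "q2": 0.50, "q3": 0.10, "q4": 0.40}
multi_field = {"q1": 0.45, "q2": 0.35, "q3": 0.10, "q4": 0.40}

counts = helped_hurt(single_field, multi_field)
```

With these toy numbers, one topic is helped and one is hurt by the same margin, which is the kind of pattern that leaves an averaged metric essentially unchanged even though individual topics move.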

The resources developed as part of this study are made available here. You are also welcome to check out the poster I presented at SIGIR ’13.
If you have (or are planning to have) a paper that uses this collection, I would be happy to hear about it!

Entity Linking and Retrieval tutorial at WWW’13

Earlier this week, Edgar Meij, Daan Odijk, and I gave a half-day tutorial at the WWW’13 conference on Entity Linking and Retrieval.

The tutorial consists of three parts: (i) entity linking (Edgar), (ii) entity retrieval (me), and (iii) a hands-on lab session (Daan). The hands-on session is further subdivided into entity linking and entity retrieval parts. The slides are made available on github. We also created a Mendeley group with all the papers that were discussed. The tags, entity linking and entity retrieval, indicate the part of the tutorial to which each paper belongs. We intend to maintain and expand this repository, so it might be useful for you to follow this group.

Given that this was a half-day tutorial, we had to be quite selective in what we presented. A full-day version of the same tutorial will be given by us at SIGIR’13 in July. If you have suggestions for improvements and pointers to papers, approaches, services, etc. that we could/should cover (yes, this includes your own work) then don’t hesitate to get in touch with us!

First picks from 2013

It’s almost mid Feb, so I won’t even attempt to make it a Happy New Year entry. And I’ll keep it short.

As of Jan 1 this year, I’m working as an Associate Professor at the University of Stavanger. Don’t look for the IR group’s homepage, there is no such thing. Yet ;)

Briefly about (some of) my recent work. Not surprisingly, it’s all related to entities. In a SPIRE’12 paper we study ad-hoc entity retrieval in Linked Data in a distributed setting, with focus on the problems of collection ranking and collection selection. In a short position paper, written for the ESAIR’12 workshop, we discuss how to make entity retrieval temporally-aware, using semantic knowledge bases that are enriched with temporal information (like YAGO2). In a CIKM’12 poster we introduce the task of target type identification for entity-oriented queries, where types are organized hierarchically. We also made all related resources publicly available.
Most recently, just earlier this week, I gave a lecture on Semistructured Data Search at the PROMISE Winter School. At some point in the not-too-distant future there might be a written version of this material. So if you have any feedback, comments, suggestions, etc. please don’t hesitate to contact me.

Finally, I decided to set up and maintain a separate page with a list of entity-oriented benchmarking campaigns, workshops, and journal special issues. I hope people will find it useful. If you have a relevant piece to be added here, let me know.

JIWES summary

The First Joint International Workshop on Entity-oriented and Semantic Search (JIWES) was held on Aug 16, 2012 in Portland, Oregon, USA, in conjunction with the 35th Annual International ACM SIGIR Conference (SIGIR 2012). The objective for the workshop was to bring together academic researchers and industry practitioners working on entity-oriented search to discuss tasks and challenges, and to uncover the next frontiers for academic research on the topic. The workshop program accommodated two invited talks, eight refereed papers divided into two technical paper sessions, and a group discussion.

In the forthcoming issue of SIGIR Forum we give a detailed summary of the workshop; the preprint of this article is available here. The workshop papers are available online in the ACM Digital Library and at the workshop website. The latter also contains copies of the slides for most presentations.