At the broadest level, my research interests center on intelligent information access: developing retrieval technology that supports humans in dealing with massive volumes of data. The topical areas I am particularly interested in are entity retrieval, information needs, semantics, and evaluation. Previously, I worked extensively on expertise retrieval, mostly as part of my doctoral research. Across these themes, I focus on two distinct but complementary methodological aspects: modeling and engineering.

Entity retrieval

Many user information needs concern entities: people, organizations, locations, products, etc. Such needs are better answered by returning specific objects instead of mere documents. One example is expert search: returning a ranked list of people who are knowledgeable about a given query topic [1]. The blog distillation task provides another example; there, we want systems to retrieve blogs (as opposed to individual blog posts) that are principally devoted to a given topic [3]. The great challenge, of course, is to develop methods that can retrieve not just a single type of entity (a person or a blog), but arbitrary types of entities. I am interested in developing models that can capture and exploit the semantically structured information that is available for entities. Much of this work makes use of knowledge bases published as Linked Data [4–6].
Another interesting area of investigation is searching for entities based on the relationships between them, for example, identifying organizations that have a certain type of relation to a specific person. I was the lead organizer of the TREC Entity track, an annual international benchmarking effort that introduced and featured the related entity finding task: return a ranked list of entities of a given type that engage in a requested relation with some input entity (e.g., “airlines that currently use Boeing 747 planes”) [2].

Selected publications:

  1. A Language Modeling Framework for Expert Finding, IPM 2009
  2. Ranking Related Entities: Components and Analyses, CIKM 2010
  3. Blog Feed Search with a Post Index, IRJ 2011
  4. Query Modeling for Entity Search Based on Terms, Categories and Examples, TOIS 2011
  5. On the Modeling of Entities for Ad-hoc Entity Search in the Web of Data, ECIR 2012
  6. Example Based Entity Finding in the Web of Data, ECIR 2013

Information needs

Queries are an expression of the user’s information need, usually in the form of a sequence of a few keywords; often, this is a very sparse representation. Therefore, obtaining a better representation of information needs is an important theme. I have looked at query expansion for various IR tasks: document [1], expert [2], email [3], blog [4,6], and entity [5] search, using generative language modeling techniques. These approaches work much like blind relevance feedback, although expansion terms are sampled not only from documents [1,2] but also from more complex, “beyond the document” contexts, such as the thread, mailing list and community for email search [3], expertise profiles for enterprise document and expert finding, external collections for blog retrieval [4,6], and associated categories for entity search [5].
But there is more. Users with specialized information needs are often willing to express them in a more elaborate manner than a short sequence of keywords. Abstractions of such scenarios, with additional user inputs, have recently been realized at various evaluation platforms: at the TREC 2007 Enterprise track, a few example documents were provided, and at the INEX Entity Ranking track, category information and example entities were given. I have developed theoretically sound ways of using these types of input for query modeling and expansion [1,5].
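As a rough illustration of this style of query modeling, the following minimal sketch interpolates a maximum-likelihood query model with an expansion model estimated from feedback documents. It is not any specific published model; the function names and the interpolation parameter `lam` are my own.

```python
from collections import Counter

def ml_model(terms):
    """Maximum-likelihood term distribution p(t) over a bag of terms."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def feedback_model(feedback_docs, num_terms=10):
    """Expansion model estimated from (pseudo-)relevant feedback documents:
    average the per-document term distributions, keep the top terms, renormalize."""
    avg = Counter()
    for doc in feedback_docs:
        for t, p in ml_model(doc).items():
            avg[t] += p / len(feedback_docs)
    top = dict(avg.most_common(num_terms))
    norm = sum(top.values())
    return {t: p / norm for t, p in top.items()}

def expanded_query_model(query_terms, feedback_docs, lam=0.5):
    """Expanded query model as a linear mixture:
    p(t|q') = (1 - lam) * p(t|q) + lam * p(t|F)."""
    q, f = ml_model(query_terms), feedback_model(feedback_docs)
    return {t: (1 - lam) * q.get(t, 0.0) + lam * f.get(t, 0.0)
            for t in set(q) | set(f)}
```

In the work described above, the expansion distribution is estimated from richer "beyond the document" contexts (threads, external collections, categories) rather than from plain feedback documents; only the mixture step is common to all variants.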

Selected publications:

  1. A Few Examples Go A Long Way: Constructing Query Models from Elaborate Query Formulations, SIGIR 2008
  2. Non-Local Evidence for Expert Finding, CIKM 2008
  3. Using Contextual Information to Improve Search in Email Archives, ECIR 2009
  4. A Generative Blog Post Retrieval Model that Uses Query Expansion based on External Collections, ACL 2009
  5. Query Modeling for Entity Search Based on Terms, Categories and Examples, TOIS 2011
  6. Exploiting External Collections for Query Expansion, TWEB 2012


Semantics

In my research, I define semantics in terms of structures with (a reference to) meaning. Bringing semantics into text-based entity modeling is an ongoing research focus of mine. While working on expertise retrieval, I considered different types of structure: on the collection level (link structure, importance of document types), on the document level (linguistic structure, internal structure of emails), on the topic level (topic hierarchy), and on the organizational level (organizational hierarchy). The language modeling setting allowed me to incorporate these different types of structure into the models in a theoretically sound and (often) effective way [1]. A next step in this direction was the introduction of category-based entity modeling; the main idea here is to capture not only textual material but also type information associated with entities [4].
Linked Open Data is a recent contribution of the emerging Semantic Web that has the potential to provide this “extra” semantic information; it offers vast amounts of training material for associating the language usage around entities with explicit, semantically meaningful entity types and relations. In recent work I have investigated the utility of LOD for entity retrieval [5,6] and, to a limited extent, also the possibility of combining Web data and Linked Open Data [2,3].

Selected publications:

  1. People Search in the Enterprise, PhD Thesis 2008
  2. SaHaRa: Discovering Entity-Topic Associations in Online News, ISWC 2009
  3. Ranking Related Entities: Components and Analyses, CIKM 2010
  4. Query Modeling for Entity Search Based on Terms, Categories and Examples, TOIS 2011
  5. On the Modeling of Entities for Ad-hoc Entity Search in the Web of Data, ECIR 2012
  6. When Simple is (more than) Good Enough: Effective Semantic Search with (almost) no Semantics, ECIR 2012


Evaluation

Evaluation is a key challenge within information retrieval. I have been actively engaged in building test collections (the UvT Expert Collection, the Sindice-2011 Dataset, and the DBpedia-Entity test set) and in running international benchmarking campaigns (WebCLEF in 2006 and the TREC Entity track between 2009 and 2011 [1–3]). I also have an ongoing interest in developing new evaluation methodology, particularly in the context of entity profiling [5].
The data divide between academia and industry is increasing. Usage and interaction data are key ingredients of commercial search engines; yet, getting access to this type of data proves very difficult for academics. Recently, I have been working on a “living labs” methodology that would give the broader research community access to this type of proprietary data [4,6]. In current work, I am putting effort into operationalizing a living labs initiative for IR benchmarking and evaluation [7,8].

Selected publications:

  1. Overview of the TREC 2009 Entity Track, TREC 2009
  2. Overview of the TREC 2010 Entity Track, TREC 2010
  3. Overview of the TREC 2011 Entity Track, TREC 2011
  4. Towards a Living Lab for Information Retrieval Research and Development. A Proposal for a Living Lab for Product Search Tasks, CLEF 2011
  5. On the Assessment of Expertise Profiles, JASIST 2013
  6. Report on the CIKM Workshop on Living Labs for Information Retrieval Evaluation, SIGIR Forum 2014
  7. Head First: Living Labs for Ad-hoc Search Evaluation, CIKM 2014


Expertise retrieval

My PhD research focused on developing methods for people search in an organizational setting. In particular, two core tasks were investigated: expert finding (retrieving people who are experts on a given topic) and expert profiling (characterizing the skills and knowledge of a person). The fact that people are not represented directly (as retrievable units, such as documents) gave rise to the main scientific challenges examined in my PhD thesis: (1) to identify people indirectly through their occurrences in documents, (2) to represent them, and (3) to match these representations with those of queries. Both expertise retrieval tasks were approached as an association finding problem between topics and people. Associations are captured using a probabilistic generative framework based on statistical language models. The resulting models were shown to be powerful, effective, and able to incorporate a number of extensions in a transparent and theoretically sound way [1–8].
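As a rough sketch of the document-centric flavor of such a framework, a candidate can be scored by aggregating smoothed document-model query likelihoods over the documents associated with that candidate. This is a minimal illustration, not the published models themselves; the helper names are mine, and the association strengths p(d|ca) are assumed to be given.

```python
from collections import Counter

def smoothed_doc_model(doc, coll_tf, coll_len, lam=0.5):
    """Jelinek-Mercer smoothed document language model p(t|theta_d):
    interpolate the document's term frequencies with collection statistics."""
    tf, dlen = Counter(doc), len(doc)
    return lambda t: (1 - lam) * tf[t] / dlen + lam * coll_tf[t] / coll_len

def score_candidate(query, assoc, docs, coll_tf, coll_len):
    """Document-centric expert finding: p(q|ca) = sum_d p(q|theta_d) * p(d|ca),
    where assoc maps doc ids to association strengths p(d|ca)."""
    score = 0.0
    for doc_id, p_d_ca in assoc.items():
        model = smoothed_doc_model(docs[doc_id], coll_tf, coll_len)
        p_q = 1.0
        for t in query:
            p_q *= model(t)  # query-likelihood under this document's model
        score += p_q * p_d_ca
    return score
```

Smoothing matters here: without the collection-model term, a single query word missing from a document would zero out that document's entire contribution to the candidate's score.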
When modeling expertise, one can consider more than the content-based evidence that is directly available from (related) documents. Indeed, humans take several other contextual factors into account, such as organizational structure, position, experience, and social distance, when deciding which expert(s) to select or recommend. These contextual factors can be quantified and combined with content-based methods to improve retrieval effectiveness [9,10].

Selected publications:

  1. Formal Models for Expert Finding in Enterprise Corpora, SIGIR 2006
  2. Finding Experts and their Details in E-mail Corpora, WWW 2006
  3. Determining Expert Profiles (With an Application to Expert Finding), IJCAI 2007
  4. Broad Expertise Retrieval in Sparse Data Environments, SIGIR 2007
  5. Finding Similar Experts, SIGIR 2007
  6. Associating People and Documents, ECIR 2008
  7. Non-Local Evidence for Expert Finding, CIKM 2008
  8. A Language Modeling Framework for Expert Finding, IPM 2009
  9. Contextual Factors for Finding Similar Experts, JASIST 2010
  10. A User-oriented Model for Expert Finding, ECIR 2011
  11. Expertise Retrieval, FnTIR 2012