The 3rd Semantic Search workshop (SemSearch2010) was held on Monday in conjunction with the WWW2010 conference in Raleigh, NC, USA.
This post is about the highlights of the workshop, with some personal comments at the end. For more information, check the posts by Christian Grant and Jeff Dalton, and the #semsearch2010 hashtag on Twitter.
An excellent keynote, entitled Why users need semantic search, was given by Barney Pell from Bing.
- The 3 main problems of search:
- Imprecise results (25% of clicks lead to ‘back’)
- Refinement (42% of sessions need refinement)
- Lengthy tasks (50% of decision sessions are lengthy, >30 min)
- Entity centered experiences
- Arise when the user’s intent is focused on a specific entity
- E.g., entity cards, table of contents, related entities
- Key issues
- Ambiguity requires entity resolution
- Entities must be disambiguated to unique IDs in massive DBs
- Semantic improvements to core search
- Semantic retrieval & ranking
- Better entity tagging
- Derive semantic graphs from text
- Semantic proximity rather than textual proximity
- Semantic query understanding
- Presentation and captions
- Smarter text selection for captions
- Smart summarization
- Captions with word variation
- Captions with structured data (e.g., processing LinkedIn as structured data)
- Faceted search (e.g., chicken recipes)
- Answers & question answering / Task-oriented experiences
- User wants to make a decision and get something done (book flight, hotel, etc.)
- Conversational assistant “Get me a table for two at Don Giovanni (at 8pm)”
- Semantics is key to reducing today’s search pain
- Great results (better document and query understanding)
- Better organized page results, faceted refinement, smart suggestions
- Powerful decision making tools
- Now is the right time to work on search and semantics!
The best paper award went to Using BM25F for Semantic Search by José R. Pérez-Agüera et al. [PDF]. The authors use a fielded index to represent RDF structure (text, title, objects, inlinks, and rdf type) and apply structured IR models for semantic document retrieval. The fielded variations of Lucene and BM25 were evaluated on DBpedia (corresponding to the INEX 2009 Wikipedia collection). The empirical evaluation highlights the shortcomings of the multi-fielded Lucene weighting scheme, which actually performs worse than the single-fielded variation (the reason is that Lucene combines the per-field term-frequency contributions linearly, which loses the desired saturation effect over the document as a whole). As for BM25F, with the exception of the title field, using fields does not lead to any notable improvement in effectiveness.
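To make the saturation argument concrete, here is a minimal Python sketch of one way to read it (this is not the authors’ implementation; the field names, weights, and the k1 value are made up for illustration):

```python
# Toy illustration of the scoring difference between combining fields
# before vs. after term-frequency saturation.
K1 = 1.2  # illustrative saturation parameter


def saturate(tf):
    """BM25-style term-frequency saturation (length normalization omitted)."""
    return tf / (tf + K1)


def bm25f_style(field_tfs, field_weights):
    """BM25F: merge weighted term frequencies across fields, then saturate once."""
    combined_tf = sum(field_weights[f] * tf for f, tf in field_tfs.items())
    return saturate(combined_tf)


def per_field_combination(field_tfs, field_weights):
    """Linear combination of per-field contributions: saturate each field, then sum."""
    return sum(field_weights[f] * saturate(tf) for f, tf in field_tfs.items())


if __name__ == "__main__":
    # A term that occurs in several fields of the same RDF document.
    tfs = {"title": 1, "text": 4, "objects": 2}
    weights = {"title": 3.0, "text": 1.0, "objects": 1.5}

    print("BM25F-style score:    ", round(bm25f_style(tfs, weights), 3))
    print("per-field combination:", round(per_field_combination(tfs, weights), 3))
    # In the per-field variant every extra matching field adds another
    # (already saturated) contribution, so the score keeps growing with the
    # number of fields; BM25F saturates on the merged frequency instead.
```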
Peter Mika presented his work on Distributed Indexing for Semantic Search [PDF]. Two types of indexing techniques were proposed for storing terms along with properties: horizontal (two fields: one for terms, one for properties) and vertical (one field per property). Each puts its own requirements on the query engine: an alignment operator for horizontal indexing and field restrictions for vertical indexing (MG4J was used). Evaluation was done on the BTC 2009 dataset (the same one used in the SemSearch Entity Search track). Somewhat surprisingly, the horizontal index appears to be the more efficient one, both for keyword queries and for field restricts.
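A rough way to picture the two layouts (purely illustrative; the field names are mine and this is not the actual MG4J configuration from the paper):

```python
# Sketch of the two ways an RDF resource could be laid out as an indexable document.
resource = [
    ("foaf:name", "barney"),
    ("foaf:name", "pell"),
    ("foaf:workplaceHomepage", "bing.com"),
]

# Horizontal layout: two parallel fields, one with the values and one with the
# property of the value at the same position; restricting a term to a property
# needs an alignment operator that checks positions across the two fields.
horizontal_doc = {
    "token":    [value for _, value in resource],
    "property": [prop for prop, _ in resource],
}

def horizontal_match(doc, term, prop):
    """Emulate the alignment operator: term and property at the same position."""
    return any(t == term and p == prop
               for t, p in zip(doc["token"], doc["property"]))

# Vertical layout: one field per property; a property restriction is just an
# ordinary field restriction, but the schema grows with the set of properties.
vertical_doc = {}
for prop, value in resource:
    vertical_doc.setdefault(prop, []).append(value)

print(horizontal_match(horizontal_doc, "pell", "foaf:name"))  # True
print("pell" in vertical_doc.get("foaf:name", []))            # True
```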
A very interesting paper, A Large-Scale System for Annotating and Querying Quotations in News Feeds by Jisheng Liang et al. [PDF], describes the method evri.com uses for quotation search. Specifically, they look for patterns of “What did <speaker> say about <subject>?”, where the speaker or subject can be specified as a specific entity, a facet, keywords, or boolean combinations of these. The collected data is indexed as (Subject, Action, Object) triples. Currently, more than 10 million quotes are stored, with an additional 60K added each day. I was especially excited to hear some of the under-the-hood details. Their entity repository comprises about 2 million instances, mainly ones covered regularly in news and/or blogs (and it is expanding continuously). The properties stored for each entity are: a unique ID, type and facets in a taxonomy (e.g., Person/Sports/Athlete/Basketball_Player), description, synonyms and aliases, type- and facet-specific attributes (e.g., birth date and birth place for a person), and relation properties (e.g., teams and league for basketball players).
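As a rough illustration of how such triples could back the query pattern, here is a hedged Python sketch; the class, field names, and matching logic are my own invention, not evri.com’s system:

```python
from dataclasses import dataclass

# Hypothetical sketch of answering "What did <speaker> say about <subject>?"
# over (Subject, Action, Object) triples.

@dataclass
class Quotation:
    subject: str   # the speaker, resolved to an entity ID
    action: str    # the reporting verb ("say", "announce", ...)
    obj: str       # what the quote is about: an entity, facet, or keywords
    text: str      # the quoted sentence itself


def quotes_by_speaker_about(index, speaker, topic):
    """Filter the triple index on speaker (Subject) and topic (Object)."""
    return [q.text for q in index if q.subject == speaker and q.obj == topic]


if __name__ == "__main__":
    index = [
        Quotation(subject="example_player_id", action="say",
                  obj="example_team_id", text="(quoted sentence)"),
    ]
    print(quotes_by_speaker_about(index, "example_player_id", "example_team_id"))
```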
Of course, I should not forget to advertise our position paper, authored by me, Edgar Meij, and Maarten de Rijke, entitled Entity Search: Building Bridges between Two Worlds [PDF|Slides]. We consider the task of entity search and examine to what extent state-of-the-art information retrieval (IR) and semantic web (SW) technologies are capable of answering information needs that focus on entities. We also explore the potential of combining IR with SW technologies to improve end-to-end performance on a specific entity search task (related entity finding). We arrive at, and motivate, a proposal to combine text-based entity models with semantic information from the Linked Open Data cloud.
Finally, the Entity Search Track was discussed (not to be confused with the TREC Entity track; draft guidelines for that are coming very soon). First of all, I must say that I really like the initiative and appreciate all the effort the organizers put into setting it up. It is also refreshing to see more and more efforts within the Semantic Web community towards proper evaluation platforms (see also the SEALS paper from this workshop).
There are, however, a few things that make me worry:
- In its current setup, the task is arguably “just” RDF document search. From an assessment point of view, how results are rendered is really critical; maybe something more semantic-ish is needed here? (There is structure, and it is a graph after all, so …)
- The UMass runs, which according to the authors “… were simple language modeling runs with Indri”, ranked 2nd, only a tiny bit behind the top run from Yahoo Barcelona, who label themselves as “Semantic Web people”. Does this mean that “semantic search” comes down to fielded IR (cf. also the best paper of the workshop)?
- Reusability aspects of the collection were seemingly not considered at all. Sure, you can evaluate semantic search (as well as many other tasks) using Amazon’s Mechanical Turk, but isn’t the point to compare systems on the same task and input (topics)?
- The lack of duplicate detection was a real slip, especially in this setting, where the same entity is likely to be represented in multiple sources.
- After all this, I was a bit surprised that the organizers were already planning to move on to the next task, as if this one were solved. The first edition of a track is usually just a warm-up (at least in the TREC, CLEF, INEX, etc. experience); the real deal begins in year two.
I can understand that many in the SW community see IR people as obsessed with evaluation. But that does not mean there is nothing to learn from it (both the successes and the failures).
Nice notes – could you elaborate on your comments about the reusability of the test collection?
I thought the idea was to set up a benchmarking platform for semantic entity search (well, entity search over semantic data). The main point of benchmarking is comparing different methods/approaches. So I was expecting a message like “if you’re working in this area, you should use this collection for evaluation, and here are a few things you should be aware of if you want to compare your method against the best performing ones here…” (and a mention that the results you return might not have been assessed, etc.). Instead, in my interpretation, the take-home message was: “here is how you can evaluate search results using Mechanical Turk”.