CLEF 2010 Keynote 2

The second CLEF 2010 keynote, entitled Retrieval Evaluation in Practice, was given by Ricardo Baeza-Yates. As yesterday, here are my raw notes from the lecture.


  • Why to measure?
    • For Yahoo!: to compare against competition (other web search systems)
  • The main questions then is: How to compare?
  • A few things to note
    • Queries and users are related
    • Measures should be related to the users => at the end what we care about is the satisfaction of the user

Choosing the queries

  • How many?
    • Depends on many things
    • To get meaningful statistical results (although that doesn’t necessarily mean much)
  • What type of queries?
    • Easy, difficult, popular, long tail, etc.
    • Understand the long tail!
      • Many queries, each asked very few times, make up a large fraction of all queries
      • Heavy tail queries matter to a lot of people
  • Which queries?
    • How to sample?
  • Search engines side effects
    • Misspelled queries, etc.

Sampling the query distribution (Zaragoza et al., CIKM 2010)

  • Standard approaches
    • Uniform sampling, no repetitions
    • Large i.i.d. sample, repetitions
      • But the distribution is a power law!
      • Frequencies will not be realistic
  • Solution: use a larger sample to estimate frequency
  • Example:
    • 50M queries
    • Samples of 1K, 100K and 10M
    • Use other samples to compute the frequencies for the 1K sample
    • Plot the rank of each query in the 50M sample

Getting the right answers

  • Judgments
    • Professional editors: expensive
    • Crowdsourcing: works!
    • User: clicks
  • How deep to judge?
    • Top 5, 10, 20, …
    • Tradeoff between depth and number of queries is an interesting problem!
  • Individual queries vs. sessions
  • Crucial for ML ranking algorithms

Example: using clicks from image MLR (Murdock, van Zwol et al. 2009)

  • Learning from clicks
  • Important: the user is biased by the search interface
  • User clicks give relative preference
  • Train and evaluate in “blocks”
    • People read text from top-to-bottom, left-to-right
    • For images, they first pick the best from the lot, then look at others around it
  • Textual features
    • TFIDF over different fields
    • Cosine similarity
    • Max/Avg tfidf score
  • Visual features
    • Colour (histogram, layout, scalable color)
    • Edge (CEDD, Edge histogram)
    • Texture
  • Classification
    • Two classes: clicked and non-clicked
    • Assume they are separable by a hyperplane
  • Results
    • Retrieval baseline < Textual features < Visual features < Combined features
  • Summary
    • Combination of Text and Visual most powerful
    • Block construction effective even though doesn’t reflect human gaze
    • Most important: clicks work!


  • What to measure?
    • Are they reasonable in the case of ties?
  • How valid are these measures for the users?
    • Do they really reflect what they are doing? => We don’t have an answer yet
  • Individual queries vs. sessions
    • New session track in TREC
  • Why there is an obsession with the average?
    • Variance is also important
      • What about Search engine 1 (SE1) that amazes you half of the time and makes you unhappy for the other half of queries vs. SE2 that performs worse on average, but consistently (no surprises)
    • Improvements in solved queries are not relevant
    • Large improvements might be better
  • Measure also the variance (Hosseini et al., IRFC 2010)
  • Count the number of queries improved
    • UIR: Unanimous Improvement Ratio (Amigo, Gonzalo et al., 2010)
  • Weight the relevant improvements


Web Search Solved? All Result Rankings the Same? (Zaragoza et al., CIKM 2010)

  • Traditionally, search engines are compared based on DCG
  • An alternative methodology relying on classification of queries based on
    • query difficulty
    • relative DCG performance of search engines
  • Queries are classified under multiple sets
    • Solved set
    • Hard set
    • Tied set
    • Disruptive sets (one set per search engine)
  • It’s crucial to correct for sampling


  • Q: Hard vs. easy queries?
    • A: We don’t know :) Typically: popular queries are easy, long-tail queries are hard; hard queries are the ones with only a few resources. Also, relevance is personal.
  • Q: How to measure the complexity of the query (for the SE)?
    • A: Could be based on whether one collection (e.g., one language) can answer the query.


Leave a Comment