CLEF 2010 Keynote 2

The second CLEF 2010 keynote, entitled Retrieval Evaluation in Practice, was given by Ricardo Baeza-Yates. As yesterday, here are my raw notes from the lecture.

Intro

Why to measure?
- For Yahoo!: to compare against competition (other web search systems)
The main questions then is: How to compare?
A few things to note
- Queries and users are related
- Measures should be related to the users => at the end what we care about is the satisfaction of the user

Choosing the queries

How many?
- Depends on many things
- To get meaningful statistical results (although that doesn’t necessarily mean much)
What type of queries?
- Easy, difficult, popular, long tail, etc.
- Understand the long tail!
  - Many queries, each asked very few times, make up a large fraction of all queries
  - Heavy tail queries matter to a lot of people
Which queries?
- How to sample?
Search engines side effects
- Misspelled queries, etc.

Sampling the query distribution (Zaragoza et al., CIKM 2010)

Standard approaches
- Uniform sampling, no repetitions
- Large i.i.d. sample, repetitions
  - But the distribution is a power law!
  - Frequencies will not be realistic
Solution: use a larger sample to estimate frequency
Example:
- 50M queries
- Samples of 1K, 100K and 10M
- Use other samples to compute the frequencies for the 1K sample
- Plot the rank of each query in the 50M sample

Getting the right answers

Judgments
- Professional editors: expensive
- Crowdsourcing: works!
- User: clicks
How deep to judge?
- Top 5, 10, 20, …
- Tradeoff between depth and number of queries is an interesting problem!
Individual queries vs. sessions
Crucial for ML ranking algorithms

Example: using clicks from image MLR (Murdock, van Zwol et al. 2009)

Learning from clicks
Important: the user is biased by the search interface
User clicks give relative preference
Train and evaluate in “blocks”
- People read text from top-to-bottom, left-to-right
- For images, they first pick the best from the lot, then look at others around it
Textual features
- TFIDF over different fields
- Cosine similarity
- Max/Avg tfidf score
- …
Visual features
- Colour (histogram, layout, scalable color)
- Edge (CEDD, Edge histogram)
- Texture
Classification
- Two classes: clicked and non-clicked
- Assume they are separable by a hyperplane
Results
- Retrieval baseline < Textual features < Visual features < Combined features
Summary
- Combination of Text and Visual most powerful
- Block construction effective even though doesn’t reflect human gaze
- Most important: clicks work!

Measuring

What to measure?
MAP, MRR, DCG, NDCG
- Are they reasonable in the case of ties?
How valid are these measures for the users?
- Do they really reflect what they are doing? => We don’t have an answer yet
Individual queries vs. sessions
- New session track in TREC
Why there is an obsession with the average?
- Variance is also important
  - What about Search engine 1 (SE1) that amazes you half of the time and makes you unhappy for the other half of queries vs. SE2 that performs worse on average, but consistently (no surprises)
- Improvements in solved queries are not relevant
- Large improvements might be better
Measure also the variance (Hosseini et al., IRFC 2010)
Count the number of queries improved
- UIR: Unanimous Improvement Ratio (Amigo, Gonzalo et al., 2010)
Weight the relevant improvements

Comparing

Web Search Solved? All Result Rankings the Same? (Zaragoza et al., CIKM 2010)

Traditionally, search engines are compared based on DCG
An alternative methodology relying on classification of queries based on
- query difficulty
- relative DCG performance of search engines
Queries are classified under multiple sets
- Solved set
- Hard set
- Tied set
- Disruptive sets (one set per search engine)
It’s crucial to correct for sampling

Q&A

Q: Hard vs. easy queries?
- A: We don’t know :) Typically: popular queries are easy, long-tail queries are hard; hard queries are the ones with only a few resources. Also, relevance is personal.
Q: How to measure the complexity of the query (for the SE)?
- A: Could be based on whether one collection (e.g., one language) can answer the query.

Leave a Comment Cancel reply