The second CLEF 2010 keynote, entitled Retrieval Evaluation in Practice, was given by Ricardo Baeza-Yates. As with yesterday's keynote, here are my raw notes from the lecture.
Intro
- Why measure?
- For Yahoo!: to compare against competition (other web search systems)
- The main question then is: how to compare?
- A few things to note
- Queries and users are related
- Measures should be related to the users => in the end, what we care about is user satisfaction
Choosing the queries
- How many?
- Depends on many things
- To get meaningful statistical results (although that doesn’t necessarily mean much)
- What type of queries?
- Easy, difficult, popular, long tail, etc.
- Understand the long tail!
- Many queries, each asked very few times, make up a large fraction of all queries
- Heavy tail queries matter to a lot of people
- Which queries?
- How to sample?
- Search engine side effects
- Misspelled queries, etc.
Sampling the query distribution (Zaragoza et al., CIKM 2010)
- Standard approaches
- Uniform sampling, no repetitions
- Large i.i.d. sample, repetitions
- But the distribution is a power law!
- Frequencies will not be realistic
- Solution: use a larger sample to estimate the frequencies (see the sketch after this list)
- Example:
- 50M queries
- Samples of 1K, 100K and 10M
- Use other samples to compute the frequencies for the 1K sample
- Plot the rank of each query in the 50M sample
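To make the frequency-estimation trick concrete, here is a minimal sketch in Python. This is my own illustration, not code from the talk; the Zipf exponent and sample sizes are made up. The idea: draw a large i.i.d. sample from a power-law query distribution, take a small uniform sample of distinct queries for evaluation, and estimate their frequencies and ranks from the larger sample rather than from the small sample itself.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical query log drawn from a power-law (Zipf-like) distribution.
# The exponent and sample sizes are illustrative, not the talk's actual numbers.
big_sample = rng.zipf(a=1.5, size=10_000_000)        # large i.i.d. sample, with repetitions
ids, counts = np.unique(big_sample, return_counts=True)
freq_of = dict(zip(ids.tolist(), (counts / counts.sum()).tolist()))

# Rank of every distinct query in the large sample (1 = most frequent).
rank_of = {q: r + 1 for r, q in enumerate(ids[np.argsort(-counts)].tolist())}

# Small evaluation sample: 1K distinct queries, sampled uniformly, no repetitions.
eval_queries = rng.choice(ids, size=1_000, replace=False)

# Estimate each evaluation query's frequency and rank from the *larger* sample,
# instead of relying on its (0 or 1) count in the tiny sample itself.
estimates = {int(q): (freq_of[int(q)], rank_of[int(q)]) for q in eval_queries}
```

With real data the big sample would come from the query log itself (e.g. the 50M-query log in the example above), but the mechanics are the same.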
Getting the right answers
- Judgments
- Professional editors: expensive
- Crowdsourcing: works!
- User: clicks
- How deep to judge?
- Top 5, 10, 20, …
- The tradeoff between judging depth and the number of queries is an interesting problem! (A tiny illustration follows this list.)
- Individual queries vs. sessions
- Crucial for ML ranking algorithms
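The depth-versus-breadth tradeoff boils down to simple arithmetic once you fix a judging budget; the budget below is invented purely for illustration.

```python
# With a fixed judging budget, depth and number of queries trade off directly.
budget = 10_000                      # total relevance judgments we can afford (illustrative)
for depth in (5, 10, 20):
    print(f"judge top-{depth:>2}: {budget // depth:>5} queries")
# judge top- 5:  2000 queries
# judge top-10:  1000 queries
# judge top-20:   500 queries
```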
Example: using clicks for image MLR (Murdock, van Zwol et al., 2009)
- Learning from clicks
- Important: the user is biased by the search interface
- User clicks give relative preference
- Train and evaluate in “blocks”
- People read text from top-to-bottom, left-to-right
- For images, they first pick the best from the lot, then look at others around it
- Textual features
- TF-IDF over different fields
- Cosine similarity
- Max/avg TF-IDF score
- …
- Visual features
- Colour (histogram, layout, scalable color)
- Edge (CEDD, Edge histogram)
- Texture
- Classification
- Two classes: clicked and non-clicked
- Assume they are separable by a hyperplane (see the sketch after this list)
- Results
- Retrieval baseline < Textual features < Visual features < Combined features
- Summary
- Combination of textual and visual features is the most powerful
- Block construction is effective even though it doesn't reflect human gaze
- Most important: clicks work!
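A rough sketch of how I picture the clicked/non-clicked classification with block-wise splitting. The data, feature names, and model choice here are my own stand-ins, not the authors' actual pipeline: each result is represented by textual and visual features, labelled clicked or non-clicked, results shown together form a block, and a linear model separates the two classes by a hyperplane.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GroupShuffleSplit

# Toy data: each row is one image result; features mix textual signals
# (e.g. TF-IDF cosine scores per field) and visual descriptors (colour/edge/texture).
rng = np.random.default_rng(0)
n_results = 2_000
X_text = rng.normal(size=(n_results, 4))       # stand-in for TF-IDF based features
X_vis = rng.normal(size=(n_results, 6))        # stand-in for colour/edge/texture features
X = np.hstack([X_text, X_vis])
y = rng.integers(0, 2, size=n_results)         # 1 = clicked, 0 = non-clicked
blocks = rng.integers(0, 200, size=n_results)  # block id: results shown together on one page

# Train/evaluate "in blocks": results from the same block never straddle the split,
# mirroring the idea that click preferences are only meaningful within one result page.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=blocks))

# Linear model: assumes clicked vs. non-clicked are (roughly) separable by a hyperplane.
clf = LinearSVC()
clf.fit(X[train_idx], y[train_idx])
print("held-out accuracy:", clf.score(X[test_idx], y[test_idx]))
```

On random data the accuracy is of course around chance; the point is only the shape of the setup (block-aware splitting plus a linear clicked/non-clicked classifier).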
Measuring
- What to measure?
- MAP, MRR, DCG, NDCG
- Are they reasonable in the case of ties?
- How valid are these measures for the users?
- Do they really reflect what users are doing? => We don't have an answer yet
- Individual queries vs. sessions
- New session track in TREC
- Why is there an obsession with the average?
- Variance is also important
- What about a search engine SE1 that amazes you on half of the queries and makes you unhappy on the other half, vs. SE2, which performs worse on average but consistently (no surprises)?
- Improvements in solved queries are not relevant
- Large improvements might be better
- Also measure the variance (Hosseini et al., IRFC 2010); a small sketch follows this list
- Count the number of queries improved
- UIR: Unanimous Improvement Ratio (Amigo, Gonzalo et al., 2010)
- Weight the relevant improvements
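To ground the average-versus-variance point, here is a small sketch: the DCG/NDCG definitions are the standard ones, but the two engines' per-query scores are invented to mimic an erratic SE1 and a mediocre-but-consistent SE2.

```python
import numpy as np

def dcg(gains, k=10):
    """Standard DCG: gain / log2(rank + 1), summed over the top k results."""
    gains = np.asarray(gains, dtype=float)[:k]
    return float(np.sum(gains / np.log2(np.arange(2, gains.size + 2))))

def ndcg(gains, k=10):
    """DCG normalised by the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

# Invented per-query NDCG scores for two hypothetical engines over the same 1,000 queries:
# SE1 is erratic (great or terrible), SE2 is mediocre but consistent.
rng = np.random.default_rng(7)
se1 = rng.choice([0.95, 0.15], size=1_000)
se2 = np.full(1_000, 0.52)

for name, scores in (("SE1", se1), ("SE2", se2)):
    print(name, "mean:", round(scores.mean(), 3), "variance:", round(scores.var(), 3))
print("queries where SE1 beats SE2:", int(np.sum(se1 > se2)))
```

The averages end up very close, while the variances (and the per-query win counts) tell quite different stories about which engine a user would rather live with.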
Comparing
Web Search Solved? All Result Rankings the Same? (Zaragoza et al., CIKM 2010)
- Traditionally, search engines are compared based on DCG
- An alternative methodology relying on classification of queries based on
- query difficulty
- relative DCG performance of search engines
- Queries are classified into multiple sets (a toy version is sketched after this list)
- Solved set
- Hard set
- Tied set
- Disruptive sets (one set per search engine)
- It’s crucial to correct for sampling
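A toy version of the query-set classification, just to show the shape of the methodology; the thresholds below are my guesses, not the paper's actual definitions.

```python
# Classify a query into one of the sets above, given per-engine DCG scores.
# Thresholds are made up for illustration only.
def classify(dcg_by_engine, solved_thr=0.9, hard_thr=0.2, tie_margin=0.05):
    best = max(dcg_by_engine.values())
    worst = min(dcg_by_engine.values())
    if worst >= solved_thr:
        return "solved"                       # every engine already does well
    if best <= hard_thr:
        return "hard"                         # no engine does well
    if best - worst <= tie_margin:
        return "tied"                         # engines are indistinguishable
    winner = max(dcg_by_engine, key=dcg_by_engine.get)
    return f"disruptive:{winner}"             # one engine clearly ahead

print(classify({"SE1": 0.95, "SE2": 0.93}))   # solved
print(classify({"SE1": 0.10, "SE2": 0.15}))   # hard
print(classify({"SE1": 0.60, "SE2": 0.20}))   # disruptive:SE1
```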
Q&A
- Q: Hard vs. easy queries?
- A: We don’t know :) Typically: popular queries are easy, long-tail queries are hard; hard queries are the ones with only a few resources. Also, relevance is personal.
- Q: How to measure the complexity of the query (for the SE)?
- A: Could be based on whether one collection (e.g., one language) can answer the query.