CLEF 2010 Keynote 1

I am at the CLEF conference this week. Here are my raw, unedited notes from the first keynote, IR Between Science and Engineering, and the Role of Experimentation, by Norbert Fuhr.

1. Weaknesses of current IR evaluation

  • Finding of Armstrong et al., CIKM 2009
    • In over 90% of the papers, the claimed improvements exist only due to poor baselines and do not beat the best TREC results
    • Improvements don’t add up
    • Need for standardized benchmarks => registry of runs under standardized conditions
  • Measures vs. User standpoints
    • MAP is a system-oriented measure
    • More focus on user-oriented measures
      • Specify user standpoints/tasks
      • Use corresponding measure
    • More focus on high-precision vs. high-recall methods
  • Significant vs. Substantial
    • Significance tests
      • Appropriate: Wilcoxon test, t-test, randomization test
      • A significance test only tells us that there is a difference, not whether the difference is substantial
    • Sparck Jones, 1974. “performance differences…
      • <5% must be disregarded
      • 5-10% as noticeable
      • >10% as material”
    • Do users care?
      • Users are able to compensate for differences in retrieval quality
      • But they notice quality differences
      • Effect of retrieval quality on users’ task success rate unclear
      • Only weak correlation between retrieval quality and user satisfaction
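
To make the significant vs. substantial distinction concrete, here is a minimal Python sketch. It assumes per-topic average precision scores are already available for a baseline and a new system. The scores are made up, and the function names `randomization_test` and `sparck_jones_label` are hypothetical, chosen only for illustration. The test shown is a simple paired sign-flip randomization test, one of the three tests listed above.

```python
import random
from statistics import mean


def sparck_jones_label(baseline_map, system_map):
    """Classify a relative MAP difference with the Sparck Jones thresholds."""
    rel = abs(system_map - baseline_map) / baseline_map
    if rel < 0.05:
        return "disregard"
    if rel <= 0.10:
        return "noticeable"
    return "material"


def randomization_test(ap_a, ap_b, trials=10000, seed=0):
    """Estimate a two-sided p-value with a paired randomization test."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(ap_a, ap_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the two systems are exchangeable per topic,
        # so each per-topic difference keeps or flips its sign with equal chance.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(permuted)) >= observed:
            hits += 1
    return hits / trials


# Hypothetical per-topic average precision values (8 topics).
baseline = [0.20, 0.35, 0.50, 0.10, 0.42, 0.31, 0.27, 0.60]
system = [0.24, 0.37, 0.49, 0.15, 0.45, 0.36, 0.30, 0.61]

p_value = randomization_test(baseline, system)
label = sparck_jones_label(mean(baseline), mean(system))
print(f"MAP {mean(baseline):.3f} -> {mean(system):.3f}, "
      f"p = {p_value:.3f}, difference is {label}")
```

The two answers are independent of each other: the p-value says whether a difference is likely to be real, while the label says whether it is large enough to matter under the quoted thresholds.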


2. The discovery of context

  • “The Cranfield Cave” (Järvelin, 2009)
  • Studies on end-user behaviour
    • Searchers use unorthodox queries and sessions (very short queries, combinations thereof)
    • Searchers seek to optimize entire information seeking process
      • single shot vs. process
      • rationalism vs. incrementalism
    • IRS must support the whole process; a focus on single queries is not helpful!
  • Session support of open source IRS (Van der Bouwheide, IRF Symposium 2010)
    • Most important features for patent search are not supported
  • Experimental study: effect of support (Kriewel and Fuhr, ECIR 2010 best student paper)
  • Saracevic’s levels of evaluation context
    • Content – Processing – Engineering – Interface – Individual – Institutional – Social
    • System centered: Content – … – Interface
    • User centered: Interface – … – Social
    • User studies focus on Interface+Individual
    • Mainstream IR focuses on Processing
    • We should focus on all levels


3. Towards a more scientific approach in IR

  • Experimental paradigm
    • Controlled variables: Documents, Topics, Relevance assessments
    • Independent variables: Document/topic representation methods, Matching method, Evaluation procedure
    • Dependent variable: Evaluation results
  • How is IR Quality affected by document and topic characteristics (length, …)?
  • What we have so far are observations at a few points in an N-dimensional space of IR experimentation
  • Is Computer Science Truly Scientific? (Gonzalo Genova, CACM 7/2010)
    • Empirical approaches in Science:
      • Verificationism: Optimistic approach – induction is possible. But: Russell’s “inductive turkey”
      • Falsificationism: Pessimistic approach (Popper) – conjecture can only be falsified
  • Purely empirical approaches will not lead to scientific progress
  • Need to develop theoretical models
  • Science: Why? vs. Engineering: How?
  • Science: Explanatory power vs. Engineering: Good results (on some collections)
  • Science: Basis for broad variety of approaches vs. Engineering: Potential for some further improvements (in limited settings)
  • Science: Long-standing vs. Engineering: Short-living
  • Goals of a scientific approach
    • Better understanding
      • Verifiable theories we have so far
        • Probability Ranking Principle
        • Relevance-oriented probabilistic models
        • Language Models
        • Term-weighting axioms
    • Better understanding
      • Independence assumption (in Binary Independence Retrieval): terms are distributed independently in the relevant and in the irrelevant documents => Did anyone ever check this?
    • Better prediction
      • Define parameter set for describing test collections (documents, topics)
      • Characterize existing test collections
      • Regard subcollections
      • Design new collections
      • Analyze
      • Repeat the cycle
    • How to define the boundaries of current knowledge?
    • Broaden view
      • IR in context
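
The "Did anyone ever check this?" question about the independence assumption can at least be probed on a judged collection. The following Python sketch is only one possible check, under strong simplifying assumptions: documents are reduced to sets of terms, the relevant and non-relevant sets are tiny made-up examples, and `cooccurrence_ratio` is a hypothetical helper, not something from the talk. It compares the observed joint frequency of two terms with the product of their individual frequencies. A ratio near 1.0 is consistent with independence, while values far from 1.0 suggest the assumption does not hold for that term pair.

```python
def cooccurrence_ratio(docs, term_a, term_b):
    """Return P(a, b) / (P(a) * P(b)) over docs, or None if a term never occurs."""
    n = len(docs)
    p_a = sum(term_a in d for d in docs) / n
    p_b = sum(term_b in d for d in docs) / n
    p_ab = sum(term_a in d and term_b in d for d in docs) / n
    if p_a == 0 or p_b == 0:
        return None  # ratio is undefined when a term is absent
    return p_ab / (p_a * p_b)


# Hypothetical judged documents for one topic, each reduced to a set of terms.
relevant = [
    {"patent", "search", "claim"},
    {"patent", "claim"},
    {"search", "engine"},
    {"patent", "claim", "prior"},
]
nonrelevant = [
    {"search", "engine"},
    {"engine", "car"},
    {"patent", "car"},
    {"claim", "insurance"},
]

rel_ratio = cooccurrence_ratio(relevant, "patent", "claim")
non_ratio = cooccurrence_ratio(nonrelevant, "patent", "claim")
print(f"relevant: {rel_ratio:.2f}, non-relevant: {non_ratio:.2f}")
```

On a real test collection one would repeat this over many term pairs and topics, and use a proper statistical test instead of eyeballing the ratio.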


Conclusions

  • IR is still in its scientific adolescence
  • Engineering approaches are fine for tuning specific applications (e.g., Web search), but…
  • For the broad range of IR applications, only more scientific approaches will be helpful


Q&A:

  • Q: What should we do? Where should we look for theories?
    • A: More thorough analysis: to what extent are results affected by certain parameters?
    • Stop accepting papers with pseudo improvements
    • We are too focused on improvements in MAP; we should give more credit to scientific findings
  • Q: One requirement of experimental science is the repeatability of the experiment. What is going to happen if we bring in the user into the loop (i.e., bringing in context)?
    • A: This is the same, for example, in psychology, and it’s not a problem there…
  • Q: Role of evaluation campaigns…
    • A: Conferences should not accept papers with low baselines

 
