The good and the bad news

A quick update on the TREC Entity track, which reminds me of the classic good news, bad news situation. The good news is that we have just reached 100 members on the TREC Entity mailing list. The bad news is that almost all of them are mute.
On a more serious note, the track guidelines need to be finalized very soon. One way of interpreting the silence is that people are happy with the proposed task and all details are clear. There may be other (less positive) interpretations. Whichever the case might be, in the absence of discussion, the organizers will simply dictate what is to be done.

Seminar on Searching and Ranking in Enterprises

Today, on the occasion of the PhD defense of Pavel Serdyukov, a seminar on enterprise search was held at the University of Twente. Three of Pavel’s committee members gave talks: David Hawking, Iadh Ounis, and Maarten de Rijke.
The summaries of the talks will soon be uploaded.
Of course, the main attraction of the day was Pavel’s defense. His PhD thesis is entitled “The search for expertise: Beyond direct evidence”. He was confronted with interesting, and sometimes quite challenging, questions, but handled them to the satisfaction of the committee. Congratulations, Pavel; I mean, Dr. Serdyukov!

Back on TREC

Yes, things have been quiet lately on the TREC Entity homepage. Now that training topics have been made available, I sincerely hope that this is about to change. We are in the process of developing test topics and finalizing the guidelines, so make sure your voice is heard if you want something different…

TREC Entity – draft guidelines available

The first year of the track will investigate the problem of related entity finding. Given the name and homepage of an entity, as well as the type of the target entity, find related entities that are of the target type. Entity types are limited to people, organizations, and products. Participating systems need to return homepages of related entities, and, optionally, a string answer that represents the entity concisely (i.e., the name of the entity).
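
To make the input and output of the task concrete, here is a minimal Python sketch. It is only an illustration of the description above, not the official topic or run format; the class names (`Topic`, `RelatedEntity`), the field names, the type labels, and the example.com values are all made up for this post.

```python
from dataclasses import dataclass
from typing import Optional

# The three target types named in the task description.
# The exact labels are an assumption made for this sketch.
TARGET_TYPES = {"person", "organization", "product"}


@dataclass(frozen=True)
class Topic:
    """Input: a source entity (name and homepage) plus the wanted target type."""
    entity_name: str
    entity_homepage: str
    target_type: str

    def __post_init__(self):
        # Reject anything outside the three allowed entity types.
        if self.target_type not in TARGET_TYPES:
            raise ValueError(f"unsupported target type: {self.target_type!r}")


@dataclass(frozen=True)
class RelatedEntity:
    """Output: the homepage of a related entity and, optionally, its name."""
    homepage: str
    name: Optional[str] = None


# Hypothetical topic and answers, for illustration only.
topic = Topic("Example Corp", "http://example.com/", "person")
answers = [
    RelatedEntity("http://example.com/~alice/", name="Alice Example"),
    RelatedEntity("http://example.com/~bob/"),  # the string answer is optional
]
```

A system would take a `Topic` and produce a ranked list of `RelatedEntity` objects; how that list is scored is up to the guidelines, not this sketch.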

The draft guidelines are available at this page.

Dataset of 1 billion web pages

The ambitious goal set out for TREC 2009 was to have a collection of 1 billion web pages: one dataset that can be shared by several tracks (specifically, the Entity, Million Query, Relevance Feedback, and Web tracks).
In November 2008, when this was discussed at the TREC 2008 conference, people were concerned with two main questions: (1) Is it possible to create such a crawl (given the serious time constraints)? (2) Are we going to be able to handle (at least, index) this amount of data?
Jamie Callan was confident that they (the Language Technologies Institute at Carnegie Mellon University) could build this crawl by March 2009. His confidence was not unfounded, since they had managed to create a crawl of a few hundred million web pages earlier. Yet, the counter for the one billion document collection was to be started from 0 again…
Against this background, let us fast forward to the present. The crawl was recently completed and the dataset, referred to as ClueWeb09, is now available. It is 25 terabytes uncompressed (5 terabytes compressed), which brings me back to the troubling question: are we going to be able to handle that? We (being ILPS) will certainly do our best to step up to the challenge. I shall post about our attempts in detail later on.
But the fact is that doing retrieval on 1 billion documents is too big a bite for many research groups, as it calls for nontrivial software and hardware architecture (note that it has 40 times as many documents as the Gov2 corpus, which, with its 25 million documents, I believe was the largest web crawl available to the research community so far). Therefore, a “Category B” subset of the collection is also available, consisting of “only” 50 million English pages. Some of the tracks (the Entity track for sure) will use only the Category B subset in 2009.
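
For a sense of scale, the ratios behind these numbers can be checked in a few lines of Python. The figures are the rounded ones quoted in this post, and the variable names are made up for the example.

```python
# Rounded figures as quoted in this post.
full_docs = 1_000_000_000       # full ClueWeb09 collection
gov2_docs = 25_000_000          # Gov2 corpus
category_b_docs = 50_000_000    # "Category B" English subset

uncompressed_tb = 25
compressed_tb = 5

# The full collection relative to Gov2, by document count.
print(full_docs // gov2_docs)           # 40

# The full collection relative to the Category B subset.
print(full_docs // category_b_docs)     # 20

# Category B relative to Gov2.
print(category_b_docs // gov2_docs)     # 2

# Overall compression ratio of the dataset.
print(uncompressed_tb / compressed_tb)  # 5.0
```

So even the “small” Category B subset is twice the size of Gov2 by document count, while being one twentieth of the full crawl.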