Data

For its first edition, this lab gives registered participants access to a massive collection of microblogs and URLs related to cultural festivals around the world.

It will allow researchers in IR and NLP to experiment with a broad variety of multilingual microblog search techniques (Wikipedia entity search, automatic summarization, language identification, text localization, etc.).

A login is required to access the data. Once registered on CLEF, each registered team can obtain up to 4 extra individual logins by writing to admin@talne.eu.

Extensive textual references will be provided by the organizers.


Articles in this section

  • Wikipedia XML corpus for summary generation

    by Eric SanJuan

    Wikipedia is under a Creative Commons license, and its contents can be used to contextualize tweets or to build complex queries referring to Wikipedia entities.
    We have extracted an average of 10 million XML documents from Wikipedia per year since 2012 in the four main Twitter languages: en, (...)

  • The festival galleries dataset

    by Eric SanJuan

    This dataset supports experiments in microblog search and stream summarization.
    Microblog collection
    The document collection is provided to registered participants by the ANR GAFES project. It consists of a pool of more than 50M unique microblogs from different sources with their meta-information (...)