1 - Content Analysis

Organisers: IRIT, Université de Montréal, LISIS


Given a stream of microblogs, the content analysis task consists of:

  1. filtering microblogs dealing with festivals;
  2. language(s) identification;
  3. event localization;
  4. author categorization (official account, participant, follower or scam);
  5. Wikipedia entity recognition and translation into four target languages: English, Spanish, Portuguese and French;
  6. automatic summarization of linked Wikipedia pages in the four target languages.

Each item will be evaluated independently; however, language identification could affect Wikipedia linking and the resulting summaries.

A login is required to access the data. Once registered on CLEF, each registered team can obtain up to 4 extra individual logins by writing to admin@talne.eu.



Each individual participant can submit only one run per sub-task, so up to 5 runs per team. Submissions will be uploaded to a MySQL server through a web interface.

The expected format for each subtask is a table whose primary key is the micro-blog id, followed by some extra fields:

  1. filtering: one extra field with a normalized score between 0 and 1, 1 being the maximal score for a micro-blog surely related to a specific festival event.
  2. language(s) identification: three extra fields containing two-letter ISO 639-1 language codes, the first field for the main language, the last field for a subsidiary or less probable language.
  3. event localization: five extra fields for a ranked list of cities (IATA codes) related to the micro-blog.
  4. author categorization: one extra field with one of the categories ’official’ when the microblog has been posted by the organizers of the festival, a media outlet broadcasting the event or an invited artist; ’participant’ when it has been posted by a non-official individual in the audience; ’follower’ for individuals following the festival but not taking part in it; and scam or troll.
  5. entity recognition: one table per target language, each one with 10 extra fields corresponding to a ranked list of Wikipedia entries (page titles) related to the micro-blog. The list is ranked by decreasing relevance. Participants can submit fewer than four languages.
  6. automatic summarization of linked Wikipedia pages in every language: one table per language, one extra field with a short summary of 120 words (sequences of characters separated by spaces).
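
As an illustration of the formats above, here is a minimal Python sketch that checks rows for the filtering and language identification tables and trims a summary to the word limit before upload. It is not an official validator. The function names, the tuple layout of a row, the use of None for unused language fields and the reading of 120 words as an upper bound are all assumptions made for this example. The language check only tests that a code has the two-letter shape; it does not look the code up in ISO 639-1.

```python
import re

# Shape check only: two lowercase letters (not a lookup in the ISO 639-1 registry).
TWO_LETTER_CODE = re.compile(r"[a-z]{2}")


def valid_filtering_row(row):
    """row = (microblog_id, score); the score must lie between 0 and 1."""
    _microblog_id, score = row
    return isinstance(score, (int, float)) and 0.0 <= score <= 1.0


def valid_language_row(row):
    """row = (microblog_id, lang1, lang2, lang3).

    Assumption: the main language (first field) is required and unused
    trailing fields are None.
    """
    _microblog_id, *langs = row
    if len(langs) != 3 or langs[0] is None:
        return False
    return all(
        code is None or TWO_LETTER_CODE.fullmatch(code) is not None
        for code in langs
    )


def truncate_summary(text, max_words=120):
    """Keep at most max_words words, a word being a run of non-space characters."""
    return " ".join(text.split()[:max_words])


if __name__ == "__main__":
    print(valid_filtering_row(("123", 0.87)))
    print(valid_language_row(("123", "fr", "en", None)))
    print(len(truncate_summary("word " * 200).split()))
```

Running such checks locally before using the web interface can catch out-of-range scores or malformed language codes early; the rules actually enforced by the submission server may differ.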

Articles in this section

  • Wikipedia XML corpus for summary generation

    by Eric SanJuan

    Wikipedia is under Creative Commons license, and its contents can be used to contextualize tweets or to build complex queries referring to Wikipedia entities.
    We have extracted an average of 10 million XML documents from Wikipedia per year since 2012 in the four main Twitter languages: en, (...)