1 - Content Analysis

Organisers: IRIT, Université de Montréal, LISIS


Given a stream of microblogs, the content analysis task consists of:

  1. filtering microblogs dealing with festivals;
  2. language(s) identification;
  3. event localization;
  4. author categorization (official account, participant, follower or scam);
  5. Wikipedia entity recognition and translation into four target languages: English, Spanish, Portuguese and French;
  6. automatic summarization of linked Wikipedia pages in the four target languages.

Each item will be evaluated independently; however, language identification could affect Wikipedia linking and the resulting summaries.

A login is required to access the data. Once registered on CLEF, each registered team can obtain up to 4 extra individual logins by writing to admin@talne.eu.



Each individual participant can submit only one run per sub-task, so up to 5 runs per team. Submissions will be uploaded to a MySQL server through a web interface.

The expected format for each subtask is a table whose primary key is the micro-blog id, followed by some extra fields.

  1. filtering: one extra field with a normalized score between 0 and 1, 1 being the maximal score for a micro-blog surely related to a specific festival event.
  2. language(s) identification: three extra fields containing two-letter ISO 639-1 language codes, the first field for the main language and the last field for a subsidiary or less probable language.
  3. event localization: five extra fields for a ranked list of cities (IATA codes) related to the micro-blog.
  4. author categorization: one extra field with one of the categories ’official’, when the micro-blog has been posted by the organizers of the festival, a media outlet broadcasting the event or an invited artist; ’participant’, when it has been posted by a non-official individual in the audience; ’follower’, for individuals following the festival but not taking part in it; and ’scam’, for scams or trolls.
  5. entity recognition: one table per target language, each one with 10 extra fields corresponding to a ranked list of Wikipedia entries (page titles) related to the micro-blog. The list is ranked by decreasing relevance. Participants can submit fewer than four languages.
  6. automatic summarization of linked Wikipedia pages in every language: one table per language, one extra field with a short summary of 120 words (a word being a sequence of characters separated by spaces).
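
As an illustration of the formats above, the following Python sketch builds rows for sub-tasks 1 to 4 and caps a summary at 120 words before upload. It is only a sketch under stated assumptions: the function names are hypothetical, unused rank fields are padded with None (the description does not say how to fill them), codes are checked for shape only (two letters, three letters) and not against the ISO or IATA registries, and the label ’scam’ is assumed to be the exact value expected for the fourth category.

```python
import re

# Assumed labels; the exact strings expected by the server are not confirmed.
AUTHOR_CATEGORIES = {"official", "participant", "follower", "scam"}


def _ranked(values, size, pattern, label):
    """Check 1..size ranked codes against a shape pattern, pad with None."""
    if not 1 <= len(values) <= size:
        raise ValueError(f"expected 1 to {size} {label} codes")
    for value in values:
        if not re.fullmatch(pattern, value):
            raise ValueError(f"malformed {label} code: {value!r}")
    # Padding with None is an assumption, not part of the task description.
    return tuple(values) + (None,) * (size - len(values))


def filtering_row(microblog_id, score):
    """Sub-task 1: one score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    return (microblog_id, float(score))


def language_row(microblog_id, languages):
    """Sub-task 2: up to three two-letter codes, main language first."""
    codes = [code.lower() for code in languages]
    return (microblog_id,) + _ranked(codes, 3, r"[a-z]{2}", "ISO 639-1")


def localization_row(microblog_id, cities):
    """Sub-task 3: up to five three-letter city codes, best match first."""
    codes = [code.upper() for code in cities]
    return (microblog_id,) + _ranked(codes, 5, r"[A-Z]{3}", "IATA")


def author_row(microblog_id, category):
    """Sub-task 4: exactly one author category."""
    if category not in AUTHOR_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return (microblog_id, category)


def truncate_summary(text, max_words=120):
    """Sub-task 6: keep at most max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])
```

For example, language_row("42", ["FR", "en"]) returns ("42", "fr", "en", None). Note that truncate_summary splits on any whitespace, which is slightly broader than the space-only definition of a word given above.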


  • CLEF 2017 Microblog Cultural Contextualization overviews in Dublin

    Labs 4, 13:45-15:45, CMC, room 5039

    Content analysis and Microblog Search:

    1. Detailed overview
    2. participant presentations
    3. discussion towards Cultural Image Queries over Social Media.

    Labs 5, 16:45-18:15, CMC, room 5039

    Time Line Illustration:

    1. Detailed overview
    2. evaluation material release
    3. discussion towards dealing with Language Dialects and Varieties in Mining and Search over Cultural Social Media posts.

    View online : CLEF 2017 program

  • Topics released for task 3

    Topics are given in the file clef_mc2_task3_topics.xml

    They are extracted from 4 festival programs (see readme file): Vieilles Charrues 2015,
    Transmusicales 2015, Avignon 2016, Edinburgh 2016.

  • Topics released for tasks 1 and 2

    Topics have been released for tasks 1 and 2.

    A login is required to access the data. Once registered on CLEF, each registered team can obtain up to 4 extra individual logins by writing to admin@talne.eu.

    The complete stream of 70 000 000 microblogs is available for registered participants.
    An Indri index with a web interface and an online API are available to query the whole set of microblogs.
