MC2 2018 Lab

Multilingual Cultural Mining and Retrieval

Available resources CLEF 2018: detailed description

Wednesday 14 March 2018, by Malek Hajjem

The festival galleries dataset

A massive collection of microblogs and URLs related to cultural festivals is provided to registered participants here.
To make such a large dataset easier to handle, we provide it in several formats:

  • A CSV format: a tab-separated CSV file, useful for managing the dataset with a MySQL database or the Python programming language.
  • An XML format for Indri: this format can be indexed directly with Indri if needed. Some metadata (see the description below) is provided along with the tweet text. Note that the XML files are grouped by author.
<!ELEMENT xml (f, m)+>
<!ELEMENT f (#user_id)>
<!ELEMENT m (i, u, l, c, d, t)>
<!ELEMENT i (#microblog_id)>
<!ELEMENT u (#user)>
<!ELEMENT l (#ISO_language_code)>
<!ELEMENT c (#client)>
<!ELEMENT d (#date)>
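
As an illustration of the structure described above, the following minimal Python sketch reads such a file with the standard library. It is not part of the lab's tooling. The sample document, the field names in FIELDS and the reading of the "t" element as the tweet text are assumptions made for the example; the real files may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that follows the element layout described above.
SAMPLE = """<xml>
<f>12345</f>
<m><i>1</i><u>alice</u><l>en</l><c>web</c><d>2015-05-01</d><t>Great jazz tonight</t></m>
<m><i>2</i><u>alice</u><l>fr</l><c>mobile</c><d>2015-05-02</d><t>Festival demain</t></m>
</xml>"""

# Readable names for the one-letter elements (assumed; "t" is taken to be the text).
FIELDS = {
    "i": "microblog_id",
    "u": "user",
    "l": "language",
    "c": "client",
    "d": "date",
    "t": "text",
}


def parse_microblogs(xml_text):
    """Return one dict per <m> element, tagged with the preceding <f> author id."""
    root = ET.fromstring(xml_text)
    author_id = None
    records = []
    for child in root:
        if child.tag == "f":
            # An <f> element announces the author of the <m> elements that follow.
            author_id = (child.text or "").strip()
        elif child.tag == "m":
            record = {"author_id": author_id}
            for tag, name in FIELDS.items():
                record[name] = (child.findtext(tag) or "").strip()
            records.append(record)
    return records


records = parse_microblogs(SAMPLE)
print(len(records), records[0]["text"])
```

Each record carries the author id of the most recent f element, which matches the grouping of the files by author.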

The festival galleries dataset is available either in part or in full. In the partial format, each CSV file contains the tweets gathered during one month. Original tweets are kept separate from reposts so that the files stay smaller.

      Originals          Reposts

 1-   2015-05 (72M)      2015-05 (54M)
 2-   2015-06 (235M)     2015-06 (190M)
 3-   2015-07 (220M)     2015-07 (162M)
 ...  ...                ...
18-   2016-10 (102M)     2016-10 (148M)
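
Because the monthly files are large, it can help to stream them row by row instead of loading them whole. The sketch below shows one way to do this in Python. The column layout is not specified on this page, so rows are returned as plain lists, and the two-line sample is hypothetical.

```python
import csv
import io


def iter_rows(handle):
    """Yield each tab-separated row as a list of strings, skipping empty lines."""
    # QUOTE_NONE keeps quote characters inside tweets as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


# Hypothetical sample; the real column order and count are an assumption here.
sample = io.StringIO("1\talice\ten\tJazz tonight\n2\tbob\tfr\tFestival demain\n")
rows = list(iter_rows(sample))
print(len(rows), rows[0][1])

# For a real file, pass an open handle instead, for example:
#   with open(path, encoding="utf-8", newline="") as handle:
#       for row in iter_rows(handle):
#           ...
```

Processing one row at a time keeps memory use flat regardless of the file size.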
  • HTML form to test queries: this form lets you test the microblog search baseline system using an Indri query.
    *Simple queries:
    For a basic query, just type in the terms you wish to search for. Each term is weighted equally and the terms are combined in an "or" fashion.
    -  hiphop jazz

      #combine(hiphop jazz)

    *Phrase Matching:
    To search for a specific phrase (e.g. "hiphop jazz"), wrap your terms in the ordered window operator #N, where N is the window size in terms. With N = 1 the terms must be adjacent.

     #1(hiphop jazz)

    The search results will contain only those documents in which "hiphop" is immediately followed by "jazz".

    *Unordered Windows
    The #uwN operator matches terms that occur within a window of N terms, in any order.

    For example, to look for the terms "hiphop" and "jazz" occurring within a window of 2 terms, regardless of whether "hiphop" comes before "jazz" or after it, write:

       #uw2(hiphop jazz)

    *Boolean Searches

    By default, Indri returns a document if any of the terms occurs in it; documents that contain more of the terms are generally ranked above documents that contain fewer. To require that all of your search terms be present, use the "boolean and" operator (#band). For example, to ensure that the terms "hiphop" and "jazz" both occur, use:

       #band(hiphop jazz)
  • Perl API used to query the web service locally with a suitable query in the Indri query language
  • Indri parameter files: a parameter file in XML format, useful for reindexing the collection with Indri
  • Compressed Indri indexes per month
  • Programs to generate XML repositories from CSV ordered data
  • Root of Indri indexes and data
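
The four query forms described for the HTML test form can also be assembled programmatically before being submitted. The helper functions below are a hypothetical convenience, not part of the lab's API; they only build the query strings shown in the examples and do not contact any service.

```python
def _wrap(operator, terms):
    """Format an Indri operator applied to a list of terms."""
    return "#" + operator + "(" + " ".join(terms) + ")"


def combine(*terms):
    """Simple query: equally weighted terms."""
    return _wrap("combine", terms)


def ordered_window(n, *terms):
    """Phrase match: terms in order within a window of n."""
    return _wrap(str(n), terms)


def unordered_window(n, *terms):
    """Terms in any order within a window of n."""
    return _wrap("uw" + str(n), terms)


def boolean_and(*terms):
    """Require every term to be present."""
    return _wrap("band", terms)


print(combine("hiphop", "jazz"))
print(ordered_window(1, "hiphop", "jazz"))
print(unordered_window(2, "hiphop", "jazz"))
print(boolean_and("hiphop", "jazz"))
```

Because each helper returns a plain string, the results can be nested, for example by passing ordered_window(1, "hiphop", "jazz") as one of the arguments to combine.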


An uncompressed list of tweet URLs is available to participants in CSV format. This metadata can be used to explore the tweet content further.