MC2 2018 Lab

Multilingual Cultural Mining and Retrieval

3-Dialectal Focus Retrieval

Objective

This task aims to automatically distinguish between Modern Standard Arabic (MSA) and Dialectal Arabic (DA) content in the Arabic language. We focus mainly on short texts.
The idea is to use unsupervised and supervised approaches to detect Arabic dialectal content in data collected from Twitter.
The main goal is to identify whether a whole tweet is mostly MSA or DA.

Task description

Use case
Given a set of Arabic tweets T provided to registered participants, determine for each tweet in T whether or not it contains dialectal content.
At a more fine-grained level, runs could also indicate in which dialect each tweet was written.

Dataset

We will provide registered participants with a training set and a test set covering MSA and 4 different language varieties called regional dialects.
First, we give access to a training set of tweets in CSV format.
Each instance of the training set was extracted from the Arabic Spring Tweet corpus.
Tweets are tagged with the predominant dialect (DA) or MSA in which they were written.
Annotation was carried out by a native Arabic speaker.
After the training phase, we will release a test set containing unlabeled instances of each language variety, to be classified according to their language variety.
The language varieties are listed below:

  • MSA: The standardized variety of Arabic used in most formal speech.
  • GLF: A variety of Arabic spoken in Eastern Arabia.
  • EGY: A contemporary variety of Arabic spoken mainly in Egypt.
  • MGR: A contemporary variety of Arabic spoken mainly in North Africa.
  • LEV: A group of dialects spoken along the Eastern Mediterranean coast.

Evaluation

The planned official evaluation measures are weighted F1 score and accuracy.
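
As an illustration of the two measures, the following is a minimal Python sketch, not the official scoring script. It assumes that gold and predicted labels are given as two parallel lists of label strings (such as MSA, GLF, EGY, MGR, LEV), and that "weighted F1" means the per-label F1 averaged with weights proportional to the number of gold instances of each label. The function names are hypothetical.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of tweets whose predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def weighted_f1(gold, pred):
    """Per-label F1, averaged with weights equal to each label's gold support.

    Assumption: this is the usual "weighted" averaging; the organizers'
    exact definition is not given in the task description.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    support = Counter(gold)  # number of gold instances per label
    total = len(gold)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n / total
    return score


if __name__ == "__main__":
    gold = ["MSA", "MSA", "EGY", "GLF"]
    pred = ["MSA", "EGY", "EGY", "GLF"]
    print("accuracy:", accuracy(gold, pred))
    print("weighted F1:", weighted_f1(gold, pred))
```

Weighting by support means that frequent varieties contribute more to the final score than rare ones, which matters if the test set is unbalanced across the five labels.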

Result Submission

Each run must contain 3 fields:

  • N°: tweet number
  • Ranking measure: a float between 0 and 1:
    - 1 if the tweet is pure MSA
    - 0 if the tweet is DA
    - otherwise, a float reflecting the probability of MSA in the tweet, as assigned by the participant's system
  • Identified dialect: the Arabic variety in which the tweet was written, as identified by the participant's system
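
The three fields above could be written out as in the following Python sketch. The task description does not specify the file layout, so the comma separator, the absence of a header row, the four-decimal formatting and the function name write_run are all assumptions made for illustration only.

```python
import csv
import io


def write_run(rows, out):
    """Write one line per tweet: tweet number, MSA score, identified dialect.

    Assumptions (not stated by the organizers): comma-separated values,
    no header row, scores printed with four decimals.
    """
    writer = csv.writer(out, lineterminator="\n")
    for tweet_number, msa_score, dialect in rows:
        # The ranking measure must lie between 0 (DA) and 1 (pure MSA).
        if not 0.0 <= msa_score <= 1.0:
            raise ValueError(f"score out of range for tweet {tweet_number}")
        writer.writerow([tweet_number, f"{msa_score:.4f}", dialect])


if __name__ == "__main__":
    # Hypothetical system output for three tweets.
    run = [(1, 1.0, "MSA"), (2, 0.12, "EGY"), (3, 0.0, "LEV")]
    buffer = io.StringIO()
    write_run(run, buffer)
    print(buffer.getvalue(), end="")
```

Checking the score range before writing catches malformed system output early, before the run is submitted.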

Schedule

  • Registration closes: 30 April 2018
  • End Evaluation Cycle: 26 May 2018
  • Submission of Participant Papers [CEUR-WS]: 02 June 2018
  • Notification of Acceptance Participant Papers: 15 June 2018
  • Camera Ready Copy of Participant Papers: 29 June 2018
  • CLEF 2018 Conference: 10-14 September 2018

Task organizers (contact: malek.hajjem@univ-avignon.fr)

  • Malek Hajjem (LIA, Avignon)
  • Fatiha Sadat (UQAM, Montréal)
  • Juan-Manuel Torres (LIA, Avignon)