MC2 2018 Lab

Multilingual Cultural Mining and Retrieval

Content Analysis Results: Language identification 2017

Thursday 15 March 2018, by Malek Hajjem

Results

  • Topics are a random selection of original microblogs posted in June 2016, without external links and with more than 80 characters.
  • Submissions and scores for the two best teams can be found here: Syllabs and Lia.
  • The task paper can be found here:
@inproceedings{DBLP:conf/clef/ErmakovaMS17,
 author    = {Liana Ermakova and
              Josiane Mothe and
              Eric SanJuan},
 title     = {{CLEF} 2017 Microblog Cultural Contextualization Content Analysis
              task Overview},
 booktitle = {Working Notes of {CLEF} 2017 - Conference and Labs of the Evaluation
              Forum, Dublin, Ireland, September 11-14, 2017.},
 year      = {2017},
 crossref  = {DBLP:conf/clef/2017w},
 url       = {http://ceur-ws.org/Vol-1866/invited_paper_14.pdf},
 timestamp = {Thu, 16 Nov 2017 14:36:59 +0100},
 biburl    = {https://dblp.org/rec/bib/conf/clef/ErmakovaMS17},
 bibsource = {dblp computer science bibliography, https://dblp.org}
}

Evaluation process

The evaluation process assesses the reliability of the language information provided by Twitter.
Tweet objects have a long list of ‘root-level’ attributes, including fundamental attributes such as "lang". When present, this attribute holds a BCP 47 language identifier corresponding to the machine-detected language of the microblog text. The machine-detected language may of course differ from the actual language of the microblog.
Scores in this evaluation are assigned by a human expert. Only the tweets for which a participant’s language detection system disagreed with the tweet’s "lang" attribute were examined. Tweets written in several languages receive a graduated score describing how much each language is present in them.
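
The selection step described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the procedure actually used by the organizers: the function names `select_for_review` and `graded_score`, the layout of the `predictions` dictionary, and the representation of a human annotation as language shares summing to 1 are all hypothetical. Only the "lang" and "id_str" attributes come from the Tweet object itself. Tweets without a "lang" attribute are skipped here, which is also an assumption.

```python
from typing import Dict, List


def select_for_review(tweets: List[dict], predictions: Dict[str, str]) -> List[dict]:
    """Keep the tweets whose predicted language differs from Twitter's "lang".

    `predictions` maps a tweet's "id_str" to the BCP 47 tag returned by a
    participant's detector (hypothetical layout, chosen for this sketch).
    """
    selected = []
    for tweet in tweets:
        twitter_lang = tweet.get("lang")  # the attribute is not always present
        predicted = predictions.get(tweet["id_str"])
        if twitter_lang is None or predicted is None:
            continue  # assumption: nothing to compare, so not reviewed
        # BCP 47 tags are case-insensitive, so compare them in lower case.
        if predicted.lower() != twitter_lang.lower():
            selected.append(tweet)
    return selected


def graded_score(annotation: Dict[str, float], predicted: str) -> float:
    """Score a prediction against a human annotation of a multilingual tweet.

    `annotation` maps each language tag to the share of the tweet written
    in that language (hypothetical representation of the graduated score).
    """
    shares = {lang.lower(): share for lang, share in annotation.items()}
    return shares.get(predicted.lower(), 0.0)


if __name__ == "__main__":
    tweets = [
        {"id_str": "1", "lang": "en", "text": "..."},
        {"id_str": "2", "lang": "fr", "text": "..."},
        {"id_str": "3", "text": "..."},  # no "lang" attribute
    ]
    predictions = {"1": "en", "2": "es", "3": "pt"}
    for tweet in select_for_review(tweets, predictions):
        print(tweet["id_str"], tweet["lang"], predictions[tweet["id_str"]])
    print(graded_score({"fr": 0.7, "en": 0.3}, "fr"))
```

With this representation, a detector that answers "fr" for a tweet annotated as mostly French with some English would receive partial rather than full credit; the actual weighting applied by the human expert is not specified on this page.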