2016 CMC workshop

Microblog Data Set

by Eric SanJuan

The document collection provided by the GAFES project consists of a pool of more than 70M unique microblogs from different sources, with their meta-information and expanded URLs, stored on a MySQL server. For legal reasons, access to this database is restricted to registered participants under a privacy agreement.

Along with the microblog corpus, a clean, simplified XML dump of Wikipedia that is easy to index and to process with state-of-the-art NLP tools is made available to participants. The ground truth material is the following: