A Framework based on Random Forests and non-uniform feature sampling

Claudia Meda, Edoardo Ragusa, Christian Gianoglio, Rodolfo Zunino,
A. Ottaviano, E. Scillia, R. Surlinelli

This work presents a non-uniform feature sampling scheme, built on the analyst's decisions, for the Random Forests machine learning algorithm. In recent years, many studies have been published on the effectiveness of Random Forests for spam classification, but none has taken advantage of human skills to improve automated classification and overcome the black-box problem.
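To make the idea of non-uniform feature sampling concrete, the sketch below draws a feature subset with probability proportional to per-feature weights instead of uniformly. It is only an illustration under stated assumptions, not the authors' implementation: the function name `sample_features`, the weight values, and the subset size are all hypothetical, and the way the paper actually turns the analyst's decisions into sampling probabilities may differ.

```python
import random


def sample_features(weights, k, rng):
    """Draw k distinct feature indices, each draw proportional to its weight.

    Hypothetical helper: weights stand in for the analyst's preferences.
    A feature with weight 0 is never selected.
    """
    if k > sum(1 for w in weights if w > 0):
        raise ValueError("k exceeds the number of features with positive weight")
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        # Weighted draw among the features not yet selected.
        current = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=current, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


# Assumed setup: 54 features, with the first 5 favoured by the analyst.
N_FEATURES = 54
analyst_weights = [3.0] * 5 + [1.0] * (N_FEATURES - 5)

# k = 7 is roughly sqrt(54), a common subset size for classification forests.
rng = random.Random(0)
subset = sample_features(analyst_weights, k=7, rng=rng)
print(sorted(subset))
```

In a forest, a draw like this would replace the uniform feature selection performed for each tree or split, so that features the analyst considers relevant are examined more often while the others still have a chance to be selected.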

Another contribution of this research is a dataset intended as a benchmark for automatic spammer detection tools.


The dataset is made up of 1161 rows (users) and 54 columns (features): 802 are legitimate users (label 0) and 359 are spammers (label 1). See the paper and the table of features for details.
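The class counts above imply an imbalanced benchmark, which matters when reading accuracy figures. The short sketch below rebuilds a label vector from the stated counts only (no real data is loaded) and reports the class shares and the accuracy of a trivial classifier that always predicts the majority class. The helper name `class_summary` is hypothetical.

```python
from collections import Counter

N_LEGITIMATE = 802  # label 0, as stated in the dataset description
N_SPAMMERS = 359    # label 1, as stated in the dataset description


def class_summary(labels):
    """Return per-class counts, per-class shares, and the majority-class share."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    majority_baseline = max(shares.values())
    return counts, shares, majority_baseline


# Synthetic label vector built from the counts alone, for illustration.
labels = [0] * N_LEGITIMATE + [1] * N_SPAMMERS
counts, shares, majority_baseline = class_summary(labels)

print(f"rows: {len(labels)}")
print(f"spammer share: {shares[1]:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any classifier evaluated on this dataset should be compared against that majority-class baseline, and class-aware metrics such as per-class precision and recall are likely to be more informative than accuracy alone.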

For any questions about the article or the dataset, please write to one of the authors.

This page makes available the dataset collected and used by the authors for the experiments. All the material is packed into a password-protected zip file. Please contact Claudia Meda (claudia.meda@edu.unige.it), Edoardo Ragusa (edoardo.ragusa@edu.unige.it) or Christian Gianoglio (christian.gianoglio@edu.unige.it) to get the password.