Context
Most previous studies have focused on detecting fake news in English, owing to the well-known, openly available annotated fake news corpora and the variety of fact-checkers around the world, while less-resourced languages such as Kurdish have been left behind. Although Kurdish is spoken by more than 30 million people worldwide, it is considered less-resourced in the Natural Language Processing (NLP) domain because NLP tools are inaccessible and labeled corpora are scarce or unavailable. This repository hosts a fake news dataset for a research project at the College of Informatics, Sulaimani Polytechnic University, Iraq.
The accompanying paper gives full details about data collection, pre-processing, and the classifiers used on this dataset.
Content
Our dataset consists of 3 sets of news articles on different subjects, crawled from Facebook pages, in the Kurdish language only. Each article is labeled 0 (fake) or 1 (credible).
The dataset consists of:
- 5000 articles labeled as true
- 5000 articles labeled as false
- 5000 articles automatically modified from the real articles to create fake news
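As a quick sanity check after downloading, the label distribution can be tallied with a few lines of Python. This is only a minimal sketch: it assumes the data is available as a CSV file with a header row, and the column names `text` and `label` are hypothetical, so adjust them to the actual file. The only detail taken from the description above is the encoding 0 = fake, 1 = credible.

```python
import csv
import io
from collections import Counter

# Label encoding as stated in the dataset description.
LABEL_NAMES = {0: "fake", 1: "credible"}


def count_labels(csv_file, label_column="label"):
    """Count how many rows carry each label name.

    `label_column` is a hypothetical column name; change it to match
    the header of the real file.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        label = int(row[label_column])
        counts[LABEL_NAMES[label]] += 1
    return counts


# Tiny in-memory stand-in for the real file, with placeholder text only.
sample = io.StringIO(
    "text,label\n"
    "placeholder article A,1\n"
    "placeholder article B,0\n"
    "placeholder article C,0\n"
)
print(count_labels(sample))
```

With the real data, the same function can be given an open file handle instead of the in-memory sample (opening it with `encoding="utf-8"` is an assumption about how the Kurdish text is stored), and the resulting counts can be compared against the composition listed above.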