Detect French Fake News
These notebooks scrape French news articles and train a camemBERT model to classify them as true or fake. More information can be found in the /doc/ folder (in French).
- Scraping TRUE news from French Newspapers using "gbolmier/newspaper-crawler" :
- Futura Sciences
- Liberation
- Telerama
- Le Monde
- Le Figaro
For Le Monde, it is necessary to modify the spider : newspaper_crawler/spiders/lemonde_spider.py
- issue #1 : fixing lemonde body scraping
- Scraping TRUE news using "scrapy" for :
- 20 Minutes
- Scraping FAKE news from French Parody Newspapers using "scrapy" :
- Le Gorafi
- NordPresse.be
- BuzzBeed.com
- Train camemBERT model
- Evaluate
- Compare to baseline
Files :
01_Scraping_French_newspaper_crawler.ipynb : This notebook scrapes French news from the RSS feeds of these newspapers :
- Figaro
- Futura Sciences
- Liberation
- Le Monde
Doesn't work for :
- Nouvel Obs
This notebook can be run several times to add new articles to the database : newspaper_db.
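Re-running a scraper usually returns articles that are already stored, so some form of de-duplication is needed before appending. The following is a minimal sketch of that idea, not the notebook's actual code: it assumes each article is a dict with a unique `url` field, and the helper name `merge_articles` is hypothetical. The real storage format of newspaper_db is not described here.

```python
def merge_articles(existing, scraped):
    """Return the existing articles plus any scraped article whose URL is new.

    Assumption: every article is a dict with a unique "url" key.
    """
    seen = {article["url"] for article in existing}
    merged = list(existing)
    for article in scraped:
        if article["url"] not in seen:
            seen.add(article["url"])
            merged.append(article)
    return merged


# Made-up example data: one article is already stored, one is new.
stored = [{"url": "https://example.com/a1", "title": "Premier article"}]
fresh = [
    {"url": "https://example.com/a1", "title": "Premier article"},
    {"url": "https://example.com/a2", "title": "Second article"},
]
updated = merge_articles(stored, fresh)
```

Keying on the URL keeps the merge independent of the order in which articles are scraped.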
02_Scraping_French_News.ipynb : This notebook uses scrapy classes to retrieve the latest news content from :
- Le Gorafi (societe, politique)
- Nord Presse.be (France)
- BuzzBeed
- 20 Minutes
Two possible sources :
- page links with next-page navigation (best)
- RSS feeds (can be used to update the data)
To use only RSS :
- run only the RSS parts, then finish with the Export parts.
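To illustrate the RSS route, here is a small sketch that extracts the title, link and description of each item in an RSS 2.0 feed. It uses only the Python standard library rather than scrapy, so it is not the notebook's implementation; the sample feed, the example.com URLs and the function name `parse_rss_items` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch runs offline.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Premier article</title>
      <link>https://example.com/a1</link>
      <description>Résumé du premier article.</description>
    </item>
    <item>
      <title>Second article</title>
      <link>https://example.com/a2</link>
      <description>Résumé du second article.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict (title, link, description) per <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the child element is missing
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items


articles = parse_rss_items(SAMPLE_RSS)
```

A feed typically lists only the most recent items, which is consistent with using it to update an existing dataset rather than to build one from scratch.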
03_Train_evaluate_camemBERT.ipynb : French Fake News Detection model with camemBERT. This notebook contains :
- Exploration of news
- Preparation of input data
- Training camemBERT Sequence Classification (using "simpletransformers")
- Evaluation
- Works on Google Colab
- On Colab, select GPU as the runtime type
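The training step can be sketched roughly as follows with "simpletransformers". This is an outline under assumptions, not the notebook's code: the label convention (0 = true news, 1 = fake news), the helper names `make_training_rows` and `train_camembert`, and the training arguments are all illustrative. The heavy imports are kept inside the training function so the data-preparation helper can be run on its own.

```python
def make_training_rows(true_texts, fake_texts):
    """Build [text, label] rows. Assumed convention: 0 = true news, 1 = fake news."""
    return [[text, 0] for text in true_texts] + [[text, 1] for text in fake_texts]


def train_camembert(rows, use_cuda=True):
    """Fine-tune camembert-base as a binary sequence classifier.

    Requires pandas and simpletransformers; downloads the pretrained weights.
    """
    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    # simpletransformers expects the columns "text" and "labels"
    train_df = pd.DataFrame(rows, columns=["text", "labels"])
    model = ClassificationModel(
        "camembert",
        "camembert-base",
        num_labels=2,
        # Illustrative settings only, not the values used in the notebook
        args={"num_train_epochs": 1, "overwrite_output_dir": True},
        use_cuda=use_cuda,
    )
    model.train_model(train_df)
    return model


# Made-up example rows; call train_camembert(rows) on a GPU runtime to train.
rows = make_training_rows(
    ["Le Parlement a adopté la loi."],
    ["Un chat élu maire de Paris."],
)
```

Passing `use_cuda=False` allows a quick CPU-only check, though fine-tuning is much slower without a GPU.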
04_Train_evaluate_baseline.ipynb : French Fake News Detection baseline model. This notebook contains :
- Preparation of input data (TF-IDF)
- Training the baseline classifier (using "LogisticRegression")
- Evaluation
- Works on Google Colab
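A TF-IDF plus logistic regression baseline of the kind listed above can be sketched with scikit-learn as follows. The tiny corpus, the label convention (0 = true news, 1 = fake news) and the hyperparameters are assumptions made for the example, not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up corpus. Assumed convention: 0 = true news, 1 = fake news.
texts = [
    "Le Parlement a adopté la loi de finances hier soir.",
    "La banque centrale maintient ses taux directeurs.",
    "Un chat élu maire de Paris après un débat houleux.",
    "Il découvre que son patron est en réalité trois enfants dans un manteau.",
]
labels = [0, 0, 1, 1]

# TF-IDF turns each text into a sparse weighted bag-of-words vector,
# which the logistic regression then classifies.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)

# Predicting on the training texts only checks that the pipeline runs;
# a real evaluation needs a held-out test set.
predictions = baseline.predict(texts)
```

Wrapping both steps in one pipeline ensures the vocabulary is fitted on the training data only, which matters once a train/test split is introduced.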