Preprocessing the text data scraped using Fed-Scraper.
1. hidden_vars.py
Add a file hidden_vars.py
containing a variable DATA_DIR
equal to a path of the directory containing the csv
file produced by Fed-Scraper. This can be obtained by running the web scraper or simply downloading the dataset from kaggle.
Create configurations using the preprocessing rules found in preprocessing_rules.py
and add them to the CONFIGS
list.
Note: some rules take as input a:
sentence
- a string containing a sentence, while others takeword_list
- a list of strings which are each words from a sentence.
Note: you must run create_n_grams.py
before using the n_gram_creation
rule.
Run process_dataset.py
to execute each of the configurations on the dataset, saving the results in the directory PREPROCESSED_DATA_DIR
specified in global_vars.py
.