Skip to content

fina-biko/nlp_hate_classification

Repository files navigation

THIS IS AN OUTLINE OF THE STEPS BY STEPS WORKFKOW OF THE PROJECT FROM START TO FINISH.

NLP PROJECT PIPELINE:

DATA ACQUISITION

-Available  data(cvs, txt,pdf)
-other data(Database;Internet,API data, scraped data,)
-No data(create your own data  ; use LLM to make data; issue out surveys then perform dat
a augmentation  such as BIagram flip, BAck Translate, Add additonal  noise, replace with synonymns incase of less data)

TEXT PREPARATION

GENERAL /BASIC PREPROCESSING

     Tokenization
     stop words removal
     stemming/ lemmatization
     remove punctuation
     convert to lower case

CONTEXT BASED / ADVANCED PREPROCESSINg

    POS tagging
    Parsing(parsing tree)
    Coreference reduction

FEATURE ENGINEERING

MODEL TRAINING

MDOEL EVALUATION

PROJECT FOLDERS AND FILES

PROJECT SETUP AND UTILITY MODULES (Flow chart of how we arrange the folders) :

(first)create the Project template.py that will  provide code for creating the folders , subfolders and files  according to the flowchart below instead of creating each manually
   

HATE classification(root folder)
        COMPONENTS folder
            data_ingestion.py
            data transformation.py
            model trainer.py
            model_evaluation.py
            model_pusher.py
       (fifth) CONFIGURATION folder
            s3 operations.py
        CONSTANTS folder
                artifact_entity.py
                config_entity.py
        EXCEPTIONS folder
        LOGGER folder
        PIPELINE folder
           train_pipeline.py
           Prediction_Pipeline.py
        UTILS  folder
          main_utils.py
        ML folder
          feature
          models
    main or app.py
    jupyuter_notebook.py
    requirement.txt
    Dockerfile
    setup.py   
    .dockerIgnore 
(second)setup the requiremnts
 cofigure GCP 
(third)Logging 
(fourth)Exception


(sixth)project workflow(flowchart of the project folders)
      (sixth)project notebook experiment jupyter done in Google colab;  how we can do the project on a jupyter notebook
       Data Preprocessing---handle missing values, encode categorical values, scale numerical features
       EDA
       feature engineering--- engineering new features based on the knowledge  and insights from the EDA
       Model building ---- train and evaluate several models  to ge the best
       Model evaluation ----evaluate the trained models so as to choose the best model
    (seventh)
   Update the constants folder
   update the cofigurations folder
   update the  artifacts folder
   update the components folder
   update the pipeline folder
   update the main file or the app.py

COMPREHENSIVE PACKAGES OVERVIEW

1. Data Ingestion Module

Class: DataIngestion

Method: get_data_from_source()

    Fetch data from APIs, databases, or remote servers.
    Handle authentication, error handling, and retries for robustness.
    Logging: Log every data retrieval step.

Method: read_csv(file_path)

    Load data from CSV or other file formats.
    Error Handling: Handle cases where the file doesn’t exist or is corrupted and provide useful error messages.
    Logging: Log the path of the file and the shape of the dataset.

Method: ingest_data()

     A final method that calls the other two methods to either fetch the data from the source or read from local files, based on user input or configuration.

2. Data Transformation Module

Class: DataTransformation

Method: drop_columns(columns_to_drop)

    Drop unnecessary columns that are not useful for the classification model.
    Perform checks to avoid dropping critical columns accidentally.
    Logging: Log the list of columns dropped.

Method: normalize_data()

    Normalize numerical features if applicable.
    Ensure compatibility with downstream processes (like embedding layers for text).
    Log the columns being normalized.

Method: lemmatize_or_stem(text_column)

    Provide an option to either stem or lemmatize the text.
    Log the choice
    Use libraries like nltk or spacy for the text processing.
                                def lemmatize_or_stem(self, df, text_column, method='lemmatize'):
                            try:
                                if method == 'lemmatize':
                                    lemmatizer = WordNetLemmatizer()
                                    df[text_column] = df[text_column].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))
                                    logging.info(f"Applied lemmatization on {text_column}")
                                elif method == 'stem':
                                    stemmer = PorterStemmer()
                                    df[text_column] = df[text_column].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
                                    logging.info(f"Applied stemming on {text_column}")
                                return df
                            except Exception as e:
                                logging.error(f"Error occurred during {method} on {text_column}")
                                raise CustomExcecptionError(e, f"Error in {method} method.")

3. Feature Engineering Class (within Transformation Module)

Class: FeatureEngineering

Method: tokenize(text_column)

Tokenize the text into words or subwords.
Handle different tokenization schemes (e.g., word-level or byte-pair encoding).
Log the number of tokens created.
                                        def tokenize(self, df, text_column):
                                    try:
                                        tokenizer = Tokenizer()
                                        tokenizer.fit_on_texts(df[text_column])
                                        logging.info(f"Tokenization completed. Number of unique tokens: {len(tokenizer.word_index)}")
                                        return tokenizer.texts_to_sequences(df[text_column])
                                    except Exception as e:
                                        logging.error("Error occurred during tokenization.")
                                        raise CustomExcecptionError(e, "Error in tokenization.")

Method: text_to_sequences(tokenized_texts)

Convert tokenized text into sequences of integers using a vocabulary.
Pad the sequences to ensure uniform length.
Log the shape of the resulting sequences.
                                    def text_to_sequences(self, tokenized_texts):
                                    try:
                                        padded_sequences = pad_sequences(tokenized_texts, maxlen=100)
                                        logging.info(f"Padded sequences with shape: {padded_sequences.shape}")
                                        return padded_sequences
                                    except Exception as e:
                                        logging.error("Error occurred during text-to-sequences conversion.")
                                        raise CustomExcecptionError(e, "Error in text-to-sequences conversion.")

Method: word_embedding(sequences)

    Apply pre-trained word embeddings like Word2Vec, GloVe, or train embeddings during model training.
    Ensure that the embeddings are aligned with the tokenization.
    Log whether a pre-trained embedding is used or trained from scratch.
                                        def word_embedding(self, sequences, embedding_matrix=None):
                                    try:
                                        if embedding_matrix is not None:
                                            logging.info("Using pre-trained word embeddings.")
                                        else:
                                            logging.info("Training word embeddings from scratch.")
                                        # Placeholder for word embedding logic
                                        return sequences  # Replace with embedding logic
                                    except Exception as e:
                                        logging.error("Error in applying word embeddings.")
                                        raise CustomExcecptionError(e, "Error in word embedding.")

Method: encode_target_variable(target_column)

    Encode labels or target variables for classification.
    Convert categorical labels to numerical format if necessary.
    Log the classes being encoded.
                                        def encode_target_variable(self, df, target_column):
                                    try:
                                        label_encoder = LabelEncoder()
                                        df[target_column] = label_encoder.fit_transform(df[target_column])
                                        logging.info(f"Encoded target column {target_column}. Classes: {label_encoder.classes_}")
                                        return df
                                    except Exception as e:
                                        logging.error("Error occurred during target variable encoding.")
                                        raise CustomExcecptionError(e, "Error in encoding target variable.")

4. Model Training Module

Class: NeuralNetworkModel

Method: build_model(embedding_matrix=None)

    Build a sequential or functional Keras/TensorFlow model for text classification.
    Include layers like embedding, LSTM/GRU, Conv1D, and Dense.
   Save the model architecture and summary.
                                        def build_model(self, embedding_matrix=None):
                                        try:
                                            model = Sequential()
                                            # Add layers: Embedding, LSTM, Dense, etc.
                                            logging.info("Neural network model built successfully.")
                                            model.summary(print_fn=logging.info)
                                            return model
                                        except Exception as e:
                                            logging.error("Error in building the neural network model.")
                                            raise CustomExcecptionError(e, "Error in building neural network model.")

Method: train_model(X_train, y_train)

    Compile and train the neural network model.
    Use appropriate loss functions (e.g., categorical_crossentropy) and metrics (e.g., accuracy).
    Log the training process and save the model checkpoints.
                                            def train_model(self, X_train, y_train, model):
                                        try:
                                            model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
                                            history = model.fit(X_train, y_train, epochs=10, validation_split=0.2)
                                            logging.info("Model training completed successfully.")
                                            model.save('model_checkpoint.h5')
                                            return history
                                        except Exception as e:
                                            logging.error("Error occurred during model training.")
                                            raise CustomExcecptionError(e, "Error in training model.")

Class: RandomForestModel

Method: train_model(X_train, y_train)

    Train a random forest classifier using scikit-learn.
    Tune hyperparameters using techniques like GridSearchCV.
    Log the best parameters and model metrics.
    
                                        def train_model(self, X_train, y_train):
                                    try:
                                        model = RandomForestClassifier()
                                        model.fit(X_train, y_train)
                                        logging.info("Random forest model trained successfully.")
                                        joblib.dump(model, 'random_forest_model.pkl')
                                        return model
                                    except Exception as e:
                                        logging.error("Error occurred during random forest training.")
                                        raise CustomExcecptionError(e, "Error in training random forest model.")

5. Evaluation Module (Optional)

Class: ModelEvaluation

Method: evaluate_model(X_test, y_test)

    Evaluate the trained models on test data.
    Use metrics like accuracy, precision, recall, F1-score, and confusion matrix.

Method: plot_metrics()

    Plot performance metrics such as ROC curve, precision-recall curve, and confusion matrix for visual interpretation.

About

A repository for twitter hate speech classification

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published