NLP PROJECT PIPELINE:
DATA ACQUISITION
-Available data (csv, txt, pdf)
-Other data (databases, internet, API data, scraped data)
-No data (create your own data; use an LLM to generate data; issue surveys, then perform data augmentation such as bigram flips, back translation, adding noise, or synonym replacement when data is scarce; a synonym-replacement sketch follows)
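As one concrete augmentation example, here is a minimal sketch of WordNet-based synonym replacement, assuming nltk is installed with the wordnet corpus downloaded (the function name and example sentence are illustrative):

import random
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet', quiet=True)  # one-time corpus download

def synonym_replace(sentence, n=1):
    """Replace up to n words that have WordNet synonyms."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        synonyms = {lemma.name().replace('_', ' ')
                    for syn in wordnet.synsets(words[i])
                    for lemma in syn.lemmas()}
        synonyms.discard(words[i])
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
    return ' '.join(words)

print(synonym_replace("this movie was really good"))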
TEXT PREPARATION
GENERAL / BASIC PREPROCESSING
Tokenization
Stop word removal
Stemming / lemmatization
Punctuation removal
Lowercasing
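A minimal sketch of these basic steps using nltk (assumes the punkt, stopwords, and wordnet resources; newer nltk versions may also need punkt_tab):

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for pkg in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(pkg, quiet=True)

def basic_preprocess(text):
    tokens = word_tokenize(text.lower())                          # lowercase + tokenize
    tokens = [t for t in tokens if t not in string.punctuation]   # remove punctuation
    stops = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stops]                # remove stop words
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]              # lemmatize

print(basic_preprocess("The cats are sitting on the mats!"))      # ['cat', 'sitting', 'mat']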
CONTEXT BASED / ADVANCED PREPROCESSING
POS tagging
Parsing (parse tree)
Coreference resolution
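POS tagging and parsing can be sketched with spaCy, assuming the en_core_web_sm model is installed; coreference resolution needs an extra pipeline component and is omitted here:

import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
doc = nlp("She deleted the comment because it was hateful.")

for token in doc:
    # pos_ is the coarse POS tag; dep_ and head describe the parse-tree edge
    print(f"{token.text:10} {token.pos_:6} {token.dep_:10} head={token.head.text}")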
FEATURE ENGINEERING
MODEL TRAINING
MODEL EVALUATION
PROJECT SETUP AND UTILITY MODULES (flowchart of how we arrange the folders):
(first) Create template.py, which provides code for creating the folders, subfolders, and files according to the flowchart below, instead of creating each one manually (a sketch of template.py follows the folder tree).
HATE classification (root folder)
COMPONENTS folder
data_ingestion.py
data_transformation.py
model_trainer.py
model_evaluation.py
model_pusher.py
(fifth) CONFIGURATION folder
s3_operations.py
CONSTANTS folder
artifact_entity.py
config_entity.py
EXCEPTIONS folder
LOGGER folder
PIPELINE folder
train_pipeline.py
prediction_pipeline.py
UTILS folder
main_utils.py
ML folder
feature
models
main.py or app.py (entry point)
jupyter_notebook.py
requirements.txt
Dockerfile
setup.py
.dockerignore
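A minimal sketch of template.py, assuming the tree above; the path list should mirror the final flowchart:

import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)

# Paths mirror the flowchart above; adjust names to match the final layout.
# Folders like exceptions and logger get an __init__.py so they import as packages.
FILE_PATHS = [
    "components/data_ingestion.py",
    "components/data_transformation.py",
    "components/model_trainer.py",
    "components/model_evaluation.py",
    "components/model_pusher.py",
    "configuration/s3_operations.py",
    "constants/artifact_entity.py",
    "constants/config_entity.py",
    "exceptions/__init__.py",
    "logger/__init__.py",
    "pipeline/train_pipeline.py",
    "pipeline/prediction_pipeline.py",
    "utils/main_utils.py",
    "ml/feature/__init__.py",
    "ml/models/__init__.py",
    "app.py",
    "requirements.txt",
    "Dockerfile",
    "setup.py",
    ".dockerignore",
]

for filepath in map(Path, FILE_PATHS):
    filepath.parent.mkdir(parents=True, exist_ok=True)  # create folders and subfolders
    if not filepath.exists():
        filepath.touch()                                # create the empty file
        logging.info("Created %s", filepath)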
(second) Set up the requirements
Configure GCP
(third) Logging
(fourth) Exception
(sixth) Project workflow (flowchart of the project folders)
Project notebook experiment (Jupyter, done in Google Colab): how we can do the project in a Jupyter notebook
Data preprocessing --- handle missing values, encode categorical values, scale numerical features
EDA
Feature engineering --- engineer new features based on the knowledge and insights from the EDA
Model building --- train several candidate models
Model evaluation --- evaluate the trained models so as to choose the best one
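A notebook-style sketch of those preprocessing steps (the file path and column names such as 'tweet' and 'label' are illustrative):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")  # hypothetical path

# Handle missing values: drop rows with no text, fill numeric gaps with the median
df = df.dropna(subset=["tweet"])
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Encode categorical labels as integer codes
df["label"] = df["label"].astype("category").cat.codes

# Scale numerical features to zero mean / unit variance
df[num_cols] = StandardScaler().fit_transform(df[num_cols])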
(seventh)
Update the constants folder
Update the configurations folder
Update the artifacts folder
Update the components folder
Update the pipeline folder
Update the main file (app.py)
Class: DataIngestion
Method: get_data_from_source()
Fetch data from APIs, databases, or remote servers.
Handle authentication, error handling, and retries for robustness.
Logging: Log every data retrieval step.
Method: read_csv(file_path)
Load data from CSV or other file formats.
Error Handling: Handle cases where the file doesn’t exist or is corrupted and provide useful error messages.
Logging: Log the path of the file and the shape of the dataset.
Method: ingest_data()
A final method that calls the other two methods to either fetch the data from the source or read from local files, based on user input or configuration.
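A hedged sketch of DataIngestion under these specs; the source URL, retry policy, and error types are assumptions:

import logging
import time
import pandas as pd
import requests

class DataIngestion:
    def __init__(self, source_url=None, retries=3):
        self.source_url = source_url   # hypothetical remote source
        self.retries = retries

    def get_data_from_source(self):
        """Fetch data from a remote source, retrying on failure."""
        for attempt in range(1, self.retries + 1):
            try:
                logging.info("Fetching data (attempt %d) from %s", attempt, self.source_url)
                response = requests.get(self.source_url, timeout=30)
                response.raise_for_status()
                return response.content
            except requests.RequestException as e:
                logging.warning("Attempt %d failed: %s", attempt, e)
                time.sleep(2 ** attempt)  # simple exponential backoff
        raise RuntimeError(f"Could not fetch data after {self.retries} attempts")

    def read_csv(self, file_path):
        """Load a local CSV and log its path and shape."""
        try:
            df = pd.read_csv(file_path)
            logging.info("Read %s with shape %s", file_path, df.shape)
            return df
        except FileNotFoundError:
            logging.error("File not found: %s", file_path)
            raise
        except pd.errors.ParserError:
            logging.error("File appears corrupted: %s", file_path)
            raise

    def ingest_data(self, file_path=None):
        """Read locally when a path is given, otherwise fetch from the source."""
        return self.read_csv(file_path) if file_path else self.get_data_from_source()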
Class: DataTransformation
Method: drop_columns(columns_to_drop)
Drop unnecessary columns that are not useful for the classification model.
Perform checks to avoid dropping critical columns accidentally.
Logging: Log the list of columns dropped.
Method: normalize_data()
Normalize numerical features if applicable.
Ensure compatibility with downstream processes (like embedding layers for text).
Log the columns being normalized.
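A hedged sketch of these two methods, assuming a pandas DataFrame and the project's logging setup (the critical-columns default is illustrative):

import logging

def drop_columns(self, df, columns_to_drop, critical_columns=('tweet', 'label')):
    """Drop unneeded columns, refusing to touch critical ones."""
    protected = [c for c in columns_to_drop if c in critical_columns]
    if protected:
        raise ValueError(f"Refusing to drop critical columns: {protected}")
    df = df.drop(columns=[c for c in columns_to_drop if c in df.columns])
    logging.info(f"Dropped columns: {columns_to_drop}")
    return df

def normalize_data(self, df):
    """Min-max scale numeric columns into [0, 1]."""
    num_cols = df.select_dtypes(include='number').columns
    rng = df[num_cols].max() - df[num_cols].min()
    df[num_cols] = (df[num_cols] - df[num_cols].min()) / rng.replace(0, 1)  # guard constant columns
    logging.info(f"Normalized columns: {list(num_cols)}")
    return df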
Method: lemmatize_or_stem(text_column)
Provide an option to either stem or lemmatize the text.
Log the choice of method.
Use libraries like nltk or spacy for the text processing.
# Required imports (logging config and CustomExceptionError come from the
# project's LOGGER and EXCEPTIONS modules)
import logging
from nltk.stem import WordNetLemmatizer, PorterStemmer

def lemmatize_or_stem(self, df, text_column, method='lemmatize'):
    try:
        if method == 'lemmatize':
            lemmatizer = WordNetLemmatizer()
            df[text_column] = df[text_column].apply(
                lambda x: ' '.join(lemmatizer.lemmatize(w) for w in x.split()))
            logging.info(f"Applied lemmatization on {text_column}")
        elif method == 'stem':
            stemmer = PorterStemmer()
            df[text_column] = df[text_column].apply(
                lambda x: ' '.join(stemmer.stem(w) for w in x.split()))
            logging.info(f"Applied stemming on {text_column}")
        else:
            raise ValueError(f"method must be 'lemmatize' or 'stem', got '{method}'")
        return df
    except Exception as e:
        logging.error(f"Error occurred during {method} on {text_column}")
        raise CustomExceptionError(e, f"Error in {method} method.")
Class: FeatureEngineering
Method: tokenize(text_column)
Tokenize the text into words or subwords.
Handle different tokenization schemes (e.g., word-level or byte-pair encoding).
Log the number of tokens created.
from tensorflow.keras.preprocessing.text import Tokenizer

def tokenize(self, df, text_column):
    try:
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(df[text_column])
        logging.info(f"Tokenization completed. Number of unique tokens: {len(tokenizer.word_index)}")
        # Consider also returning the fitted tokenizer so the prediction
        # pipeline can reuse the same vocabulary.
        return tokenizer.texts_to_sequences(df[text_column])
    except Exception as e:
        logging.error("Error occurred during tokenization.")
        raise CustomExceptionError(e, "Error in tokenization.")
Method: text_to_sequences(tokenized_texts)
Convert tokenized text into sequences of integers using a vocabulary.
Pad the sequences to ensure uniform length.
Log the shape of the resulting sequences.
from tensorflow.keras.preprocessing.sequence import pad_sequences

def text_to_sequences(self, tokenized_texts):
    try:
        # Pad (or truncate) every sequence to a fixed length of 100 tokens
        padded_sequences = pad_sequences(tokenized_texts, maxlen=100)
        logging.info(f"Padded sequences with shape: {padded_sequences.shape}")
        return padded_sequences
    except Exception as e:
        logging.error("Error occurred during text-to-sequences conversion.")
        raise CustomExceptionError(e, "Error in text-to-sequences conversion.")
Method: word_embedding(sequences)
Apply pre-trained word embeddings like Word2Vec, GloVe, or train embeddings during model training.
Ensure that the embeddings are aligned with the tokenization.
Log whether a pre-trained embedding is used or trained from scratch.
from tensorflow.keras.layers import Embedding

def word_embedding(self, vocab_size, embedding_matrix=None, embedding_dim=100):
    # vocab_size should be len(tokenizer.word_index) + 1
    try:
        if embedding_matrix is not None:
            logging.info("Using pre-trained word embeddings.")
            return Embedding(vocab_size, embedding_dim,
                             weights=[embedding_matrix], trainable=False)
        logging.info("Training word embeddings from scratch.")
        return Embedding(vocab_size, embedding_dim)
    except Exception as e:
        logging.error("Error in applying word embeddings.")
        raise CustomExceptionError(e, "Error in word embedding.")
Method: encode_target_variable(target_column)
Encode labels or target variables for classification.
Convert categorical labels to numerical format if necessary.
Log the classes being encoded.
from sklearn.preprocessing import LabelEncoder

def encode_target_variable(self, df, target_column):
    try:
        label_encoder = LabelEncoder()
        df[target_column] = label_encoder.fit_transform(df[target_column])
        logging.info(f"Encoded target column {target_column}. Classes: {label_encoder.classes_}")
        return df
    except Exception as e:
        logging.error("Error occurred during target variable encoding.")
        raise CustomExceptionError(e, "Error in encoding target variable.")
Class: NeuralNetworkModel
Method: build_model(embedding_matrix=None)
Build a sequential or functional Keras/TensorFlow model for text classification.
Include layers like embedding, LSTM/GRU, Conv1D, and Dense.
Save the model architecture and summary.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

def build_model(self, embedding_matrix=None, vocab_size=10000, embedding_dim=100, max_len=100):
    try:
        model = Sequential()
        model.add(Input(shape=(max_len,)))  # padded sequence length from above
        if embedding_matrix is not None:
            model.add(Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False))
        else:
            model.add(Embedding(vocab_size, embedding_dim))
        model.add(LSTM(64))
        model.add(Dense(1, activation='sigmoid'))  # binary hate / not-hate output
        logging.info("Neural network model built successfully.")
        model.summary(print_fn=logging.info)
        return model
    except Exception as e:
        logging.error("Error in building the neural network model.")
        raise CustomExceptionError(e, "Error in building neural network model.")
Method: train_model(X_train, y_train)
Compile and train the neural network model.
Use an appropriate loss function (binary_crossentropy for binary labels, categorical_crossentropy for multi-class) and metrics (e.g., accuracy).
Log the training process and save the model checkpoints.
from tensorflow.keras.callbacks import ModelCheckpoint

def train_model(self, X_train, y_train, model):
    try:
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        # Save the best weights seen during training (monitors val_loss by default)
        checkpoint = ModelCheckpoint('model_checkpoint.h5', save_best_only=True)
        history = model.fit(X_train, y_train, epochs=10, validation_split=0.2,
                            callbacks=[checkpoint])
        logging.info("Model training completed successfully.")
        return history
    except Exception as e:
        logging.error("Error occurred during model training.")
        raise CustomExceptionError(e, "Error in training model.")
Class: RandomForestModel
Method: train_model(X_train, y_train)
Train a random forest classifier using scikit-learn.
Tune hyperparameters using techniques like GridSearchCV.
Log the best parameters and model metrics.
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def train_model(self, X_train, y_train):
    try:
        # Small illustrative grid; widen it as compute allows
        param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 20]}
        search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, scoring='f1')
        search.fit(X_train, y_train)
        logging.info(f"Random forest trained. Best params: {search.best_params_}, "
                     f"best CV F1: {search.best_score_:.3f}")
        joblib.dump(search.best_estimator_, 'random_forest_model.pkl')
        return search.best_estimator_
    except Exception as e:
        logging.error("Error occurred during random forest training.")
        raise CustomExceptionError(e, "Error in training random forest model.")
Class: ModelEvaluation
Method: evaluate_model(X_test, y_test)
Evaluate the trained models on test data.
Use metrics like accuracy, precision, recall, F1-score, and confusion matrix.
Method: plot_metrics()
Plot performance metrics such as ROC curve, precision-recall curve, and confusion matrix for visual interpretation.
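A hedged sketch of ModelEvaluation for a scikit-learn-style classifier such as the random forest above (the Keras model would instead use model.evaluate plus manual plots); matplotlib and scikit-learn's Display helpers are assumed:

import logging
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, ConfusionMatrixDisplay,
                             RocCurveDisplay, PrecisionRecallDisplay)

class ModelEvaluation:
    def __init__(self, model):
        self.model = model

    def evaluate_model(self, X_test, y_test):
        """Compute accuracy, precision, recall, F1, and the confusion matrix."""
        y_pred = self.model.predict(X_test)
        metrics = {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred),
            'recall': recall_score(y_test, y_pred),
            'f1': f1_score(y_test, y_pred),
        }
        logging.info(f"Test metrics: {metrics}")
        logging.info(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred)}")
        return metrics

    def plot_metrics(self, X_test, y_test):
        """Plot ROC curve, precision-recall curve, and confusion matrix."""
        RocCurveDisplay.from_estimator(self.model, X_test, y_test)
        PrecisionRecallDisplay.from_estimator(self.model, X_test, y_test)
        ConfusionMatrixDisplay.from_estimator(self.model, X_test, y_test)
        plt.show()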