NORTHEASTERN UNIVERSITY—MS IN INFORMATION SYSTEMS
INFO 7610 – NATURAL LANGUAGE ENGINEERING METHODS AND TOOLS
AKSHAY SURESH BHOSALE
- NBC
- LEMMATIZATION, PREPROCESSING, VECTORIZATION
- MODELING
- LSTM FROM SCRATCH
- PREPROCESSING
- TOKENIZATION
- PADDING
- NEURAL NETWORK MODELING
- LSTM USING GOOGLE NEWS VECTORS
- INDEXING VECTORS
- PREPROCESSING
- TOKENIZATION
- PADDING & EMBEDDING
- NEURAL NETWORK MODELING
- BERT
- PREPROCESSING
- SEMANTIC DATA GENERATOR
- DISTRIBUTION STRATEGY SCOPE
- MODELING
- IMPORT DATASET
- EXPLORE DATASET
- DROP NULL VALUES
- PREPROCESSING THE TEXT
- LEMMATIZE
- REMOVE STOP WORDS
- CONCATENATE QUESTION 1 AND QUESTION 2
- VECTORIZING USING TF-IDF
- SPLIT DATASET INTO TRAIN-TEST
- MODEL TRAINING
- EVALUATE MODEL ON TEST SET (PIPELINE SKETCHED BELOW)
Lemmatized and removed stop words
Combined Question 1 & Question 2
Vectorized using TF-IDF
Accuracy on Test Set = 74.06%
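A minimal sketch of this pipeline, assuming NBC denotes a naive Bayes classifier (scikit-learn's MultinomialNB stands in here), the Kaggle train.csv columns question1/question2/is_duplicate, and an illustrative 80:20 split:

```python
import pandas as pd
from nltk.corpus import stopwords                      # needs nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer                # needs nltk.download("wordnet")
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Import dataset and drop rows with null questions.
df = pd.read_csv("train.csv").dropna(subset=["question1", "question2"])

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Lemmatize each token and remove English stop words.
    return " ".join(lemmatizer.lemmatize(w) for w in text.lower().split()
                    if w not in stop_words)

# Concatenate question 1 and question 2 into one text field.
combined = df["question1"].apply(preprocess) + " " + df["question2"].apply(preprocess)

# Vectorize with TF-IDF, split into train/test, train, and evaluate.
X = TfidfVectorizer().fit_transform(combined)
X_train, X_test, y_train, y_test = train_test_split(
    X, df["is_duplicate"], test_size=0.2, random_state=42)
clf = MultinomialNB().fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```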
- EXPLORE DATASET
- SHAPE, SIZE AND DUPLICATES
- FILL MISSING VALUES USING THE ‘fillna()’ METHOD
- DEFINE FUNCTION FOR TEXT PREPROCESSING (SKETCHED BELOW):
- REMOVE PUNCTUATION
- REMOVE STOP WORDS
- USE THE REGULAR EXPRESSIONS LIBRARY TO CLEAN UP MISSPELLED WORDS
- STEMMING OF TEXT
- TOKENIZATION: CONVERT TEXT TO TOKENS AND REPLACE THEM WITH INTEGER IDS
- PAD SEQUENCES USING KERAS’ PREDEFINED METHODS
- NEURAL NETWORK MODELING
- DEFINE LSTM MODEL
- TRAIN MODEL
- CHECK VALIDATION ACCURACY FOR OPTIMUM RESULT
- PREDICT ON TEST SET
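A sketch of the preprocessing function described above, under stated assumptions: PorterStemmer for the stemming step, a single contraction fix standing in for the regex-based misspelling cleanup, and Keras' Tokenizer/pad_sequences utilities for the last two steps:

```python
import re
import pandas as pd
from nltk.corpus import stopwords                      # needs nltk.download("stopwords")
from nltk.stem import PorterStemmer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_TEXT_LENGTH = 50  # maximum text length, per the slide

# Fill missing values using the 'fillna()' method.
df = pd.read_csv("train.csv")
df[["question1", "question2"]] = df[["question1", "question2"]].fillna("")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(text):
    text = text.lower()
    # Regex cleanup; one contraction fix stands in for the broader
    # misspelling normalization described on the slide.
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)           # remove punctuation
    # Remove stop words, then stem what remains.
    return " ".join(stemmer.stem(w) for w in text.split() if w not in stop_words)

corpus = (df["question1"] + " " + df["question2"]).apply(clean)

# Tokenization: replace words with integer ids, then pad the
# sequences to a fixed length with Keras' predefined utility.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)
padded = pad_sequences(sequences, maxlen=MAX_TEXT_LENGTH)
```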
- PRE-DEFINED VARIABLES USED:
- MAXIMUM TEXT LENGTH = 50
- MAXIMUM TOKEN LENGTH = MAX[MAX(TRAINING TOKENS), MAX(TESTING TOKENS)] USING ‘np.max’
- BATCH SIZE FOR NEURAL NETWORK TRAINING = 128
- EPOCHS = 16 (CHOSEN ARBITRARILY)
- TRAIN SIZE : VALIDATION SIZE = 82 : 18
[contd.]
- SEQUENTIAL MODEL (SKETCHED BELOW)
- EMBEDDING LAYER DIMENSIONS: (MAX_TOKEN + 1000, 32)
- DROPOUT AT 0.
- LSTM LAYER WITH OUTPUT SHAPE (NONE, 32)
- SECOND DROPOUT LAYER AT 0.
- DENSE LAYER USING SIGMOID ACTIVATION FUNCTION
- LOSS FUNCTION = BINARY CROSSENTROPY
- OPTIMIZER = ‘rmsprop’
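A sketch of this model. Here train_tokens/test_tokens, X_train and y_train are assumed names for the token sequences and labels from the preprocessing step, and the dropout rate (truncated on the slide) is assumed to be 0.2:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

# Maximum token id across training and testing tokens, via np.max.
max_token = np.max([np.max(train_tokens), np.max(test_tokens)])

model = Sequential([
    Embedding(max_token + 1000, 32),   # embedding dimensions: (max_token + 1000, 32)
    Dropout(0.2),                      # rate truncated on the slide; 0.2 assumed
    LSTM(32),                          # LSTM layer, output shape (None, 32)
    Dropout(0.2),                      # second dropout, same assumption
    Dense(1, activation="sigmoid"),    # dense layer with sigmoid activation
])
model.compile(loss="binary_crossentropy", optimizer="rmsprop",
              metrics=["accuracy"])

# Batch size 128, 16 epochs, 82:18 train/validation split.
model.fit(X_train, y_train, batch_size=128, epochs=16, validation_split=0.18)
```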
[contd.]
Accuracy on Validation Set: 78.67%
- PREDEFINE PARAMETERS
- INDEX WORD VECTORS
- TEXT PREPROCESSING
- REMOVE STOP WORDS
- CONVERT TEXT TO LOWERCASE
- REMOVE PUNCTUATION
- STEMMING OF TEXT
- TOKENIZATION: CONVERT TEXT TO SEQUENCES
- PAD SEQUENCES TO A FIXED SHAPE FOR THE NEURAL NETWORK TENSOR
- PREPARE EMBEDDING MATRIX (SKETCHED BELOW)
- SPLIT TRAINING SET INTO TRAIN AND VALIDATION SETS
- DEFINE NEURAL NETWORK
- SET UP EMBEDDING LAYER
- SET UP LSTM LAYER
- SET UP SEQUENTIAL LAYERS
- MERGE LAYERS
- CHECK WHETHER TO RE-WEIGHT CLASSES TO FIT THE 17.5% POSITIVE SHARE IN THE TEST SET
- COMPILE AND TRAIN THE MODEL
- CHECK VALIDATION ACCURACY
- PREDICT ON TEST SET
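A sketch of indexing the word vectors and preparing the embedding matrix, assuming gensim for loading the GoogleNews binary (linked in the references) and the fitted Keras tokenizer from the preprocessing step:

```python
import numpy as np
from gensim.models import KeyedVectors

EMBEDDING_DIM = 300  # GoogleNews vectors are 300-dimensional

# Index word vectors from the pretrained GoogleNews binary.
word2vec = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Prepare the embedding matrix: row i holds the pretrained vector for
# the word with id i in the tokenizer; words missing from GoogleNews
# stay as zero vectors.
num_words = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    if word in word2vec:
        embedding_matrix[i] = word2vec[word]
```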
- PRE-DEFINED VARIABLES USED:
- MAXIMUM TEXT LENGTH = 30
- EMBEDDING DIMENSION = 300
- BATCH SIZE FOR NEURAL NETWORK TRAINING = 2048
- EPOCHS = 27 (CHOSEN ARBITRARILY)
- LSTM LAYER UNITS DRAWN RANDOMLY FROM 175 TO 275
- DENSE LAYER UNITS DRAWN RANDOMLY FROM 100 TO 150
- TRAIN SIZE : VALIDATION SIZE = 90 : 10
[contd.]
- FUNCTIONAL MODEL WITH 2 INPUT LAYERS (SKETCHED BELOW)
- EMBEDDING LAYER DIMENSIONS: (30, 300)
- LSTM LAYER WITH OUTPUT SHAPE (NONE, 188)
- CONCATENATE FOLLOWED BY DROPOUT LAYER
- BATCH NORMALIZATION FOLLOWED BY DENSE LAYER WITH SHAPE (NONE, 119)
- SECOND DROPOUT LAYER
- BATCH NORMALIZATION LAYER
- SECOND DENSE LAYER
- LOSS FUNCTION = BINARY CROSSENTROPY
- OPTIMIZER = ‘Nadam’
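A sketch of this architecture with the Keras functional API (two inputs rule out a plain Sequential model), using 188 LSTM units and 119 dense units as the values drawn from the slide's ranges. The dropout rates and the ReLU activation are assumptions, num_words/embedding_matrix come from the sketch above, and q1_train/q2_train/y_train are assumed names for the padded inputs and labels:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, LSTM, concatenate,
                                     Dropout, BatchNormalization, Dense)

MAX_TEXT_LENGTH = 30

# Shared frozen GoogleNews embedding and a shared LSTM encoder.
embedding = Embedding(num_words, 300, weights=[embedding_matrix], trainable=False)
encoder = LSTM(188)  # units drawn from the 175-275 range

# One input layer per question.
q1_in = Input(shape=(MAX_TEXT_LENGTH,))
q2_in = Input(shape=(MAX_TEXT_LENGTH,))
q1_vec = encoder(embedding(q1_in))
q2_vec = encoder(embedding(q2_in))

# Merge layers: concatenate, then alternate dropout / batch norm / dense.
x = concatenate([q1_vec, q2_vec])
x = Dropout(0.2)(x)                          # rate assumed; not on the slide
x = BatchNormalization()(x)
x = Dense(119, activation="relu")(x)         # units drawn from the 100-150 range
x = Dropout(0.2)(x)
x = BatchNormalization()(x)
out = Dense(1, activation="sigmoid")(x)      # second dense layer

model = Model(inputs=[q1_in, q2_in], outputs=out)
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])

# Batch size 2048, 27 epochs, 90:10 train/validation split.
model.fit([q1_train, q2_train], y_train, batch_size=2048, epochs=27,
          validation_split=0.10)
```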
[contd.]
Accuracy on Validation Set: 81.61%
BERT – TRANSFER LEARNING USING TRANSFORMERS
TOKENIZERS
- PREDEFINE PARAMETERS
- INDEX WORD VECTORS
- DROP NULL VALUES
- SPLIT DATASET INTO TRAIN, TEST AND VALIDATION SETS
- CHECK TARGET DISTRIBUTION
- DEFINE PREPROCESSING AND BERT SEMANTIC DATA GENERATOR FUNCTION (SKETCHED BELOW)
- LOAD BERT TOKENIZER (bert-base-uncased)
- DEFINE NUMBER OF BATCHES PER EPOCH
- ENCODE BOTH QUESTIONS TOGETHER, SEPARATED BY THE [SEP] TOKEN
- CONVERT ENCODED FEATURES TO NUMPY ARRAYS IN BATCHES
- SHUFFLE INDEXES
- CREATE MODEL UNDER DISTRIBUTION STRATEGY SCOPE
- LOAD PRETRAINED BERT MODEL
- SET UP ENCODED TOKEN IDS FROM BERT TOKENIZER
- SET UP ATTENTION MASKS
- SET UP TOKEN TYPE IDS
- FREEZE THE BERT MODEL TO REUSE THE PRETRAINED FEATURES WITHOUT MODIFYING THEM
- ADD TRAINABLE LAYERS
- COMPILE & TRAIN MODEL
- EVALUATE ON TEST SET
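A sketch of the BERT semantic data generator, closely following the Keras example cited in the references; sentence_pairs is assumed to be a NumPy array of (question1, question2) pairs and labels the matching target array:

```python
import numpy as np
import tensorflow as tf
import transformers

MAX_LENGTH = 50
BATCH_SIZE = 168

# Load the BERT tokenizer (bert-base-uncased).
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")

class BertSemanticDataGenerator(tf.keras.utils.Sequence):
    """Yields batches of ([token ids, attention masks, token type ids], labels)."""

    def __init__(self, sentence_pairs, labels, shuffle=True):
        self.sentence_pairs = sentence_pairs
        self.labels = labels
        self.shuffle = shuffle
        self.indexes = np.arange(len(sentence_pairs))
        self.on_epoch_end()

    def __len__(self):
        # Number of batches per epoch.
        return len(self.sentence_pairs) // BATCH_SIZE

    def __getitem__(self, idx):
        idxs = self.indexes[idx * BATCH_SIZE:(idx + 1) * BATCH_SIZE]
        # Encode both questions together; the tokenizer inserts the
        # [SEP] token between them and returns NumPy arrays per batch.
        encoded = tokenizer.batch_encode_plus(
            self.sentence_pairs[idxs].tolist(),
            max_length=MAX_LENGTH, padding="max_length",
            truncation=True, return_tensors="np")
        features = [encoded["input_ids"], encoded["attention_mask"],
                    encoded["token_type_ids"]]
        return features, self.labels[idxs]

    def on_epoch_end(self):
        # Shuffle indexes between epochs.
        if self.shuffle:
            np.random.shuffle(self.indexes)
```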
- PRE-DEFINED VARIABLES USED:
- MAXIMUM TEXT LENGTH = 50
- EMBEDDING DIMENSION = 300
- BATCH SIZE FOR NEURAL NETWORK TRAINING = 168
- EPOCHS = 2 (SELECTED ON THE BASIS OF COMPUTING RESOURCES AND TIME)
- TRAIN SIZE : TEST SIZE : VALIDATION SIZE = 70 : 15 : 15
[contd.]
- INPUT LAYER: (NONE, 50)
- ATTENTION MASK: (NONE, 50)
- TOKEN TYPE IDS: (NONE, 50)
- TF_BERT_MODEL: (NONE, 50, 768)
- BIDIRECTIONAL LSTM: (NONE, 50, 256)
- GLOBAL AVERAGE POOLING: (NONE, 256)
- GLOBAL MAX POOLING: (NONE, 256)
- CONCATENATE: (NONE, 512)
- DROPOUT: (NONE, 512)
- DENSE: (NONE, 2)
- LOSS FUNCTION = CATEGORICAL CROSS-ENTROPY
- OPTIMIZER = ‘adam’ (LAYER STACK SKETCHED BELOW)
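A sketch of this layer stack, again following the cited Keras example; the LSTM width (128 units per direction, giving the (None, 50, 256) bidirectional output above) and the dropout rate are assumptions consistent with the listed shapes:

```python
import tensorflow as tf
import transformers

MAX_LENGTH = 50

# Create the model under a distribution strategy scope.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Encoded token ids, attention masks and token type ids, each (None, 50).
    input_ids = tf.keras.layers.Input(shape=(MAX_LENGTH,), dtype=tf.int32)
    attention_masks = tf.keras.layers.Input(shape=(MAX_LENGTH,), dtype=tf.int32)
    token_type_ids = tf.keras.layers.Input(shape=(MAX_LENGTH,), dtype=tf.int32)

    # Load pretrained BERT and freeze it to reuse its features unchanged.
    bert_model = transformers.TFBertModel.from_pretrained("bert-base-uncased")
    bert_model.trainable = False
    sequence_output = bert_model.bert(
        input_ids, attention_mask=attention_masks,
        token_type_ids=token_type_ids).last_hidden_state      # (None, 50, 768)

    # Trainable head: bidirectional LSTM, dual pooling, dropout, softmax.
    bi_lstm = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True))(sequence_output)  # (None, 50, 256)
    avg_pool = tf.keras.layers.GlobalAveragePooling1D()(bi_lstm)  # (None, 256)
    max_pool = tf.keras.layers.GlobalMaxPooling1D()(bi_lstm)      # (None, 256)
    concat = tf.keras.layers.concatenate([avg_pool, max_pool])    # (None, 512)
    dropout = tf.keras.layers.Dropout(0.3)(concat)                # rate assumed
    output = tf.keras.layers.Dense(2, activation="softmax")(dropout)  # (None, 2)

    model = tf.keras.models.Model(
        inputs=[input_ids, attention_masks, token_type_ids], outputs=output)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
```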
[contd.]
Accuracy on Validation Set: 89.91%
- https://keras.io/examples/nlp/semantic_similarity_with_bert/
- https://www.kaggle.com/competitions/quora-question-pairs
- https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300
- https://huggingface.co/datasets?task_ids=task_ids:semantic-similarity-classification
- https://huggingface.co/bert-base-uncased
- https://huggingface.co/tftransformers/bert-base-uncased