
QUORA QUESTION PAIR SIMILARITIES USING NBC, LSTM AND BERT

NORTHEASTERN UNIVERSITY – MS IN INFORMATION SYSTEMS

INFO 7610 – NATURAL LANGUAGE ENGINEERING METHODS AND TOOLS

AKSHAY SURESH BHOSALE

QUORA QUESTION PAIR SIMILARITIES
  1. NBC
    1. LEMMATIZATION, PREPROCESSING
    2. VECTORIZATION
    3. MODELING
  2. LSTM - SCRATCH
    1. PREPROCESSING
    2. TOKENIZATION
    3. PADDING
    4. NEURAL NETWORK MODELING
  3. LSTM USING GOOGLE NEWS VECTORS
    1. INDEXING VECTORS
    2. PREPROCESSING
    3. TOKENIZATION
    4. PADDING & EMBEDDING
    5. NEURAL NETWORK MODELING
  4. BERT
    1. PREPROCESSING
    2. SEMANTIC DATA GENERATOR
    3. DISTRIBUTION STRATEGY SCOPE
    4. MODELING
NAÏVE BAYES CLASSIFIER
  1. IMPORT DATASET
  2. EXPLORE DATASET
    1. DROP NULL VALUES
  3. PREPROCESSING THE TEXT
    1. LEMMATIZE
    2. REMOVE STOP WORDS
    3. CONCAT QUESTION 1 AND QUESTION 2
  4. VECTORIZING USING TF-IDF
  5. SPLIT DATASET INTO TRAIN-TEST
  6. MODEL TRAINING
  7. EVALUATE MODEL ON TEST SET
NAÏVE BAYES CLASSIFIER

Lemmatize and Remove Stop Words

Combined Question 1 & Question 2

NAÏVE BAYES CLASSIFIER

Vectorization using TF-IDF

Accuracy on Test Set = 74.06%
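
A minimal sketch of the pipeline described above, assuming the standard Kaggle `train.csv` columns (`question1`, `question2`, `is_duplicate`) and scikit-learn's `MultinomialNB`; the exact preprocessing and split ratio used in the repository may differ.

```python
import pandas as pd
from nltk.stem import WordNetLemmatizer          # requires nltk.download('wordnet')
from nltk.corpus import stopwords                # requires nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the Kaggle Quora Question Pairs training data and drop null questions
df = pd.read_csv("train.csv").dropna(subset=["question1", "question2"])

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Lemmatize tokens and drop stop words
    tokens = [lemmatizer.lemmatize(w) for w in str(text).lower().split()
              if w not in stop_words]
    return " ".join(tokens)

# Concatenate question 1 and question 2 into a single string per pair
df["combined"] = (df["question1"].map(preprocess) + " "
                  + df["question2"].map(preprocess))

# TF-IDF vectorization, train/test split, and a Multinomial Naïve Bayes model
X = TfidfVectorizer().fit_transform(df["combined"])
y = df["is_duplicate"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = MultinomialNB().fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```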
LSTM (LONG SHORT-TERM MEMORY): SCRATCH
  1. EXPLORE DATASET
    1. SHAPE, SIZE AND DUPLICATES
    2. FILL MISSING VALUES USING ‘FILLNA()’ METHOD
  2. DEFINE FUNCTION FOR TEXT PREPROCESSING:
    1. REMOVE PUNCTUATION
    2. REMOVE STOP WORDS
    3. USE REGULAR EXPRESSIONS LIBRARY TO SORT THROUGH MISSPELLED WORDS
    4. STEMMING OF TEXT
  3. TOKENIZE TEXT AND REPLACE WORDS WITH TOKEN IDS
  4. ADD PADDING TO SEQUENCES USING KERAS PREDEFINED METHODS (see the sketch after this list)
  5. NEURAL NETWORK MODELING
    1. DEFINE LSTM MODEL
    2. TRAIN MODEL
    3. CHECK VALIDATION ACCURACY FOR OPTIMUM RESULT
  6. PREDICT ON TEST SET
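
A minimal sketch of the preprocessing, tokenization and padding steps (2-4) above, using NLTK for stop-word removal and stemming and the Keras `Tokenizer`/`pad_sequences` utilities; the column names and cleaning regex are assumptions, and the maximum text length of 50 comes from the next slide.

```python
import re
import pandas as pd
from nltk.corpus import stopwords            # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_TEXT_LEN = 50  # maximum text length from the slides

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_text(text):
    # Remove punctuation, drop stop words, and stem the remaining tokens
    text = re.sub(r"[^a-z0-9\s]", " ", str(text).lower())
    return " ".join(stemmer.stem(w) for w in text.split() if w not in stop_words)

df = pd.read_csv("train.csv").fillna("")
q1 = df["question1"].map(clean_text)
q2 = df["question2"].map(clean_text)

# Fit one tokenizer on all questions so both columns share a vocabulary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(pd.concat([q1, q2]).tolist())

# Convert text to integer token sequences and pad to a fixed length
q1_pad = pad_sequences(tokenizer.texts_to_sequences(q1), maxlen=MAX_TEXT_LEN)
q2_pad = pad_sequences(tokenizer.texts_to_sequences(q2), maxlen=MAX_TEXT_LEN)
```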
LSTM (LONG SHORT-TERM MEMORY)
  • PRE-DEFINED VARIABLES USED:
    • MAXIMUM TEXT LENGTH = 50
    • MAXIMUM TOKEN LENGTH = MAX[ MAX(TRAINING TOKENS), MAX(TESTING TOKENS) ] USING ‘NP.MAX’
    • BATCH SIZE FOR NEURAL NETWORK TRAINING = 128
    • EPOCHS = 16 (RANDOMLY SELECTED)
    • TRAIN SIZE : VALIDATION SIZE = 82 : 18

[contd.]

  • SEQUENTIAL MODEL (see the sketch after this list)
  • EMBEDDING LAYER DIMENSIONS: MAX_TOKEN + 1000, 32
  • DROPOUT AT 0.
  • LSTM LAYER WITH SHAPE (NONE, 32)
  • DROPOUT LAYER 2 AT 0.
  • DENSE LAYER USING SIGMOID ACTIVATION FUNCTION
  • LOSS FUNCTION = BINARY CROSSENTROPY
  • OPTIMIZER = ‘RMSPROP’
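
A sketch of the layer stack listed above, continuing from the padded sequences in the previous sketch. The dropout rate is truncated in the slides, so 0.2 below is an assumption, and the exact way the two question sequences are fed to this single-input model is not specified, so the training call is left as a comment.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

# Largest token id over training and testing tokens, per the slide (np.max)
max_token = int(np.max([q1_pad.max(), q2_pad.max()]))

model = Sequential([
    # Vocabulary padded by 1000 extra slots, 32-dimensional embeddings
    Embedding(input_dim=max_token + 1000, output_dim=32),
    Dropout(0.2),          # rate truncated in the slides; 0.2 is an assumption
    LSTM(32),              # output shape (None, 32)
    Dropout(0.2),          # second dropout, same assumed rate
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="rmsprop",
              metrics=["accuracy"])

# Training with the slide's batch size, epochs and 82:18 train/validation split;
# the exact input arrangement of the question pairs is not given in the slides.
# model.fit(X, y, batch_size=128, epochs=16, validation_split=0.18)
```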

LSTM (Model Architecture & Layers)

[contd.]

LSTM (RESULTS) [contd.]
Accuracy on Validation Set: 78.67%
LSTM USING GOOGLE NEWS VECTORS EMBEDDINGS
  1. PREDEFINE PARAMETERS
  2. INDEX WORD VECTORS
  3. TEXT PREPROCESSING
    1. REMOVE STOP WORDS
    2. CONVERT TEXT TO LOWERCASE
    3. REMOVE PUNCTUATION
    4. STEMMING TEXTS
  4. TOKENIZATION: CONVERT TEXT TO SEQUENCES
  5. PAD SEQUENCES TO DETERMINE SHAPE TO BUILD NEURAL NETWORK TENSOR
  6. PREPARE EMBEDDING MATRIX (see the sketch after this list)
  7. SPLIT TRAINING SET INTO TRAIN AND VALIDATION SET
  8. DEFINE NEURAL NETWORK
    1. SET UP EMBEDDING LAYER
    2. SET UP LSTM LAYER
    3. SET UP SEQUENTIAL LAYERS
    4. MERGE LAYERS
    5. CHECK WHETHER TO RE-WEIGHT CLASSES TO FIT 17.5% SHARE IN TEST SET
    6. COMPILE AND TRAIN THE MODEL
    7. CHECK VALIDATION ACCURACY
  9. PREDICT ON TEST SET
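
A sketch of indexing the GoogleNews word2vec vectors with gensim and preparing the embedding matrix (steps 2 and 4-6) above; the vectors file path and the `texts_q1` / `texts_q2` variables (the preprocessed question strings from step 3) are assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_SEQUENCE_LENGTH = 30   # maximum text length from the next slide
EMBEDDING_DIM = 300        # GoogleNews vectors are 300-dimensional

# Index the pre-trained word vectors (file path is an assumption)
word2vec = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

# texts_q1 / texts_q2 are the preprocessed question strings from step 3
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts_q1 + texts_q2)
q1_data = pad_sequences(tokenizer.texts_to_sequences(texts_q1),
                        maxlen=MAX_SEQUENCE_LENGTH)
q2_data = pad_sequences(tokenizer.texts_to_sequences(texts_q2),
                        maxlen=MAX_SEQUENCE_LENGTH)

# Build the embedding matrix: row i holds the vector of the word with index i
num_words = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    if word in word2vec.key_to_index:          # gensim >= 4 vocabulary lookup
        embedding_matrix[i] = word2vec[word]
```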
LSTM USING GOOGLE NEWS VECTORS EMBEDDINGS
  • PRE-DEFINED VARIABLES USED:
    • MAXIMUM TEXT LENGTH = 30
    • EMBEDDING DIMENSION = 300
    • BATCH SIZE FOR NEURAL NETWORK TRAINING = 2048
    • EPOCHS = 27 (RANDOMLY SELECTED)
    • LSTM LAYER UNITS: CHOSEN RANDOMLY BETWEEN 175 AND 275
    • DENSE LAYER UNITS: CHOSEN RANDOMLY BETWEEN 100 AND 150
    • TRAIN SIZE : VALIDATION SIZE = 90 : 10

[contd.]

  • MODEL WITH 2 INPUT LAYERS (see the sketch after this list)
  • EMBEDDING LAYER DIMENSIONS: 30, 300
  • LSTM WITH SHAPE (NONE, 188)
  • CONCATENATE FOLLOWED BY DROPOUT LAYER
  • BATCH NORMALIZATION FOLLOWED BY DENSE LAYER WITH SHAPE (NONE, 119)
  • SECOND DROPOUT LAYER
  • BATCH NORMALIZATION LAYER
  • SECOND DENSE LAYER
  • LOSS FUNCTION = BINARY CROSSENTROPY
  • OPTIMIZER = ‘NADAM’
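
A sketch of the two-input architecture above using the Keras functional API, reusing `num_words`, `embedding_matrix`, `EMBEDDING_DIM` and `MAX_SEQUENCE_LENGTH` from the previous sketch. The 188 LSTM units and 119 dense units mirror the shapes on the slide, while the dropout rates and the ReLU activation are assumptions.

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense, Dropout,
                                     BatchNormalization, concatenate)

n_lstm, n_dense = 188, 119   # drawn from the 175-275 / 100-150 ranges on the slide

# Shared embedding layer initialised with the GoogleNews matrix and frozen
embedding_layer = Embedding(num_words, EMBEDDING_DIM, weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH, trainable=False)
shared_lstm = LSTM(n_lstm)

q1_input = Input(shape=(MAX_SEQUENCE_LENGTH,))
q2_input = Input(shape=(MAX_SEQUENCE_LENGTH,))
q1_vec = shared_lstm(embedding_layer(q1_input))
q2_vec = shared_lstm(embedding_layer(q2_input))

# Concatenate both question encodings, then regularise and classify
x = concatenate([q1_vec, q2_vec])
x = Dropout(0.2)(x)                 # rate not stated in the slides; an assumption
x = BatchNormalization()(x)
x = Dense(n_dense, activation="relu")(x)
x = Dropout(0.2)(x)
x = BatchNormalization()(x)
output = Dense(1, activation="sigmoid")(x)

model = Model(inputs=[q1_input, q2_input], outputs=output)
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
# model.fit([q1_data, q2_data], labels, batch_size=2048, epochs=27,
#           validation_split=0.10)
```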
LSTM using Google News Vectors Embeddings (Model Architecture & Layers)

[contd.]

LSTM (RESULTS) [contd.]

Accuracy on Validation Set: 81.61%

BERT – TRANSFER LEARNING USING TRANSFORMERS TOKENIZERS
  1. PREDEFINE PARAMETERS
  2. INDEX WORD VECTORS
  3. DROP NULL VALUES
  4. SPLIT DATASET INTO TRAIN, TEST AND VALIDATION SET
  5. CHECK TARGET DISTRIBUTION
  6. DEFINE PREPROCESSING AND BERT SEMANTIC DATA GENERATOR FUNCTION (see the sketch after this list)
    1. LOAD BERT TOKENIZER (BERT-BASE-UNCASED)
    2. DEFINE NUMBER OF BATCHES PER EPOCH
    3. ENCODE BOTH QUESTIONS TOGETHER SEPARATED BY SEP TOKEN
    4. CONVERT ENCODED FEATURES TO NUMPY ARRAYS IN BATCHES
    5. SHUFFLE INDEXES
  7. CREATE MODEL UNDER DISTRIBUTION STRATEGY SCOPE
  8. LOAD PRETRAINED BERT MODEL
    1. SET UP ENCODED TOKEN IDS FROM BERT TOKENIZER
    2. SET UP ATTENTION MASKS
    3. SET UP TOKEN TYPE IDS
    4. LOAD PRETRAINED BERT MODEL
    5. FREEZE THE BERT MODEL TO REUSE THE PRETRAINED FEATURES WITHOUT MODIFYING THEM
    6. ADD TRAINABLE LAYERS
    7. COMPILE & TRAIN MODEL
  9. EVALUATE ON TEST SET
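
A sketch of the encoding step inside the semantic data generator (step 6): both questions are passed to the BERT tokenizer together so it inserts the SEP token, and the encoded ids, attention masks and token type ids are returned as NumPy arrays. The batching and index shuffling done by the full generator are omitted, and the helper name below is illustrative.

```python
import numpy as np
from transformers import BertTokenizer

MAX_LENGTH = 50  # maximum text length from the next slide
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode_pairs(questions1, questions2):
    # Encode both questions together; the tokenizer builds [CLS] q1 [SEP] q2 [SEP]
    encoded = tokenizer(
        list(questions1), list(questions2),
        padding="max_length", truncation=True, max_length=MAX_LENGTH,
        return_attention_mask=True, return_token_type_ids=True,
    )
    return (np.array(encoded["input_ids"]),
            np.array(encoded["attention_mask"]),
            np.array(encoded["token_type_ids"]))
```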
BERT – TRANSFER LEARNING USING TRANSFORMERS TOKENIZERS
  • PRE-DEFINED VARIABLES USED:
    • MAXIMUM TEXT LENGTH = 50
    • EMBEDDING DIMENSION = 300
    • BATCH SIZE FOR NEURAL NETWORK TRAINING = 168
    • EPOCHS = 2 (SELECTED ON THE BASIS OF COMPUTING RESOURCE AND TIME)
    • TRAIN SIZE : TEST SIZE : VALIDATION SIZE = 70 : 15 : 15

[contd.]

  • INPUT LAYER: NONE, 50 (see the sketch after this list)
  • ATTENTION MASK: NONE, 50
  • TOKEN TYPE IDS: NONE, 50
  • TF_BERT_MODEL: NONE, 50, 768
  • BIDIRECTIONAL LSTM: NONE, 50, 256
  • GLOBAL AVERAGE POOLING: NONE, 256
  • GLOBAL MAX POOLING: NONE, 256
  • CONCATENATE: NONE, 512
  • DROPOUT: NONE, 512
  • DENSE: NONE, 2
  • LOSS FUNCTION = CATEGORICAL CROSS-ENTROPY
  • OPTIMIZER = ‘ADAM’
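
A sketch of the layer stack listed above, following the Keras semantic-similarity example cited in the references: a frozen `TFBertModel`, a bidirectional LSTM over the sequence output, global average and max pooling, concatenation, dropout and a 2-way softmax. The Bi-LSTM unit count (128 per direction) is implied by the (None, 50, 256) shape; the dropout rate is an assumption, and the distribution strategy scope from the slides is omitted for brevity.

```python
import tensorflow as tf
from transformers import TFBertModel

MAX_LENGTH = 50

input_ids = tf.keras.layers.Input(shape=(MAX_LENGTH,), dtype=tf.int32)
attention_mask = tf.keras.layers.Input(shape=(MAX_LENGTH,), dtype=tf.int32)
token_type_ids = tf.keras.layers.Input(shape=(MAX_LENGTH,), dtype=tf.int32)

# Load and freeze the pre-trained BERT encoder
bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False
sequence_output = bert(input_ids, attention_mask=attention_mask,
                       token_type_ids=token_type_ids).last_hidden_state  # (None, 50, 768)

# Trainable head: Bi-LSTM, global average/max pooling, dropout, 2-way softmax
x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True))(sequence_output)  # (None, 50, 256)
avg_pool = tf.keras.layers.GlobalAveragePooling1D()(x)   # (None, 256)
max_pool = tf.keras.layers.GlobalMaxPooling1D()(x)       # (None, 256)
x = tf.keras.layers.concatenate([avg_pool, max_pool])    # (None, 512)
x = tf.keras.layers.Dropout(0.3)(x)                      # rate assumed
output = tf.keras.layers.Dense(2, activation="softmax")(x)

model = tf.keras.Model(inputs=[input_ids, attention_mask, token_type_ids],
                       outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```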

BERT – Transfer Learning using Transformers Tokenizers

[contd.]

BERT (RESULTS)

[contd.]

Accuracy on Validation Set: 89.91%

REFERENCES:
  1. https://keras.io/examples/nlp/semantic_similarity_with_bert/
  2. https://www.kaggle.com/competitions/quora-question-pairs
  3. https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300
  4. https://huggingface.co/datasets?task_ids=task_ids:semantic-similarity-classification
  5. https://huggingface.co/bert-base-uncased
  6. https://huggingface.co/tftransformers/bert-base-uncased

THANK YOU SO MUCH
