Comparative Analysis of RNN Architectures for IMDb Sentiment Classification

This project performs a controlled comparative analysis of recurrent neural architectures (RNN, LSTM, Bidirectional LSTM) for binary sentiment classification on the IMDb movie reviews dataset.
The experiments systematically vary activation function, optimizer, sequence length, and gradient clipping, while reporting Accuracy, F1 (macro), and training time per epoch under CPU-only constraints.

Repository Structure

project_root/
├── report.pdf
├── README.md
├── requirements.txt
├── data/
│   ├── imdb_seq_len25.npz
│   ├── imdb_seq_len50.npz
│   ├── imdb_seq_len100.npz
│   ├── imdb_stats.json
│   ├── imdb_vocab.pkl
│   └── raw/
│       └── IMDB_Dataset.csv
├── results/
│   ├── metrics.csv
│   ├── summary_table.csv
│   ├── losses/
│   └── plots/
│       ├── acc_vs_seq_length.png
│       ├── f1_vs_seq_length.png
│       ├── loss_curve_best.png
│       └── loss_curve_worst.png
└── src/
    ├── preprocess.py
    ├── models.py
    ├── train.py
    ├── run_experiments.py
    ├── evaluate.py
    ├── plot_losses.py
    ├── plot_metrics.py
    └── utils.py

Environment Setup

Tested with Python 3.12.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Key packages: torch, numpy, pandas, scikit-learn, nltk, tqdm, matplotlib.

Data

The project uses the IMDb dataset (50,000 reviews, 25k train / 25k test, balanced classes).
Place the CSV at data/raw/IMDB_Dataset.csv. Preprocessing lowercases the text, removes punctuation, tokenizes, caps the vocabulary at 10,000 words, and produces padded datasets for sequence lengths of 25, 50, and 100.

python src/preprocess.py
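For orientation, a minimal sketch of the preprocessing steps described above (lowercasing, punctuation removal, tokenization, a 10,000-word vocabulary, and fixed-length padding). Function names such as tokenize, build_vocab, and encode are illustrative assumptions, not necessarily those used in src/preprocess.py:

import string
from collections import Counter

def tokenize(text):
    # lowercase, strip punctuation, split on whitespace
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()

def build_vocab(token_lists, max_words=10000):
    # keep only the most frequent words; 0 = padding, 1 = unknown (assumed convention)
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(max_words))}

def encode(tokens, vocab, seq_len):
    # map tokens to ids, then truncate/pad to the fixed sequence length
    ids = [vocab.get(t, 1) for t in tokens][:seq_len]
    return ids + [0] * (seq_len - len(ids))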

Model Configuration

All models share the same base configuration for fairness:
  • Embedding dimension: 100
  • Hidden size: 64
  • Layers: 2
  • Dropout: 0.3
  • Batch size: 32
  • Loss function: Binary Cross-Entropy
  • Activations tested: Sigmoid, Tanh, ReLU
  • Optimizers tested: Adam, RMSprop, SGD
  • Sequence lengths: 25, 50, 100
  • Gradient clipping: Enabled / Disabled
  • Epochs: 8
  • Seed: 42
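As an illustration, an LSTM variant matching this configuration might look like the sketch below. The class name and constructor arguments are assumptions, not necessarily the code in src/models.py; the configurable activation and padding handling are omitted for brevity:

import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_size=64,
                 num_layers=2, dropout=0.3, bidirectional=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=num_layers,
                            dropout=dropout, bidirectional=bidirectional,
                            batch_first=True)
        out_dim = hidden_size * (2 if bidirectional else 1)
        self.fc = nn.Linear(out_dim, 1)  # single logit for binary sentiment

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        # use the last time step's output as the sequence summary (a common simplification)
        return self.fc(out[:, -1, :]).squeeze(-1)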

Controlled Experiment Design

Each experiment varies one factor at a time while holding the others fixed, ensuring a valid comparison.

Stage | Factor Varied                     | Fixed Configuration             | Runs
A     | Architecture (RNN, LSTM, BiLSTM)  | ReLU + Adam + seq=50 + no clip  | 3
B     | Activation (Sigmoid, ReLU, Tanh)  | LSTM + Adam + seq=50 + no clip  | 3
C     | Optimizer (Adam, SGD, RMSprop)    | LSTM + ReLU + seq=50 + no clip  | 3
D     | Sequence Length (25, 50, 100)     | LSTM + ReLU + Adam + no clip    | 3
E     | Gradient Clipping (On vs Off)     | LSTM + ReLU + Adam + seq=50     | 2

Note: Configurations that recur across stages were run only once; no duplicate runs with identical parameters were performed.

Run all experiments with:

python src/run_experiments.py
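A rough sketch of what a single training run could look like, including the optional gradient clipping varied in Stage E. This pairs BCEWithLogitsLoss with the single-logit model sketched above; the function name and the max_norm value are assumptions:

import torch
import torch.nn as nn

def train_one_run(model, train_loader, optimizer, epochs=8, clip_grad=False):
    criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on raw logits
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y.float())
            loss.backward()
            if clip_grad:
                # clip the gradient norm; the exact max_norm used in the project is not specified here
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()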

Evaluation

After training, generate plots and summary tables:

python src/evaluate.py

Outputs include accuracy/F1 plots and loss curves under results/plots/, plus summary CSVs under results/.
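A minimal sketch of how the accuracy-vs-sequence-length plot could be produced from results/metrics.csv; the column names (model, optimizer, seq_len, accuracy) are assumptions about the CSV layout:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("results/metrics.csv")
# Stage D runs: LSTM + ReLU + Adam with varying sequence length
subset = df[(df["model"] == "LSTM") & (df["optimizer"] == "Adam")].sort_values("seq_len")
plt.plot(subset["seq_len"], subset["accuracy"], marker="o")
plt.xlabel("Sequence length")
plt.ylabel("Test accuracy")
plt.savefig("results/plots/acc_vs_seq_length.png")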

Key Results

  • Best configuration: LSTM (ReLU, Adam, seq=100, no clip) - Accuracy: 0.815, F1: 0.815, Time/Epoch: 17.46s
  • Worst configuration: LSTM (ReLU, SGD, seq=50, no clip) - Accuracy: 0.500, F1: 0.498

Longer sequences improved performance, and Adam provided the best balance of speed and stability. Gradient clipping slightly stabilized training without a significant accuracy gain.

Reproducibility

All experiments are reproducible. Random seeds were fixed across PyTorch, NumPy, and Python, and deterministic algorithms were enabled for consistent results.
All models were trained in a CPU-only environment with 8 GB RAM using the IMDb dataset’s standard split. Identical configurations guarantee the same metrics on rerun.
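Seed fixing along these lines can be done as in the sketch below; the project's utils.py may implement it differently:

import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(True)  # enforce deterministic ops where supported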

Outputs

  • Preprocessed data in data/
  • Metrics in results/metrics.csv
  • Plots in results/plots/
  • Final report: report.pdf

License

For academic and educational use only. Please refer to the IMDb dataset license on Kaggle for redistribution terms.
