Phase 2 Report
Report
on
“SPEECH BASED EMOTION RECOGNITION USING 2D
CNN LSTM NETWORKS”
Submitted in partial fulfillment of the requirements for the award of the degree of
Bachelor of Engineering
in
Computer Science & Engineering
Submitted by
USN Name
1BI19CS078 KOUSHIK G G
1BI19CS079 LITHISH S
1BI19CS093 NAGARAJ BHAT
1BI19CS100 NIKHIL KORNAYA
USN Name
1BI19CS078 KOUSHIK G G
1BI19CS079 LITHISH S
1BI19CS093 NAGARAJ BHAT
1BI19CS100 NIKHIL KORNAYA
bonafide students of VIII semester in partial fulfillment for the award of Bachelor of
Engineering in Computer Science & Engineering of the VISVESVARAYA
TECHNOLOGICAL UNIVERSITY, Belagavi during the academic year 2022-23. It is
certified that all corrections/suggestions indicated for Internal Assessment have been
incorporated in the report deposited in the departmental library. The project report has been
approved as it satisfies the academic requirements in respect of Project work prescribed for
the said degree.
ACKNOWLEDGEMENT
The knowledge & satisfaction that accompany the successful completion of any
task would be incomplete without mention of people who made it possible, whose guidance
and encouragement crowned our effort with success. We would like to thank all and
acknowledge the help we have received to carry out this project.
We would like to convey our sincere thanks to Dr. Aswath M U, Principal, Bangalore
Institute of Technology, for being kind enough to provide the opportunity and platform to
complete and present our final year project “Speech Based Emotion Recognition Using 2D
CNN LSTM Networks”.
We would also like to thank Dr. Girija J., Professor and Head of the Department
of Computer Science and Engineering, Bangalore Institute of Technology, for her
constant encouragement and support in presenting our final year project “Speech Based
Emotion Recognition Using 2D CNN LSTM Networks”.
We are humbled to acknowledge the enthusiastic guidance of our guide,
Dr. Harish Kumar B T, for his ideas, timely suggestions, constant guidance, and
co-operation shown throughout the venture, which made this phase of the project fruitful.
We would also like to take this opportunity to thank our friends and family for their
constant support and help. We express our sincere gratitude for the friendly co-operation
shown by all the staff members of the Department of Computer Science and Engineering,
BIT.
1BI19CS078 KOUSHIK G G
1BI19CS079 LITHISH S
1BI19CS093 NAGARAJ BHAT
1BI19CS100 NIKHIL KORNAYA
ABSTRACT
Speech emotion recognition (SER) involves the identification of emotions
conveyed in spoken language through analysis of speech signals. With the growing
popularity of smart devices, SER has gained significant attention in recent years. One
approach to SER is to use deep learning models such as Convolutional Neural Networks
(CNNs) and Long Short-Term Memory (LSTM) networks. The purpose of this project is
to introduce a new methodology for performing SER using a 2D CNN-LSTM architecture.
The proposed model first uses a 2D CNN to extract the relevant features from the speech
signal, followed by an LSTM network for sequence modeling. We evaluated our model on
the Berlin Emotional Speech Database (EMO-DB), achieving state-of-the-art results. We
also compared our model's performance with other existing SER models and found that it
outperformed them. The results of our project demonstrate that
the proposed 2D CNN-LSTM architecture is an efficient method for SER and can be used
in real-world applications such as emotion recognition from voice assistants, call centers,
and customer service applications.
TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION
1.1 Overview
1.2 Objectives
1.3 Purpose, Scope, and Applicability
1.3.1 Purpose
1.3.2 Scope
1.3.3 Applicability
1.4 Organization of Report
REFERENCES
1.2 Objectives
• To accurately identify the emotions of the speaker.
• To enhance human-computer interaction.
• To detect changes in human emotional states.
1.3.2 Scope
The scope of Speech Emotion Recognition (SER) is vast, and its applications are
numerous. SER can be used in mental health assessments, speech-enabled virtual assistants,
human-robot interaction, customer service, entertainment, and more. Its potential
applications are diverse, limited only by our ability to imagine new use cases. With the
increasing availability of speech data and advances in machine learning, this scope
continues to expand.
1.3.3 Applicability
Speech Emotion Recognition has broad applicability in mental health, virtual
assistants, customer service, entertainment, education, and more. It enables machines to
detect and respond to human emotions in a more natural and empathetic way, improving
human-machine interaction and leading to new applications and innovations.
• Chapter 2 of this document describes the Literature Survey. It provides details about
the existing systems, their limitations, and the proposed system for the project.
• Chapter 4 is the Gantt Chart which is a bar chart showing the project schedule.
• Chapter 5 gives information about the system architecture, system design, interface
design and algorithm design.
• Chapter 9 describes the Conclusion, Applications, Limitations and Future Work of the
project.
LITERATURE SURVEY
2.1 Introduction
The field of Speech Emotion Recognition (SER) encompasses the creation of
computational models and algorithms that enable the identification and analysis of
emotions conveyed through spoken language by examining speech signals. SER is an
interdisciplinary field that draws upon expertise from speech processing, machine learning,
psychology, and neuroscience. The goal of SER is to enable machines to detect, understand,
and respond to human emotions expressed through speech. This can have a significant
impact on many fields, including mental health assessments, speech-enabled virtual
assistants, customer service, education, entertainment, and more. SER can help improve
mental health care by providing clinicians with a new tool for assessing and monitoring
changes in patients' emotional states. It can also enhance the user experience of virtual
assistants and chatbots by enabling them to interact with users in a more personalized and
empathetic way. Furthermore, SER can be used to develop emotion-based content filtering
and recommendation systems, creating new opportunities in the entertainment industry. As
the availability of speech data and advances in machine learning and natural language
processing techniques continue to grow, the potential of SER is vast, leading to new and
innovative applications that can improve human-machine interaction and our
understanding of human emotions.
Drawbacks of the proposed SER system include dependency on training data quality,
computational complexity, limited emotion classification, and sensitivity to noise and
variability.
Drawbacks: The study has limited evaluation metrics and uses only one dataset, leading
to potential overfitting. Comparison with other models is also insufficient, and the feature
set used is limited.
Drawbacks: The study only uses one dataset and evaluation metric, potentially limiting
generalizability. The speaker-dependent approach requires individual training, which may
not be practical for large-scale applications.
Drawbacks: The study only uses one dataset, and the evaluation metrics used are not as
comprehensive as those in other studies. The hybrid approach requires training two separate
models, which may not be practical for real-time applications.
“Revisiting Hidden Markov Models for Speech Emotion Recognition” [6]: This study
explores the use of Hidden Markov Models (HMMs) for speech emotion recognition, using
the Emotional Prosody Speech and Transcripts (EPST) database. The study uses the Mel-
frequency cepstral coefficients (MFCC) as features and trains a separate HMM model for
each emotional state. The results show that HMM achieved an overall accuracy of 74.6%
for six emotions, outperforming other machine learning algorithms such as k-Nearest
Neighbors and SVM. The study suggests the potential of HMM for speech emotion
recognition and encourages further research to improve its performance with larger datasets
and more complex models.
Drawbacks: The study only uses one dataset and evaluation metric, which limits
generalizability. The use of separate models for each emotional state may not be practical
for real-time applications. The evaluation, while reporting good accuracy, is not as
comprehensive as in studies that use multiple evaluation metrics.
Drawbacks: Some drawbacks of using speech features and word embedding for speech
emotion recognition include their limited ability to capture contextual information, the need
for domain-specific language models, and the difficulty of representing emotions that are
expressed through non-verbal cues. Additionally, these methods may not be suitable for
real-time emotion recognition applications due to their high computational cost and the
need for significant processing power.
Drawbacks: Some drawbacks of using K-Nearest Neighbor (KNN) classifiers for speech
emotion recognition include their sensitivity to irrelevant features, the need for large
amounts of labeled data, and the difficulty of optimizing the distance metric used for
classification. Additionally, KNN classifiers may not be suitable for real-time applications
due to their high computational cost and the need for significant memory resources.
Drawbacks: Some drawbacks of using Artificial Neural Networks (ANNs) and Recurrent
Neural Networks (RNNs) for speech emotion recognition include the difficulty of
interpreting the learned representations, the need for large amounts of labeled data for
training, and the tendency to overfit the training data. Additionally, RNNs may suffer from
vanishing gradients during training, which can lead to difficulties in capturing long-term
dependencies in the data.
CHAPTER 3
REQUIREMENT ENGINEERING
3.1 Software and Hardware Tools used
Some of the specific requirements of the proposed system are:
• Streamlit: It allows developers to create web apps with minimal setup, using Python
code to create interactive data visualizations, dashboards, and more.
• Pyaudio: It provides a simple and flexible interface for capturing and playing back
audio data in real time, making it well suited to speech recognition and music
processing applications.
• Wave: It supports reading and writing of uncompressed WAV files, making it a simple
and reliable tool for audio file manipulation.
• Numpy: It provides support for many mathematical operations on arrays and matrices,
including linear algebra, Fourier transforms, and random number generation.
• TensorFlow: It includes different high-level APIs for building and training neural
networks, and support for distributed training and deployment on different platforms.
• Sklearn: It provides many machine learning algorithms, evaluation metrics and data
preprocessing tools for building and evaluating predictive models.
• Pandas: It provides support for data cleaning, reshaping, and analysis, including
powerful tools for data aggregation and grouping.
• Pathlib: It provides an object-oriented interface for working with file paths and
directory operations, making it easier to write portable and platform-independent code.
• Librosa: It provides a variety of tools for manipulating audio data, including support
for time-frequency analysis, feature extraction, and spectral processing.
• Matplotlib: It provides many visualization tools, including scatter plots, line plots, and
3D graphics, with support for customization of every aspect of the visualization.
• Spyder: It provides a powerful and flexible IDE for writing and working with code,
including support for code completion and debugging tools. A short sketch of how a few
of these libraries fit together is shown after this list.
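To illustrate how a few of these tools work together in this project, the sketch below uses Pathlib to collect WAV files and Librosa with NumPy to load them as waveforms. The directory path, file pattern, and sampling rate are illustrative assumptions, not the project's exact configuration.

# Illustrative sketch only: the directory path and sampling rate are assumptions.
from pathlib import Path
import librosa
import numpy as np

def load_waveforms(audio_dir="res/audios", sr=16000):
    """Load every WAV file under audio_dir and return the resampled waveforms."""
    waveforms = []
    for wav_path in sorted(Path(audio_dir).glob("*.wav")):
        y, _ = librosa.load(wav_path, sr=sr)  # resample to a common rate
        waveforms.append(np.asarray(y))
    return waveforms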
Figure 3.2.1 represents the use case diagram of SER, where the user provides audio input
to the algorithm to generate the identified output. The use cases and actors are shown; each
use case is drawn as an ellipse, namely voice input, pre-processing, feature learning,
and graphical result.
In the figure, the User initiates the process by speaking and recording their speech
data. The System then analyzes the speech data, extracts features, and classifies emotions.
Finally, the System returns the emotion result to the User. This diagram shows the overall
flow of the SER process, with arrows indicating the direction of communication between
the User and the System.
In the Figure 3.2.2 sequence diagram, the audio data is first recorded using an audio
recording device, such as a microphone. The audio data is then passed to a Pre-processing
object, where any background noise is removed and the data is filtered and normalized. The
pre-processed audio data is then passed to a Feature Extraction object, where features such
as spectral characteristics, energy, and pitch are extracted. Finally, the extracted features
are passed to an Emotion Classification object, which classifies the speaker's emotion.
The sequence diagram in Figure 3.2.3 shows the flow of messages between the objects
involved in the audio preprocessing stage before speech emotion recognition. It highlights
the importance of preprocessing audio data before performing SER, and shows the
dependencies between the pre-processing, feature extraction, and emotion classification
steps.
The trained model is then used to extract learned features from the audio data.
Finally, the extracted features are passed to an "Emotion Classification" object, which
classifies the speaker's emotion.
The sequence diagram in Figure 3.2.4 shows the flow of messages between the objects
involved in the feature learning process in SER. It highlights the significance of feature
learning in accurately classifying the speaker's emotion, and shows the dependencies
between the pre-processing, feature extraction, feature learning, and emotion classification
steps.
In the sequence diagram of Figure 3.2.4, the Output object produces a classification result
indicating the speaker's emotion. The classification result is then passed to an Emotion
Display object, which displays the result to the user through a graphical user interface.
The user then interacts with the system by providing feedback or performing
an action. The user's action is captured by the "User Interface" object, which then performs
the corresponding action.
The diagram in Figure 3.2.5 shows the flow of messages between the objects
involved in displaying the result of SER to the user. It highlights the importance of
providing a clear and intuitive user interface.
This diagram shows the different tasks that take place during the SER process and
the flow between them. The diagram also shows that the result of the SER process is a
single output: the emotion classification result.
The class diagram shown in Figure 3.2.4 depicts the different tasks that take place
during the SER process and the flow between them. It emphasizes the sequence of actions
that occur during the SER process, and the dependencies between
them.
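The flow captured by these diagrams (record, pre-process, extract features, classify, display) can be summarized as a small pipeline. The sketch below is only a schematic; the function names are placeholders for the modules described in later chapters, not the project's actual API.

# Schematic sketch of the SER flow shown in the diagrams; function names are placeholders.
def ser_pipeline(record_audio, preprocess, extract_features, classify, display):
    audio = record_audio()              # user speaks, system records the signal
    clean = preprocess(audio)           # remove noise, filter and normalize
    features = extract_features(clean)  # spectral characteristics, energy, pitch
    emotion = classify(features)        # CNN-LSTM emotion classification
    display(emotion)                    # graphical result shown to the user
    return emotion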
• Accurate analysis
The proposed system should accurately analyze the given dataset and predict the correct
emotion for each speech sample with no, or minimal, fatal errors.
• Data Pre-processing
Data preprocessing refers to the steps taken to clean, transform, and prepare raw data
for analysis. In functional requirements, data preprocessing may be included as a necessary
step to achieve specific software functionalities.
• Training
Learning how to perform the required task based on the inputs given through the
dataset. In the functional requirements, training refers to training a machine learning model
to do a specific task or accomplish a specific goal.
• Forecasting
Making predictions of the future based on past and present data by analysis of trends.
It refers to the capacity of a software system to make predictions or forecasts based on
historical data or other relevant factors.
• Performance
The training time should be significantly reduced by using parallel processing of the
distributed dataset.
• Portability
The system must be possible to run on many systems without doing a lot of changes.
• User Friendly
As the main goal is to provide an end-to-end user interface, it should be easy for users
to use the WebApp to record their speech and view the predicted emotion.
• Reliability
The system has to produce fast and accurate results.
• Emotion recognition
Emotion recognition refers to the capacity of a software system to detect and identify
emotions or affective states in speech, text, images, or other types of input. In speech
emotion recognition (SER), emotion recognition refers specifically to the capacity to detect
and identify emotions or affective states in speech signals.
• Integration
Integration refers to the capacity of a software system to work together with other
software or hardware systems. In speech emotion recognition (SER), integration might
refer to the capacity of a SER system to integrate with call center software, audio processing
equipment that are used in the domain or context where SER system will be deployed.
Henry Gantt developed a visual representation of a project schedule called the Gantt
chart. This uses bars to show the start and end dates of terminal and summary elements
within the project's work breakdown structure. Simply put, Gantt charts display the timeline
of a project's tasks and milestones.
The following figure 4.1 is the Gantt chart of our project “Speech Based Emotion
Recognition Using 2D CNN LSTM Networks”
Data Collection: Speech emotion recognition (SER) using CNN-LSTM typically requires
a dataset of audio recordings that are labelled with the corresponding emotions expressed
by the speakers. The dataset should be diverse in terms of speakers, languages, accents, and
emotions to ensure that the trained model can generalize well to unseen data.
Pre-processing: The pre-processing step involves converting the raw audio recordings into
a suitable format that can be fed into the CNN-LSTM model for training and evaluation.
The first step is extracting relevant features from the raw audio recordings, which is a
critical step in SER. The next step is normalization, where the input features are scaled to
a common range.
Training: The training data consists of audio recordings that are labeled with the emotions
expressed by the speakers. The goal of training is to build a model that learns to recognize
the patterns in the audio data associated with different emotions. The training data is
typically pre-processed to extract relevant features. After pre-processing, the data is
partitioned into separate training and validation sets.
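As a concrete illustration of the normalization and partitioning steps described above, the sketch below scales a feature array to a common range and splits it into training and validation sets with scikit-learn. The variable names, split ratio, and min-max scaling choice are assumptions for illustration.

# Sketch of feature scaling and train/validation partitioning (ratio and scaling assumed).
import numpy as np
from sklearn.model_selection import train_test_split

def prepare_splits(features, labels, val_fraction=0.2, seed=42):
    features = np.asarray(features, dtype=np.float32)
    # Min-max scale all feature values into the range [0, 1]
    features = (features - features.min()) / (features.max() - features.min() + 1e-8)
    # Hold out a validation set for monitoring training
    return train_test_split(features, labels, test_size=val_fraction, random_state=seed)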
• Data Preprocessing
This module involves pre-processing the raw audio data to extract relevant features,
such as Mel-frequency cepstral coefficients (MFCCs), which are commonly used in speech
processing tasks. The pre-processed data is then split into training and validation sets.
• Output Layer
The output of the LSTM is fed into a fully connected layer, which maps the input to the
corresponding emotion labels. The output layer typically uses a softmax activation function
to output the probabilities of the input belonging to each of the possible emotion categories.
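A minimal Keras sketch of this output stage is given below: the final LSTM state is passed to a fully connected layer with a softmax activation over the emotion classes. The layer sizes are assumed for illustration and are not necessarily the project's exact configuration.

# Minimal sketch of the LSTM output fed into a fully connected softmax layer (sizes assumed).
from tensorflow.keras import layers, models

def lstm_softmax_head(timesteps, feature_dim, num_classes):
    model = models.Sequential()
    model.add(layers.LSTM(256, input_shape=(timesteps, feature_dim)))  # sequence modelling
    model.add(layers.Dense(num_classes, activation='softmax'))         # emotion probabilities
    return model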
Frontend design includes a homepage which contains start and stop recording buttons.
When the recording is ended, a new predict emotion button appears and the recorded audio
file is saved to storage. The model predicts the emotion from the stored audio, and the
frontend displays the corresponding emotion along with its label.
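A minimal Streamlit sketch of this homepage flow is shown below, wiring the recording and prediction routines (defined in Recorder.py and Prediction.py below) to buttons. The button labels and layout are simplified assumptions.

# Simplified sketch of the frontend; assumes Recorder.recording() saves the WAV file
# and Prediction.prediction() displays the predicted emotion.
import streamlit as st
import Recorder
import Prediction

st.title("Speech Based Emotion Recognition")

if st.button("Start Recording"):
    Recorder.recording()      # records audio and writes it to storage

if st.button("Predict Emotion"):
    Prediction.prediction()   # loads the stored audio and shows the emotion label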
• Outputs
o Trained CNN-LSTM model
• Load the speech dataset and extract features (e.g., Mel frequency cepstral coefficients,
log-mel spectrogram, etc.) from the audio files.
• Partition the dataset into distinct training, validation, and test sets.
• Define the CNN-LSTM model architecture:
o Define the CNN layers to extract local features from the audio spectrograms.
o Define the LSTM layer to model the temporal dependencies in the feature sequence.
o Define the output layer with a softmax activation to classify the emotion into one of
the predefined categories (e.g., happy, sad, angry, etc.).
dataload2d.py
import librosa
import pathlib
import numpy as np
from sklearn.model_selection import train_test_split

def get_log_mel_spectrogram(path, n_fft, hop_length, n_mels):
    # Load at 16 kHz and clip to 8 seconds (16000 * 8 = 128000 samples)
    y, sr = librosa.load(path, sr=16000, duration=8)
    file_length = np.size(y)
    if file_length != 128000:
        # Zero-pad shorter recordings to the fixed length (padding strategy assumed)
        y = np.pad(y, (0, 128000 - file_length))
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    # Convert the power spectrogram to log (dB) scale
    return librosa.power_to_db(mel)
cnn2d.py
from tensorflow import keras
from tensorflow.keras import layers

def model2d(input_shape, num_classes):
    model = keras.Sequential(name='model2d')
    # LFLB1: local feature learning block (Conv2D -> BatchNorm -> ELU -> MaxPool)
    model.add(layers.Conv2D(filters=64, kernel_size=3, strides=1,
                            padding='same', input_shape=input_shape))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('elu'))
    model.add(layers.MaxPooling2D(pool_size=2, strides=2))
    # LFLB2
    model.add(layers.Conv2D(filters=64, kernel_size=3, strides=1, padding='same'))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('elu'))
    model.add(layers.MaxPooling2D(pool_size=4, strides=4))
    # LFLB3
    model.add(layers.Conv2D(filters=128, kernel_size=3, strides=1, padding='same'))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('elu'))
    model.add(layers.MaxPooling2D(pool_size=4, strides=4))
    # LFLB4 (assumed to follow the same pattern as the blocks above)
    model.add(layers.Conv2D(filters=128, kernel_size=3, strides=1, padding='same'))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('elu'))
    model.add(layers.MaxPooling2D(pool_size=4, strides=4))
    # Flatten the remaining frequency/channel axes into a feature sequence
    model.add(layers.Reshape((-1, 128)))
    # LSTM for sequence modelling (256 units assumed), then softmax output layer
    model.add(layers.LSTM(256))
    model.add(layers.Dense(num_classes, activation='softmax'))
    return model
train2d.py
import cnn2d
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import matplotlib.pyplot as plt

def train(train_x, train_y, validation_x, validation_y):
    model = cnn2d.model2d(input_shape=(128, 251, 1), num_classes=7)
    # Stop training once validation loss stops improving (patience value assumed)
    es = EarlyStopping(monitor='val_loss', mode='min', patience=10)
    # Keep the best weights seen so far (output path assumed)
    mc = ModelCheckpoint('model2d.h5', monitor='val_loss', save_best_only=True)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])  # one-hot labels assumed
    history = model.fit(train_x, train_y, validation_data=(validation_x, validation_y),
                        epochs=100, batch_size=32, callbacks=[es, mc])
    return model, history
test2d.py
from tensorflow.keras.models import load_model
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
import pandas as pd
import seaborn as sns

def test(test_x, test_y, x):
    # Load the trained model (path assumed) and predict emotion classes
    model = load_model('model2d.h5')
    y_pred = model.predict(test_x).argmax(axis=1)
    # Confusion matrix; test_y assumed to hold integer labels, 'x' the emotion label names
    cm = confusion_matrix(test_y, y_pred)
    sns.heatmap(pd.DataFrame(cm, index=x, columns=x), annot=True)
Recorder.py
import pyaudio
import wave
import streamlit as st

def recording():
    CHUNK = 1024
    FORMAT = pyaudio.paInt16  # paInt8
    CHANNELS = 2
    RATE = 44100  # sample rate
    WAVE_OUTPUT_FILENAME = "C:/Users/koush/.spyder-py3/Speech Emotion Recognition/res/audios/audio.wav"
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)  # buffer
    frames = []
    if 'frames' not in st.session_state:
        st.session_state.frames = ''
    flag = False
    if st.button("Stop Recording"):
        flag = True
        st.markdown('''<p style='text-align: center; color: black;'>Recorded</p>''',
                    unsafe_allow_html=True)
    while not flag:
        # Read audio from the microphone chunk by chunk
        data = stream.read(CHUNK)
        frames.append(data)
        # Stop automatically after 8 seconds (cap assumed to match the model input length)
        if len(frames) * CHUNK / RATE >= 8:
            flag = True
    stream.stop_stream()
    stream.close()
    # Write the captured frames to a WAV file for later prediction
    with wave.open(WAVE_OUTPUT_FILENAME, 'wb') as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))
    p.terminate()
Prediction.py
import dataload2d
import streamlit as st
import numpy as np
from tensorflow.keras.utils import normalize
from tensorflow.keras.models import load_model
import wave

def prediction():
    audio_file = 'C:/Users/koush/.spyder-py3/Speech Emotion Recognition/res/audios/audio.wav'
    try:
        # Open the recorded file to confirm it exists and is a readable WAV file
        with wave.open(audio_file, 'rb') as audio:
            audio_data = audio.readframes(-1)
        # Extract the log-mel spectrogram and shape it for the 2D CNN LSTM model
        log_mel_spectrogram_new = dataload2d.get_log_mel_spectrogram(audio_file,
                                                                     n_fft=2048,
                                                                     hop_length=512,
                                                                     n_mels=128)
        X_new = np.array(log_mel_spectrogram_new).reshape(-1, 128, 251, 1)
        X_new = normalize(X_new)
        model = load_model('C:/Users/koush/.spyder-py3/Speech Emotion Recognition/res/models/model2d.h5')
        emotion_pred = np.argmax(model.predict(X_new))
        # Map the class index to a name; this label order is an assumption and
        # must match the encoding used during training
        emotions = ['angry', 'bored', 'disgust', 'fearful', 'happy', 'neutral', 'sad']
        st.write('Predicted emotion: ' + emotions[emotion_pred])
    except FileNotFoundError:
        st.error('No recorded audio found. Please record your speech first.')
The test approach that we are following is the reactive approach. The main
objective of functional testing is to validate the usage of the software and ensure that it
complies with the business requirements stated by the client before the commencement
of the project.
Unit testing involves testing a single unit or module in its entirety, which includes
testing the individual interactions of multiple functions. Some of the unit testing cases are
shown in table 7.1.1 and table 7.1.2 below:
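As an example of the kind of unit test summarized in these tables, the sketch below checks that the feature-extraction unit returns a log-mel spectrogram of the expected shape; the sample file path is a placeholder.

# Example unit test for the feature-extraction unit (the sample path is a placeholder).
import dataload2d

def test_log_mel_spectrogram_shape():
    spec = dataload2d.get_log_mel_spectrogram("res/audios/audio.wav",
                                              n_fft=2048, hop_length=512, n_mels=128)
    # 128 mel bands x 251 frames for an 8-second clip sampled at 16 kHz
    assert spec.shape == (128, 251)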
The epoch vs. loss graph typically shows a downward trend in loss as the number of epochs
increases, indicating that the model is getting better at minimizing the difference between
the predicted and actual outputs. Similar to the epoch vs accuracy graph, it is common to
observe fluctuations in loss over the course of training.
Loss graph comparison between 1D and 2D CNN architectures can provide insights
into the performance of these models in speech emotion recognition tasks. 1D CNNs are
used for processing 1D sequential data such as audio signals, while 2D CNNs are used for
processing 2D spatial data such as images. In the context of speech emotion recognition,
1D CNNs are used to extract relevant features from the speech signal, while 2D CNNs can
be used to extract features from spectrograms or Mel frequency cepstral coefficients
(MFCCs). A comparison of the loss graphs can help identify which architecture performs
better in terms of convergence and accuracy, and can be used to guide the selection of the
appropriate model for the given task.
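Such a comparison can be produced directly from the Keras History objects returned by training the two models, as in the sketch below; the history_1d and history_2d names are assumptions.

# Sketch: compare validation-loss curves of the 1D and 2D CNN LSTM models.
# history_1d and history_2d are assumed to be Keras History objects returned by model.fit().
import matplotlib.pyplot as plt

def plot_loss_comparison(history_1d, history_2d):
    plt.plot(history_1d.history['val_loss'], label='1D CNN LSTM')
    plt.plot(history_2d.history['val_loss'], label='2D CNN LSTM')
    plt.xlabel('Epoch')
    plt.ylabel('Validation loss')
    plt.title('Epoch vs. loss comparison')
    plt.legend()
    plt.show()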
8.2 Snapshots
9.2 Applications
• SER can be used to monitor the emotions of individuals and provide early diagnosis
and intervention for mental health disorders such as depression and anxiety.
• SER can enhance the interaction between humans and computers by enabling
computers to understand and respond appropriately to human emotions.
• SER can be used in the development of video games, virtual reality experiences, and
other forms of interactive entertainment that respond to the emotional state of the user.
• SER can be used to improve customer service by enabling companies to analyze the
emotional state of their customers and respond accordingly.
[3] A. Jacob, “Modelling speech emotion recognition using logistic regression and decision
trees”, International Journal of Speech Technology, DOI: 10.1007/s10772-017-9457-6, 2017.
[4] P. P. Dahake, K. Shaw and P. Malathi, “Speaker dependent speech emotion recognition
using MFCC and Support Vector Machine”, in International Conference on Automatic
Control and Dynamic Optimization Techniques (ICACDOT), pp. 1080-1084, DOI:
10.1109/ICACDOT.2016.7877753, 2017.
[5] I. Shahin, A. B. Nassif and S. Hamsa, “Emotion Recognition Using Hybrid Gaussian
Mixture Model and Deep Neural Network”, IEEE Access, vol. 7, pp. 26777-26787, DOI:
10.1109/ACCESS.2019.2901352, 2019.
[6] S. Mao, D. Tao, G. Zhang, P. C. Ching and T. Lee, “Revisiting Hidden Markov Models
for Speech Emotion Recognition”, in IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 6715-6719, DOI: 10.1109/ICASSP.2019.8683172, 2019.
[8] B. T. Atmaja, K. Shirai and M. Akagi, “Speech Emotion Recognition Using Speech
Feature and Word Embedding”, in Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC), pp. 519-523, DOI:
10.1109/APSIPAASC47483.2019.9023098, 2019.