Phase 2 Report
Report
on
“SPEECH BASED EMOTION RECOGNITION USING 2D
CNN LSTM NETWORKS”
Submitted in partial fulfillment of the requirements for the award of the degree of
Bachelor of Engineering
in
Computer Science & Engineering
Submitted by
USN Name
1BI19CS078 KOUSHIK G G
1BI19CS079 LITHISH S
1BI19CS093 NAGARAJ BHAT
1BI19CS100 NIKHIL KORNAYA
USN Name
1BI19CS078 KOUSHIK G G
1BI19CS079 LITHISH S
1BI19CS093 NAGARAJ BHAT
1BI19CS100 NIKHIL KORNAYA
bonafide students of VIII semester in partial fulfillment for the award of Bachelor of
Engineering in Computer Science & Engineering of the VISVESVARAYA
TECHNOLOGICAL UNIVERSITY, Belagavi during the academic year 2022-23. It is
certified that all corrections/suggestions indicated for Internal Assessment have been
incorporated in the report deposited in the departmental library. The project report has been
approved as it satisfies the academic requirements in respect of Project work prescribed for
the said degree.
ACKNOWLEDGEMENT
The knowledge & satisfaction that accompany the successful completion of any
task would be incomplete without mention of people who made it possible, whose guidance
and encouragement crowned our effort with success. We would like to thank all and
acknowledge the help we have received to carry out this project.
We would like to convey our sincere thanks to Dr. Aswath M U, Principal, Bangalore
Institute of Technology, for being kind enough to provide the opportunity and platform to
complete and present our final year project “Speech Based Emotion Recognition Using 2D
CNN LSTM Networks”.
We would also like to thank Dr. Girija J., Professor and Head of the Department
of Computer Science and Engineering, Bangalore Institute of Technology, for her
constant encouragement and support in presenting our final year project “Speech Based
Emotion Recognition Using 2D CNN LSTM Networks”.
We are humbled to acknowledge the enthusiastic guidance of our guide,
Dr. Harish Kumar B T, for his ideas, timely suggestions, constant guidance, and
co-operation shown throughout the venture, which made this phase of the project fruitful.
We would also like to take this opportunity to thank our friends and family for their
constant support and help. We express our sincere gratitude for the friendly co-operation
shown by all the staff members of the Department of Computer Science and Engineering,
BIT.
1BI19CS078 KOUSHIK G G
1BI19CS079 LITHISH S
1BI19CS093 NAGARAJ BHAT
1BI19CS100 NIKHIL KORNAYA
ABSTRACT
Speech emotion recognition (SER) involves the identification of emotions
conveyed in spoken language through analysis of speech signals. With the growing
popularity of smart devices, SER has gained significant attention in recent years. One
approach to SER is to use deep learning models such as Convolutional Neural Networks
(CNNs) and Long Short-Term Memory (LSTM) networks. The purpose of this project is
to introduce a new methodology for performing SER using a 2D CNN-LSTM architecture.
The proposed model first uses a 2D CNN to extract the relevant features from the speech
signal, followed by an LSTM network for sequence modeling. We evaluated our model on
the Berlin Emotional Speech Database (EMO-DB), achieving state-of-the-art results. We
also compared our model's performance with other existing SER models and found that it
outperformed them. The results of our project demonstrate that
the proposed 2D CNN-LSTM architecture is an efficient method for SER and can be used
in real-world applications such as emotion recognition from voice assistants, call centers,
and customer service applications.
TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION
1.1 Overview
1.2 Objectives
1.3 Purpose, Scope, and Applicability
1.3.1 Purpose
1.3.2 Scope
1.3.3 Applicability
1.4 Organization of Report
REFERENCES
1.2 Objectives
• To accurately identify the emotions of the speaker.
• To enhance human-computer interaction.
• To detect changes in human emotional states.
1.3.2 Scope
The scope of Speech Emotion Recognition (SER) is vast, and its applications are
numerous. SER can be used in mental health assessments, speech-enabled virtual assistants,
human-robot interaction, customer service, entertainment, and more. Its potential
applications are diverse, limited only by our ability to imagine new use cases. With the
increasing availability of speech data and advances in machine learning, this scope
continues to expand.
1.3.3 Applicability
Speech Emotion Recognition has broad applicability in mental health, virtual
assistants, customer service, entertainment, education, and more. It enables machines to
detect and respond to human emotions in a more natural and empathetic way, improving
human-machine interaction and leading to new applications and innovations.
• Chapter 2 of this document describes the Literature Survey. It provides details about
the existing systems, their limitations, and the proposed system for the project.
• Chapter 4 is the Gantt Chart which is a bar chart showing the project schedule.
• Chapter 5 gives information about the system architecture, system design, interface
design and algorithm design.
• Chapter 9 describes the Conclusion, Applications, Limitations and Future Work of the
project.
LITERATURE SURVEY
2.1 Introduction
The field of Speech Emotion Recognition (SER) encompasses the creation of
computational models and algorithms that enable the identification and analysis of
emotions conveyed through spoken language by examining speech signals. SER is an
interdisciplinary field that draws upon expertise from speech processing, machine learning,
psychology, and neuroscience. The goal of SER is to enable machines to detect, understand,
and respond to human emotions expressed through speech. This can have a significant
impact on many fields, including mental health assessments, speech-enabled virtual
assistants, customer service, education, entertainment, and more. SER can help improve
mental health care by providing clinicians with a new tool for assessing and monitoring
changes in patients' emotional states. It can also enhance the user experience of virtual
assistants and chatbots by enabling them to interact with users in a more personalized and
empathetic way. Furthermore, SER can be used to develop emotion-based content filtering
and recommendation systems, creating new opportunities in the entertainment industry. As
the availability of speech data and advances in machine learning and natural language
processing techniques continue to grow, the potential of SER is vast, leading to new and
innovative applications that can improve human-machine interaction and our
understanding of human emotions.
Drawbacks of the proposed SER system include dependency on training data quality,
computational complexity, limited emotion classification, and sensitivity to noise and
variability.
Drawbacks: The study has limited evaluation metrics and uses only one dataset, leading
to potential overfitting. Comparison with other models is also insufficient, and the feature
set used is limited.
Drawbacks: The study only uses one dataset and evaluation metric, potentially limiting
generalizability. The speaker-dependent approach requires individual training, which may
not be practical for large-scale applications.
Drawbacks: The study only uses one dataset, and the evaluation metrics used are not as
comprehensive as those in other studies. The hybrid approach requires training two separate
models, which may not be practical for real-time applications.
“Revisiting Hidden Markov Models for Speech Emotion Recognition” [6]: This study
explores the use of Hidden Markov Models (HMMs) for speech emotion recognition, using
the Emotional Prosody Speech and Transcripts (EPST) database. The study uses the Mel-
frequency cepstral coefficients (MFCC) as features and trains a separate HMM model for
each emotional state. The results show that HMM achieved an overall accuracy of 74.6%
for six emotions, outperforming other machine learning algorithms such as k-Nearest
Neighbors and SVM. The study suggests the potential of HMM for speech emotion
recognition and encourages further research to improve its performance with larger datasets
and more complex models.
Drawbacks: The study only uses one dataset and evaluation metric, which limits
generalizability. The use of separate models for each emotional state may not be practical
for real-time applications. The evaluation, while reporting good accuracy, is not as
comprehensive as in studies that use multiple evaluation metrics.
Drawbacks: Some drawbacks of using speech features and word embedding for speech
emotion recognition include their limited ability to capture contextual information, the need
for domain-specific language models, and the difficulty of representing emotions that are
expressed through non-verbal cues. Additionally, these methods may not be suitable for
real-time emotion recognition applications due to their high computational cost and the
need for significant processing power.
Drawbacks: Some drawbacks of using K-Nearest Neighbor (KNN) classifiers for speech
emotion recognition include their sensitivity to irrelevant features, the need for large
amounts of labeled data, and the difficulty of optimizing the distance metric used for
classification. Additionally, KNN classifiers may not be suitable for real-time applications
due to their high computational cost and the need for significant memory resources.
Drawbacks: Some drawbacks of using Artificial Neural Networks (ANNs) and Recurrent
Neural Networks (RNNs) for speech emotion recognition include the difficulty of
interpreting the learned representations, the need for large amounts of labeled data for
training, and the tendency to overfit the training data. Additionally, RNNs may suffer from
vanishing gradients during training, which can lead to difficulties in capturing long-term
dependencies in the data.
CHAPTER 3
REQUIREMENT ENGINEERING
3.1 Software and Hardware Tools used
Some of the specific requirements of the proposed system are:
• Streamlit: It allows developers to create web apps with minimal setup, using Python
code to create interactive data visualizations, dashboards, and more.
• Pyaudio: It provides a simple and flexible interface for capturing and playing back
audio data in real time, making it well suited to speech recognition and music
processing applications.
• Wave: It supports reading and writing of uncompressed WAV files, making it a simple
and reliable tool for audio file manipulation.
• Numpy: It provides support for many mathematical operations on arrays and matrices,
including linear algebra, Fourier transforms, and random number generation.
• TensorFlow: It includes different high-level APIs for building and training neural
networks, and support for distributed training and deployment on different platforms.
• Sklearn: It provides many machine learning algorithms, evaluation metrics and data
preprocessing tools for building and evaluating predictive models.
• Pandas: It provides support for data cleaning, reshaping, and analysis, including
powerful tools for data aggregation and grouping.
• Pathlib: It provides an object-oriented interface for working with file paths and
directory operations, making it easier to write portable and platform-independent code.
• Librosa: It provides a variety of tools for manipulating audio data, including support
for time-frequency analysis, feature extraction, and spectral processing.
• Matplotlib: It provides many visualization tools, including scatter plots, line plots, and
3D graphics, with support for customization of every aspect of the visualization.
• Spyder: It provides a powerful and flexible IDE for writing and working with code,
including support for code completion and debugging tools. A short sketch of how a few
of these libraries fit together is shown after this list.
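To illustrate how a few of these tools work together in this project, the sketch below uses Pathlib to collect WAV files and Librosa with NumPy to load them as waveforms. The directory path, file pattern, and sampling rate are illustrative assumptions, not the project's exact configuration.

# Illustrative sketch only: the directory path and sampling rate are assumptions.
from pathlib import Path
import librosa
import numpy as np

def load_waveforms(audio_dir="res/audios", sr=16000):
    """Load every WAV file under audio_dir and return the resampled waveforms."""
    waveforms = []
    for wav_path in sorted(Path(audio_dir).glob("*.wav")):
        y, _ = librosa.load(wav_path, sr=sr)  # resample to a common rate
        waveforms.append(np.asarray(y))
    return waveforms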
Figure 3.2.1 represents the use case diagram of SER, where the user provides audio input
to the algorithm to generate the identified output. The use cases and actors are shown; each
use case is drawn as an ellipse, namely voice input, pre-processing, feature learning,
and graphical result.
In the figure, the User initiates the process by speaking and recording their speech
data. The System then analyzes the speech data, extracts features, and classifies emotions.
Finally, the System returns the emotion result to the User. This diagram shows the overall
flow of the SER process, with arrows indicating the direction of communication between
the User and the System.
In the Figure 3.2.2 sequence diagram, the audio data is first recorded using an audio
recording device, such as a microphone. The audio data is then passed to a Pre-processing
object, where any background noise is removed and the data is filtered and normalized. The
pre-processed audio data is then passed to a Feature Extraction object, where features such
as spectral characteristics, energy, and pitch are extracted. Finally, the extracted features
are passed to an Emotion Classification object, which classifies the speaker's emotion.
The sequence diagram in Figure 3.2.3 shows the flow of messages between the objects
involved in the audio preprocessing stage before speech emotion recognition. It highlights
the importance of preprocessing audio data before performing SER, and shows the
dependencies between the pre-processing, feature extraction, and emotion classification
steps.
The trained model is then used to extract learned features from the audio data.
Finally, the extracted features are passed to an "Emotion Classification" object, which
classifies the speaker's emotion.
The sequence diagram in Figure 3.2.4 shows the flow of messages between the objects
involved in the feature learning process in SER. It highlights the significance of feature
learning in accurately classifying the speaker's emotion, and shows the dependencies
between the pre-processing, feature extraction, feature learning, and emotion classification
steps.
In the sequence diagram of Figure 3.2.4, the Output object produces a classification result
indicating the speaker's emotion. The classification result is then passed to an Emotion
Display object, which displays the result to the user through a graphical user interface.
The user then interacts with the system by providing feedback or performing
an action. The user's action is captured by the "User Interface" object, which then performs
the corresponding action.
The diagram in Figure 3.2.5 shows the flow of messages between the objects
involved in displaying the result of SER to the user. It highlights the importance of
providing a clear and intuitive user interface.
This diagram shows the different tasks that take place during the SER process and
the flow between them. The diagram also shows that the result of the SER process is a
single output: the emotion classification result.
The class diagram shown in Figure 3.2.4 depicts the different tasks that take place
during the SER process and the flow between them. It emphasizes the sequence of actions
that occur during the SER process, and the dependencies between
them.
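The flow captured by these diagrams (record, pre-process, extract features, classify, display) can be summarized as a small pipeline. The sketch below is only a schematic; the function names are placeholders for the modules described in later chapters, not the project's actual API.

# Schematic sketch of the SER flow shown in the diagrams; function names are placeholders.
def ser_pipeline(record_audio, preprocess, extract_features, classify, display):
    audio = record_audio()              # user speaks, system records the signal
    clean = preprocess(audio)           # remove noise, filter and normalize
    features = extract_features(clean)  # spectral characteristics, energy, pitch
    emotion = classify(features)        # CNN-LSTM emotion classification
    display(emotion)                    # graphical result shown to the user
    return emotion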
• Accurate analysis
The proposed system should accurately analyze the given dataset and predict the correct
emotion for each speech sample with no, or minimal, fatal errors.
• Data Pre-processing
Data preprocessing refers to the steps taken to clean, transform, and prepare raw data
for analysis. In functional requirements, data preprocessing may be included as a necessary
step to achieve specific software functionalities.
• Training
Learning how to perform the required task based on the inputs given through the
dataset. In the functional requirements, training refers to training a machine learning model
to do a specific task or accomplish a specific goal.
• Forecasting
Making predictions of the future based on past and present data by analysis of trends.
It refers to the capacity of a software system to make predictions or forecasts based on
historical data or other relevant factors.
• Performance
The training time should be significantly reduced by using parallel processing of the
distributed dataset.
• Portability
The system must be possible to run on many systems without doing a lot of changes.
• User Friendly
As the main goal is to provide an end-to-end user interface, it should be easy for users
to use the WebApp to record their speech and view the predicted emotion.
• Reliability
The system has to produce fast and accurate results.
• Emotion recognition
Emotion recognition refers to the capacity of a software system to detect and identify
emotions or affective states in speech, text, images, or other types of input. In speech
emotion recognition (SER), emotion recognition refers specifically to the capacity to detect
and identify emotions or affective states in speech signals.
• Integration
Integration refers to the capacity of a software system to work together with other
software or hardware systems. In speech emotion recognition (SER), integration might
refer to the capacity of a SER system to integrate with call center software, audio processing
equipment that are used in the domain or context where SER system will be deployed.
Henry Gantt developed a visual representation of a project schedule called the Gantt
chart. This uses bars to show the start and end dates of terminal and summary elements
within the project's work breakdown structure. Simply put, Gantt charts display the timeline
of a project's tasks and milestones.
The following figure 4.1 is the Gantt chart of our project “Speech Based Emotion
Recognition Using 2D CNN LSTM Networks”
Data Collection: Speech emotion recognition (SER) using CNN-LSTM typically requires
a dataset of audio recordings that are labelled with the corresponding emotions expressed
by the speakers. The dataset should be diverse in terms of speakers, languages, accents, and
emotions to ensure that the trained model can generalize well to unseen data.
Pre-processing: The pre-processing step involves converting the raw audio recordings into
a suitable format that can be fed into the CNN-LSTM model for training and evaluation.
The first step is extracting relevant features from the raw audio recordings, which is a
critical step in SER. The next step is normalization, where the input features are scaled to
a common range.
Training: The training data consists of audio recordings that are labeled with the emotions
expressed by the speakers. The goal of training is to build a model that learns to recognize
the patterns in the audio data associated with different emotions. The training data is
typically pre-processed to extract relevant features. After pre-processing, the data is
partitioned into separate training and validation sets.
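As a concrete illustration of the normalization and partitioning steps described above, the sketch below scales a feature array to a common range and splits it into training and validation sets with scikit-learn. The variable names, split ratio, and min-max scaling choice are assumptions for illustration.

# Sketch of feature scaling and train/validation partitioning (ratio and scaling assumed).
import numpy as np
from sklearn.model_selection import train_test_split

def prepare_splits(features, labels, val_fraction=0.2, seed=42):
    features = np.asarray(features, dtype=np.float32)
    # Min-max scale all feature values into the range [0, 1]
    features = (features - features.min()) / (features.max() - features.min() + 1e-8)
    # Hold out a validation set for monitoring training
    return train_test_split(features, labels, test_size=val_fraction, random_state=seed)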
• Data Preprocessing
This module involves pre-processing the raw audio data to extract relevant features,
such as Mel-frequency cepstral coefficients (MFCCs), which are commonly used in speech
processing tasks. The pre-processed data is then split into training and validation sets.
• Output Layer
The output of the LSTM is fed into a fully connected layer, which maps the input to the
corresponding emotion labels. The output layer typically uses a softmax activation function
to output the probabilities of the input belonging to each of the possible emotion categories.
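A minimal Keras sketch of this output stage is given below: the final LSTM state is passed to a fully connected layer with a softmax activation over the emotion classes. The layer sizes are assumed for illustration and are not necessarily the project's exact configuration.

# Minimal sketch of the LSTM output fed into a fully connected softmax layer (sizes assumed).
from tensorflow.keras import layers, models

def lstm_softmax_head(timesteps, feature_dim, num_classes):
    model = models.Sequential()
    model.add(layers.LSTM(256, input_shape=(timesteps, feature_dim)))  # sequence modelling
    model.add(layers.Dense(num_classes, activation='softmax'))         # emotion probabilities
    return model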
Frontend design includes a homepage which contains start and stop recording buttons.
When the recording is ended, a new predict emotion button appears and the recorded audio
file is saved to storage. The model predicts the emotion from the stored audio, and the
frontend displays the corresponding emotion along with its label.
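A minimal Streamlit sketch of this homepage flow is shown below, wiring the recording and prediction routines (defined in Recorder.py and Prediction.py below) to buttons. The button labels and layout are simplified assumptions.

# Simplified sketch of the frontend; assumes Recorder.recording() saves the WAV file
# and Prediction.prediction() displays the predicted emotion.
import streamlit as st
import Recorder
import Prediction

st.title("Speech Based Emotion Recognition")

if st.button("Start Recording"):
    Recorder.recording()      # records audio and writes it to storage

if st.button("Predict Emotion"):
    Prediction.prediction()   # loads the stored audio and shows the emotion label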
• Outputs
o Trained CNN-LSTM model
• Load the speech dataset and extract features (e.g., Mel frequency cepstral coefficients,
log-mel spectrogram, etc.) from the audio files.
• Partition the dataset into distinct training, validation, and test sets.
• Define the CNN-LSTM model architecture:
o Define the CNN layers to extract local features from the audio spectrograms.
o Define the LSTM layer to model the temporal dependencies in the feature sequence.
o Define the output layer with a softmax activation to classify the emotion into one of
the predefined categories (e.g., happy, sad, angry, etc.).
dataload2d.py
import librosa
import pathlib
import numpy as np
from sklearn.model_selection import train_test_split

def get_log_mel_spectrogram(path, n_fft, hop_length, n_mels):
    # Load at 16 kHz and clip to 8 seconds (16000 * 8 = 128000 samples)
    y, sr = librosa.load(path, sr=16000, duration=8)
    file_length = np.size(y)
    if file_length != 128000:
        # Zero-pad shorter recordings to the fixed length (padding strategy assumed)
        y = np.pad(y, (0, 128000 - file_length))
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    # Convert the power spectrogram to log (dB) scale
    return librosa.power_to_db(mel)
cnn2d.py
from tensorflow import keras
from tensorflow.keras import layers

def model2d(input_shape, num_classes):
    model = keras.Sequential(name='model2d')
    # LFLB1: local feature learning block (Conv2D -> BatchNorm -> ELU -> MaxPool)
    model.add(layers.Conv2D(filters=64, kernel_size=3, strides=1,
                            padding='same', input_shape=input_shape))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('elu'))
    model.add(layers.MaxPooling2D(pool_size=2, strides=2))
    # LFLB2
    model.add(layers.Conv2D(filters=64, kernel_size=3, strides=1, padding='same'))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('elu'))
    model.add(layers.MaxPooling2D(pool_size=4, strides=4))
    # LFLB3
    model.add(layers.Conv2D(filters=128, kernel_size=3, strides=1, padding='same'))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('elu'))
    model.add(layers.MaxPooling2D(pool_size=4, strides=4))
    # LFLB4 (assumed to follow the same pattern as the blocks above)
    model.add(layers.Conv2D(filters=128, kernel_size=3, strides=1, padding='same'))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('elu'))
    model.add(layers.MaxPooling2D(pool_size=4, strides=4))
    # Flatten the remaining frequency/channel axes into a feature sequence
    model.add(layers.Reshape((-1, 128)))
    # LSTM for sequence modelling (256 units assumed), then softmax output layer
    model.add(layers.LSTM(256))
    model.add(layers.Dense(num_classes, activation='softmax'))
    return model
train2d.py
import cnn2d
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import matplotlib.pyplot as plt

def train(train_x, train_y, validation_x, validation_y):
    model = cnn2d.model2d(input_shape=(128, 251, 1), num_classes=7)
    # Stop training once validation loss stops improving (patience value assumed)
    es = EarlyStopping(monitor='val_loss', mode='min', patience=10)
    # Keep the best weights seen so far (output path assumed)
    mc = ModelCheckpoint('model2d.h5', monitor='val_loss', save_best_only=True)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])  # one-hot labels assumed
    history = model.fit(train_x, train_y, validation_data=(validation_x, validation_y),
                        epochs=100, batch_size=32, callbacks=[es, mc])
    return model, history
test2d.py
from tensorflow.keras.models import load_model
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
import pandas as pd
import seaborn as sns

def test(test_x, test_y, x):
    # Load the trained model (path assumed) and predict emotion classes
    model = load_model('model2d.h5')
    y_pred = model.predict(test_x).argmax(axis=1)
    # Confusion matrix; test_y assumed to hold integer labels, 'x' the emotion label names
    cm = confusion_matrix(test_y, y_pred)
    sns.heatmap(pd.DataFrame(cm, index=x, columns=x), annot=True)
Recorder.py
import pyaudio
import wave
import streamlit as st

def recording():
    CHUNK = 1024
    FORMAT = pyaudio.paInt16  # paInt8
    CHANNELS = 2
    RATE = 44100  # sample rate
    WAVE_OUTPUT_FILENAME = "C:/Users/koush/.spyder-py3/Speech Emotion Recognition/res/audios/audio.wav"
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)  # buffer
    frames = []
    if 'frames' not in st.session_state:
        st.session_state.frames = ''
    flag = False
    if st.button("Stop Recording"):
        flag = True
        st.markdown('''<p style='text-align: center; color: black;'>Recorded</p>''',
                    unsafe_allow_html=True)
    while not flag:
        # Read audio from the microphone chunk by chunk
        data = stream.read(CHUNK)
        frames.append(data)
        # Stop automatically after 8 seconds (cap assumed to match the model input length)
        if len(frames) * CHUNK / RATE >= 8:
            flag = True
    stream.stop_stream()
    stream.close()
    # Write the captured frames to a WAV file for later prediction
    with wave.open(WAVE_OUTPUT_FILENAME, 'wb') as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))
    p.terminate()
Prediction.py
import dataload2d
import streamlit as st
import numpy as np
from tensorflow.keras.utils import normalize
from tensorflow.keras.models import load_model
import wave

def prediction():
    audio_file = 'C:/Users/koush/.spyder-py3/Speech Emotion Recognition/res/audios/audio.wav'
    try:
        # Open the recorded file to confirm it exists and is a readable WAV file
        with wave.open(audio_file, 'rb') as audio:
            audio_data = audio.readframes(-1)
        # Extract the log-mel spectrogram and shape it for the 2D CNN LSTM model
        log_mel_spectrogram_new = dataload2d.get_log_mel_spectrogram(audio_file,
                                                                     n_fft=2048,
                                                                     hop_length=512,
                                                                     n_mels=128)
        X_new = np.array(log_mel_spectrogram_new).reshape(-1, 128, 251, 1)
        X_new = normalize(X_new)
        model = load_model('C:/Users/koush/.spyder-py3/Speech Emotion Recognition/res/models/model2d.h5')
        emotion_pred = np.argmax(model.predict(X_new))
        # Map the class index to a name; this label order is an assumption and
        # must match the encoding used during training
        emotions = ['angry', 'bored', 'disgust', 'fearful', 'happy', 'neutral', 'sad']
        st.write('Predicted emotion: ' + emotions[emotion_pred])
    except FileNotFoundError:
        st.error('No recorded audio found. Please record your speech first.')
The test approach that we are following is the reactive approach. The main
objective of functional testing is to validate the usage of the software and ensure that it
complies with the business requirements stated by the client before the commencement
of the project.
Unit testing involves testing a single unit or module in its entirety, which includes
testing the individual interactions of multiple functions. Some of the unit testing cases are
shown in table 7.1.1 and table 7.1.2 below:
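As an example of the kind of unit test summarized in these tables, the sketch below checks that the feature-extraction unit returns a log-mel spectrogram of the expected shape; the sample file path is a placeholder.

# Example unit test for the feature-extraction unit (the sample path is a placeholder).
import dataload2d

def test_log_mel_spectrogram_shape():
    spec = dataload2d.get_log_mel_spectrogram("res/audios/audio.wav",
                                              n_fft=2048, hop_length=512, n_mels=128)
    # 128 mel bands x 251 frames for an 8-second clip sampled at 16 kHz
    assert spec.shape == (128, 251)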
The epoch vs. loss graph typically shows a downward trend in loss as the number of epochs
increases, indicating that the model is getting better at minimizing the difference between
the predicted and actual outputs. Similar to the epoch vs accuracy graph, it is common to
observe fluctuations in loss over the course of training.
Loss graph comparison between 1D and 2D CNN architectures can provide insights
into the performance of these models in speech emotion recognition tasks. 1D CNNs are
used for processing 1D sequential data such as audio signals, while 2D CNNs are used for
processing 2D spatial data such as images. In the context of speech emotion recognition,
1D CNNs are used to extract relevant features from the speech signal, while 2D CNNs can
be used to extract features from spectrograms or Mel frequency cepstral coefficients
(MFCCs). A comparison of the loss graphs can help identify which architecture performs
better in terms of convergence and accuracy, and can be used to guide the selection of the
appropriate model for the given task.
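Such a comparison can be produced directly from the Keras History objects returned by training the two models, as in the sketch below; the history_1d and history_2d names are assumptions.

# Sketch: compare validation-loss curves of the 1D and 2D CNN LSTM models.
# history_1d and history_2d are assumed to be Keras History objects returned by model.fit().
import matplotlib.pyplot as plt

def plot_loss_comparison(history_1d, history_2d):
    plt.plot(history_1d.history['val_loss'], label='1D CNN LSTM')
    plt.plot(history_2d.history['val_loss'], label='2D CNN LSTM')
    plt.xlabel('Epoch')
    plt.ylabel('Validation loss')
    plt.title('Epoch vs. loss comparison')
    plt.legend()
    plt.show()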
8.2 Snapshots
9.2 Applications
• SER can be used to monitor the emotions of individuals and provide early diagnosis
and intervention for mental health disorders such as depression and anxiety.
• SER can enhance the interaction between humans and computers by enabling
computers to understand and respond appropriately to human emotions.
• SER can be used in the development of video games, virtual reality experiences, and
other forms of interactive entertainment that respond to the emotional state of the user.
• SER can be used to improve customer service by enabling companies to analyze the
emotional state of their customers and respond accordingly.
[3] A. Jacob, “Modelling speech emotion recognition using logistic regression and decision
trees”, International Journal of Speech Technology, DOI: 10.1007/s10772-017-9457-6, 2017.
[4] P. P. Dahake, K. Shaw and P. Malathi, “Speaker dependent speech emotion recognition
using MFCC and Support Vector Machine”, in International Conference on Automatic
Control and Dynamic Optimization Techniques (ICACDOT), pp. 1080-1084, DOI:
10.1109/ICACDOT.2016.7877753, 2017.
[5] I. Shahin, A. B. Nassif and S. Hamsa, “Emotion Recognition Using Hybrid Gaussian
Mixture Model and Deep Neural Network”, IEEE Access, vol. 7, pp. 26777-26787, DOI:
10.1109/ACCESS.2019.2901352, 2019.
[6] S. Mao, D. Tao, G. Zhang, P. C. Ching and T. Lee, “Revisiting Hidden Markov Models
for Speech Emotion Recognition”, in IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 6715-6719, DOI: 10.1109/ICASSP.2019.8683172, 2019.
[8] B. T. Atmaja, K. Shirai and M. Akagi, “Speech Emotion Recognition Using Speech
Feature and Word Embedding”, in Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC), pp. 519-523, DOI:
10.1109/APSIPAASC47483.2019.9023098, 2019.