CHAPTER 1

INTRODUCTION
Video surveillance has become central to safety and security in public
spaces, commercial premises, and private property. Surveillance systems
monitor locations continuously and capture real-time video footage, either
for immediate review or for storage and later review. This footage is a
critical resource for detecting potential security threats, including
unauthorized access, suspicious activity, and theft, and it can shorten the
time security officers need to respond to an incident. Traditional video
surveillance has a major limitation, however: it relies on human operators
to review the video feeds. Watching multiple feeds and screens over large
areas and long periods leads to operator fatigue, missed events, and slower
response times, especially where camera coverage is large and complex.
Heavy reliance on human oversight therefore makes consistent, precise, and
rapid threat detection persistently difficult.

One of the basic challenges in surveillance is anomaly detection: the
identification of unusual events that do not fit the typical pattern of
behavior. Anomalies might include unauthorized entry, unexpected crowd
behavior, or erratic actions that stand out against routine movement.
Detecting such anomalies is essential because they often signal potential
security threats. Detection that relies solely on human monitoring is
time-consuming and error-prone in dynamic, complex environments where
patterns change frequently. Demand has therefore grown rapidly for
automated surveillance systems that can flag such anomalies quickly and
enable security personnel to act with speed and reliability.

Recent advances in machine learning, particularly neural networks, offer a
means of realizing such automated anomaly detection. Neural networks learn
behavioral patterns directly from video data during training, without
explicit programming. Convolutional Neural Networks (CNNs) are strong at
extracting spatial features from individual frames, such as the shapes,
edges, and textures of objects and people. Long Short-Term Memory (LSTM)
networks are well suited to sequences in time and capture the temporal
dynamics that anomaly detection requires. Combining CNNs with LSTMs makes
it possible to build a model that understands both the spatial and the
temporal aspects of video feeds, which makes the combination effective in
surveillance applications.

This project involves the design and implementation of a real-time anomaly
detection system that uses a hybrid of CNNs and LSTMs within an autoencoder
framework. An autoencoder is a type of neural network that learns
compressed representations of data by encoding and then reconstructing the
input. Trained on video data showing normal behavior, the autoencoder
learns to reconstruct typical patterns well. When the activity in a frame
is unusual, the difference between the original frame and the reconstructed
frame increases. Frames with high reconstruction error therefore raise an
alert for security personnel, who can then investigate those frames
further.

The UCSD dataset, a collection of videos containing both normal and
anomalous behavior, is the basis for training and testing the model. The
model learns the patterns in this dataset so that it can generalize to new
footage and identify abnormal patterns with few false positives. It also
provides visual feedback by marking detected anomalies with bounding boxes,
which helps personnel assess a potential threat quickly.

This method has several benefits. Real-time processing triggers alerts
whenever anomalies are detected, so that security personnel can address the
matter immediately. In addition, reliability and a low false-alarm rate
make for a robust system that can be used in a wide range of surveillance
applications and that sharply reduces the need for human monitoring. The
project therefore has considerable potential to make video surveillance
systems more efficient, reliable, and responsive. Future refinements of the
model will target sensitivity to subtle anomalies and adaptability to
different environmental conditions, such as low light or very crowded
areas. The system thus offers a scalable, intelligent approach to automated
surveillance across a wide range of settings.

Future development of automated surveillance systems will depend
increasingly on enriching anomaly detection with more advanced neural
network architectures than the present one, including architectures that
introduce attention mechanisms and transfer learning to improve accuracy
and generalization across environments. Integrating additional sensors,
such as thermal cameras and audio, into multimodal systems may give a more
complete view of threats in low light or crowded areas. Reinforcement
learning could, in principle, refine the system further by letting it
improve from real-world feedback. Cloud and edge computing will also
matter, because they enable scalable real-time processing. Further work on
making AI models more interpretable will help explain their decision-making
process. Together, such innovations could make automated surveillance
proactive rather than reactive, helping to prevent incidents before they
occur with minimal human oversight.
CHAPTER 2
LITERATURE SURVEY
Recent advances in video anomaly detection have led to two-stage
self-training approaches that generate high-confidence pseudo-labels over
video snippets, thereby recasting weakly supervised anomaly detection as a
supervised learning problem with noisy labels. In this regard, Althubiti et
al. optimized an LSTM network for anomaly detection by tuning
hyperparameters such as the number of hidden layers and the dropout rate.
Their approach iteratively refines the anomaly classifier and shows that
recurrent architectures are a good fit for the underlying temporal
dynamics.

In parallel, multiple instance learning (MIL) has been used to improve the
quality of the training process. For instance, Kwon et al. used a graph
convolutional network to refine pseudo-labels, iteratively improving the
anomaly detection classifier. This method improves detection performance
and addresses key hyperparameters such as learning rates and graph
parameters. Feng et al. developed a multi-instance pseudo-label generation
technique to fine-tune feature encoders so that they produce task-specific
discriminative features. In that work, optimizing the encoder dimensions
and the learning schedule led to more robust model output.

Ganesh et al. introduced Multi-Sequence Learning (MSL), which adaptively
optimizes over reduced sample lengths to sharpen localization boundaries.
Their approach shows how strongly hyperparameters such as sequence length
and sampling rate affect anomaly localization accuracy. Meanwhile, Dhole et
al. proposed a convolutional spatiotemporal autoencoder for feature
extraction in video sequences. Their adjustments to the convolutional
filter size and their choice of pooling strategy improved temporal feature
extraction, which is essential for anomaly detection.

In the multimodal domain, Babanne et al. proposed integrating visual and
audio cues to strengthen video anomaly detection frameworks. Their method
required careful tuning of the parameters related to feature fusion,
demonstrating that aligning different modalities can enhance a system's
ability to identify complex anomalous events. Wu et al. built on this
foundation with HL-Net, which combines appearance, motion, and audio for a
more thorough multimodal approach and highlights the need to optimize
hyperparameters to achieve higher accuracy across different datasets.

As the field advances, it has become increasingly clear that, although
many methods model temporal relations extensively, most depend on parallel
branches that introduce more parameters and thus increase computational
cost. This is addressed in [9], where Vinayakumar et al. proposed a model
that combines CNNs with LSTMs to better capture temporal dynamics. Their
hyperparameters, the number of convolutional layers and the sequence
length, highlight the appeal of fine-tuned models that are not only more
accurate but also less computationally expensive.

Recent research has also turned towards refining anomaly detection
methodologies through new architectures and frameworks. The work by Ergen
and Kozat emphasizes unsupervised learning methods with LSTM neural
networks, tuning parameters to improve temporal feature learning.
CHAPTER 3
RESEARCH METHODOLOGY
3.1 Statement of the Problem

Traditional video surveillance systems generally rely on human
monitoring, which is highly susceptible to error and fatigue, especially
when many feeds and large areas must be watched. The resulting limitation
is that unusual events, such as unauthorized access or suspicious behavior,
are detected and identified incorrectly or too late to act on in real time.
To address this, our project employs a machine learning-based anomaly
detection system built on autoencoders and LSTMs. The system is
self-learning: it learns normal behavior patterns automatically and flags
any deviations. This allows effective, accurate, real-time detection of
security threats and significantly reduces reliance on human monitoring.
3.2 Scope of the study

This study develops an anomaly detection system for video surveillance
and security monitoring using a machine learning approach. The system
trains an autoencoder neural network to distinguish normal from anomalous
behavior in video feeds. The study covers:
Data Processing and Model Training: The study works with a large dataset
of surveillance videos and addresses frame extraction, resizing,
normalization, and feature extraction using CNNs for spatial analysis and
LSTM networks for temporal analysis.
Real-time Anomaly Detection: An autoencoder model detects anomalies in
real time with a low false-positive rate and sends alerts promptly, so that
monitoring remains efficient.
System Evaluation and Optimization: The model's performance is tested
across different environments, its parameters are tuned further, and the
results are assessed using accuracy, precision, recall, and F1 score.
The system is intended to automate and improve the efficiency of anomaly
detection in public spaces, offices, and industrial zones.
3.3 Objective of the study

Accurate Anomaly Detection
Build a system that identifies unusual activities in surveillance
footage with high accuracy, minimizing human intervention.
Reduce False Positives
Improve anomaly detection precision using Autoencoders, making
the system more reliable in diverse and complex environments.
Real-Time Detection and Visualization
Enable real-time detection and clear visualization of anomalies,
enhancing security operations with quick, actionable insights.
3.4 Realistic Constraints

Data Limitations: The system benefits greatly from a large amount of
high-quality, diverse video footage for training. The data must cover a
wide variety of cases; otherwise, the system is likely to make a serious
mistake and fail to identify an anomaly.
Resource Demands: Deployment requires powerful hardware, ample memory,
and high-speed GPUs. This load is heaviest in real-time applications.
Speed vs. Accuracy: Achieving fast, real-time detection together with
high accuracy is one of the hardest requirements to meet, because
processing video frames reliably tends to cost speed.
Error: The system may fail to detect a real threat or may flag normal
activity as a threat. Such errors occur mostly in crowded environments or
complex scenarios, where they are hard to minimize.
Varied Scenes: Changes in lighting, viewing angle, or crowd size can
confuse the system. Adapting to such diversity while remaining reliable is
difficult.
Compatibility: The system has to be installed within existing security
infrastructure. It should integrate smoothly, without much disruption to
the present layout.
Cost and Scaling: Covering large areas is expensive, and scaling up to
multiple locations without consuming a lot of resources is a major
challenge.
3.5 Engineering Standards

Data security and privacy:
The video data generated and processed will adhere to applicable
international standards for data security and privacy, so that sensitive
information is not exposed to unauthorized parties or breaches.
Software quality assurance:
The software adheres to international software quality standards, with a
focus on attributes such as reliability, performance efficiency,
maintainability, and usability, so that the system functions well in
real-time environments.
Design of the Machine Learning Model:
The design follows best practices for ethical machine learning, including
fairness, transparency, and accountability, during both training and
deployment of the anomaly detection models.
Video Compression and Transmission:
Video compression must be efficient enough that data can be transmitted in
real time without a loss of quality that would affect video processing and
monitoring.
CHAPTER 4
DESIGN AND METHODOLOGY
4.1 Theoretical Analysis

Concept of Anomaly Detection: Theoretical foundations of anomaly
detection are based on identifying events or behaviors that deviate
significantly from what is considered normal. The study leverages machine
learning models to learn and recognize these patterns effectively.
Machine Learning Techniques: The use of Convolutional Neural
Networks (CNNs) and Long Short-Term Memory (LSTM) networks
enables the system to extract spatial and temporal features from video
frames. This dual approach ensures a comprehensive understanding of both
the visual and sequential aspects of the data.
Autoencoder-Based Framework: Autoencoders are utilized to
compress and reconstruct input data. The model's inability to perfectly
reconstruct anomalous events, reflected by high reconstruction errors, forms
the basis for anomaly detection.
4.1.1 Module
The system consists of key modules: data preprocessing, feature
extraction, model training, anomaly detection, and visualization.
Data Preprocessing Module: Extracts and processes video frames
for uniformity and normalization.
Feature Extraction Module: Utilizes Convolutional Neural
Networks (CNNs) to extract spatial features and LSTM networks for
analyzing temporal sequences.
Model Training and Anomaly Detection: Autoencoders train on
normal behavior to identify unusual events through reconstruction errors.
Visualization Module: Highlights detected anomalies using
bounding boxes and presents clear alerts.
4.1.2 Methodology
The methodology for this project involves several critical stages, as
represented in the block diagram. The entire process can be broken down
into the following detailed steps, starting from the input video dataset to the
final output of anomaly detection:
A. Input Video Dataset (UCSD Dataset)
The input to the system consists of videos from the UCSD dataset.
This dataset includes surveillance videos recorded in various environments,
containing both normal and anomalous behaviours.
Table 4.1.2 (a): Details of UCSD Pedestrian Dataset (Ped1 and Ped2)
Dataset     No. of Videos  No. of Videos  Resolution  Anomalous Events
            (Training)     (Testing)
UCSD Ped1   34             36             238 × 158   People walking outside designated paths, bikes, cars, etc.
UCSD Ped2   16             12             360 × 240   Similar anomalies, such as non-pedestrian objects on walkways.
This table summarizes the UCSD Ped1 and Ped2 datasets, detailing
the number of training and testing videos, resolutions, and types of
anomalous events. Anomalies include people walking outside designated
paths, vehicles, and non-pedestrian objects on walkways.
B. Convert Videos to Frames (Frame Extraction)
The videos are split into individual frames. Each frame is treated as a
separate image for further processing.
Reason: Video anomaly detection works at the frame level, as processing
each frame individually allows the model to detect sudden irregularities.
Table 4.1.2 (b): AUC Comparison Across Datasets for Different Models
Dataset        Our Model (AUC)  STG-NF (AUC)  Jigsaw (AUC)
UCSD Ped2      99.7%            93.07%        98.88%
CUHK Avenue    92.8%            60.90%        91.41%
ShanghaiTech   87.72%           85.93%        84.26%
UBnormal       69.88%           71.78%        55.57%
This table compares the AUC of three models, the proposed model (Our
Model), STG-NF, and Jigsaw, across four datasets: UCSD Ped2, CUHK Avenue,
ShanghaiTech, and UBnormal. Our Model achieves the highest AUC on three of
the four datasets, notably 99.7% on UCSD Ped2 and 92.8% on CUHK Avenue;
STG-NF scores higher on UBnormal (71.78% against 69.88%).
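For readers unfamiliar with the metric, the sketch below shows one common way to compute a frame-level AUC from anomaly scores and ground-truth labels. It is an illustration only: the function name `roc_auc` and the example scores are hypothetical and are not taken from the evaluation reported in the table.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as the probability that a random anomalous frame scores higher
    than a random normal frame (ties count as one half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]       # scores of anomalous frames
    neg = scores[~labels]      # scores of normal frames
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one anomalous and one normal frame")
    diff = pos[:, None] - neg[None, :]   # every anomalous/normal pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (pos.size * neg.size)

# Hypothetical anomaly scores for six frames (1 = anomalous, 0 = normal).
scores = [0.10, 0.40, 0.15, 0.35, 0.30, 0.90]
labels = [0, 0, 0, 1, 0, 1]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

In this toy case seven of the eight anomalous/normal pairs are ranked correctly, so the AUC is 0.875.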
C. Preprocessing
Resize: The extracted frames are resized to a standard dimension (e.g.,
128 × 128 pixels), which ensures that all frames are uniform in size.
Normalization: Pixel values are normalized to a range between 0 and 1
to speed up convergence during model training.
\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{4.1} \]
where x is the original pixel value and \(x'\) is the normalized value.
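Equation (4.1) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming each frame is normalized with its own minimum and maximum; the function name `min_max_normalize` and the guard for constant frames are assumptions, not part of the original pipeline.

```python
import numpy as np

def min_max_normalize(frame):
    """Scale pixel values to [0, 1] following Equation (4.1)."""
    frame = np.asarray(frame, dtype=np.float32)
    lo, hi = frame.min(), frame.max()
    if hi == lo:
        # Constant frame: avoid division by zero (assumed behaviour).
        return np.zeros_like(frame)
    return (frame - lo) / (hi - lo)

# A tiny 2 x 2 grayscale "frame" with 8-bit pixel values.
frame = np.array([[0, 51], [102, 255]], dtype=np.uint8)
normalized = min_max_normalize(frame)
print(normalized)
```

For 8-bit images whose values span the full 0 to 255 range, this is equivalent to dividing by 255.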
Data Augmentation: Various transformations such as flipping, rotating,
and cropping are applied to the frames to create variations in the training
data. This helps the model generalize better.
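The transformations named above can be sketched with NumPy alone. The example assumes grayscale frames, rotations in multiples of 90 degrees, and a crop size of 112 pixels; these choices and the function name `augment` are illustrative assumptions rather than the settings used in this project.

```python
import numpy as np

def augment(frame, rng, crop=112):
    """Return a randomly flipped, rotated, and cropped copy of a frame."""
    out = frame
    if rng.random() < 0.5:
        out = np.fliplr(out)                        # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotate by 0/90/180/270 degrees
    h, w = out.shape[:2]
    top = int(rng.integers(0, h - crop + 1))        # random crop origin
    left = int(rng.integers(0, w - crop + 1))
    return out[top:top + crop, left:left + crop].copy()

rng = np.random.default_rng(0)
frame = rng.random((128, 128)).astype(np.float32)   # stand-in for a resized frame
augmented = augment(frame, rng)
print(augmented.shape)
```

Because every output is built only from pixels of the input frame, the augmented frames stay within the normalized value range.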
D. Feature Extraction
Convolutional Neural Networks (CNN): CNNs are applied to the
frames to extract spatial features. The layers of CNNs perform convolution
operations to detect patterns such as edges, corners, and textures in the
images.
Layers:
• Convolution Layer: Detects low-level features using filters.
• Activation Function (ReLU): Introduces non-linearity into the
model.
• Pooling Layer: Reduces the spatial dimensions of the feature maps.
\[ f(p, q) = \sum_{a} \sum_{b} \beta(a, b) \, \gamma(p + a, q + b) \tag{4.2} \]
where 𝛽 is the filter/kernel, 𝛾 is the input image, and the sums run over the kernel indices a and b.
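The three layer types listed above can be illustrated with a small NumPy sketch. It is deliberately simplified: a single hand-written kernel, no padding or stride, and 2 × 2 max pooling. The function names and the toy image are assumptions made for illustration; a real model would use learned filters from a deep learning library.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Equation (4.2): out[p, q] = sum_a sum_b kernel[a, b] * image[p + a, q + b]."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for p in range(out_h):
        for q in range(out_w):
            out[p, q] = np.sum(kernel * image[p:p + kh, q:q + kw])
    return out

def relu(x):
    """ReLU activation: negative responses are set to zero."""
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    """2 x 2 max pooling; a trailing odd row or column is dropped."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Toy image with a vertical edge: left half dark, right half bright.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0]])     # responds to dark-to-bright transitions

response = conv2d_valid(image, kernel)
feature_map = max_pool2x2(relu(response))
print(feature_map)
```

The convolution responds only at the edge, and pooling keeps that response while halving the spatial size of the feature map.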

Principal Component Analysis (PCA): PCA is applied to reduce the
dimensionality of the extracted features while preserving most of the
variance. This helps reduce the computational load and prevents overfitting.
Z = XW (4.3)
where X is the data matrix, and W is the matrix of eigenvectors.
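Equation (4.3) can be sketched as follows. The example assumes that X is mean-centred before projection and that W holds the leading eigenvectors of the covariance matrix; the function name `pca_project` and the synthetic feature matrix are illustrative assumptions.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest ones
    W = eigvecs[:, order]                             # matrix of eigenvectors
    explained = eigvals[order].sum() / eigvals.sum()  # fraction of variance kept
    return X_centered @ W, W, explained               # Z = XW

# Synthetic 8-dimensional features that mostly vary along 2 directions.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
mix = rng.normal(size=(2, 8))
X = base @ mix + 0.01 * rng.normal(size=(200, 8))

Z, W, explained = pca_project(X, n_components=2)
print(Z.shape, f"variance kept: {explained:.4f}")
```

The fraction of variance kept is a practical guide for choosing the number of components.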
E. Model Training
Autoencoder: An autoencoder is used to learn a compressed
representation of the normal data. It consists of two main parts:
• Encoder: Compresses the input into a latent space representation.
• Decoder: Reconstructs the input from the compressed data.
\[ L = \lVert A - \bar{A} \rVert \tag{4.4} \]
where A is the input and \(\bar{A}\) is the reconstructed output. Training
minimizes the loss L, which measures the reconstruction error.
Table 4.1.2 (c): Average Reconstruction Error Comparison for Normal and
Anomalous Events
Dataset        Average Reconstruction Error (Normal)  Average Reconstruction Error (Anomalous)
ShanghaiTech   0.0032                                 0.0467
UCSD Ped2      0.0025                                 0.0354
CUHK Avenue    0.0048                                 0.0491
This table shows the average reconstruction error for normal and
anomalous events across three datasets: ShanghaiTech, UCSD Ped2, and
CUHK Avenue. Anomalous events consistently have a higher
reconstruction error, indicating the model's effectiveness in distinguishing
between normal and abnormal behaviour.
LSTM (Long Short-Term Memory): LSTM networks are used to
capture temporal patterns between frames. This is crucial for video data as
anomalies may occur across consecutive frames.
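\[ a_t = \sigma(\omega_a \cdot [g_{t-1}, x_t] + p_a) \tag{4.5} \]
\[ b_t = \sigma(\omega_b \cdot [g_{t-1}, x_t] + p_b) \tag{4.6} \]
\[ \tilde{C}_t = \tanh(\omega_C \cdot [g_{t-1}, x_t] + p_C) \tag{4.7} \]
\[ C_t = a_t * C_{t-1} + b_t * \tilde{C}_t \tag{4.8} \]
where \(a_t\) is the forget gate, \(b_t\) is the input gate, \(\tilde{C}_t\) is the
candidate cell state, and \(C_t\) is the updated cell state; \(g_{t-1}\) is the
previous hidden state, \(x_t\) is the current input, and \(\omega\) and \(p\)
denote the weights and biases of each gate.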
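Equations (4.5) to (4.8) can be traced with a short NumPy sketch of a single time step. It covers only the cell-state update given above; the output gate and the hidden-state update of a full LSTM are omitted because the text does not list them. The parameter names (`w_a`, `p_a`, and so on), the sizes, and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_state_update(x_t, g_prev, c_prev, params):
    """One cell-state update following Equations (4.5)-(4.8)."""
    concat = np.concatenate([g_prev, x_t])                      # [g_{t-1}, x_t]
    a_t = sigmoid(params["w_a"] @ concat + params["p_a"])       # forget gate (4.5)
    b_t = sigmoid(params["w_b"] @ concat + params["p_b"])       # input gate (4.6)
    c_tilde = np.tanh(params["w_c"] @ concat + params["p_c"])   # candidate state (4.7)
    c_t = a_t * c_prev + b_t * c_tilde                          # new cell state (4.8)
    return c_t, a_t, b_t, c_tilde

hidden, features = 4, 3                  # illustrative sizes
rng = np.random.default_rng(0)
params = {name: rng.normal(scale=0.1, size=(hidden, hidden + features))
          for name in ("w_a", "w_b", "w_c")}
params.update({name: np.zeros(hidden) for name in ("p_a", "p_b", "p_c")})

x_t = rng.normal(size=features)          # features of the current frame
g_prev = np.zeros(hidden)                # previous hidden state
c_prev = np.zeros(hidden)                # previous cell state
c_t, a_t, b_t, c_tilde = lstm_cell_state_update(x_t, g_prev, c_prev, params)
print(c_t)
```

With a zero previous cell state, the forget-gate term vanishes and the new cell state reduces to the input gate times the candidate state.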
F. Anomaly Detection
Anomalies are detected based on the reconstruction error. Frames
with high reconstruction errors are flagged as anomalies since the model is
trained on normal data.
Threshold Setting: A threshold is set to classify frames as anomalous or
normal.
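A minimal sketch of this thresholding step is given below. It assumes a per-frame mean squared error as the reconstruction error and a threshold set at the mean plus three standard deviations of the errors on normal validation frames. Both choices, the function names, and the numbers are illustrative assumptions; the text does not specify how the threshold is chosen.

```python
import numpy as np

def reconstruction_errors(frames, reconstructions):
    """Per-frame mean squared error between frames and their reconstructions."""
    diff = np.asarray(frames, dtype=np.float64) - np.asarray(reconstructions, dtype=np.float64)
    return (diff ** 2).reshape(diff.shape[0], -1).mean(axis=1)

def fit_threshold(normal_errors, k=3.0):
    """Assumed rule: mean + k * standard deviation of errors on normal frames."""
    normal_errors = np.asarray(normal_errors, dtype=np.float64)
    return normal_errors.mean() + k * normal_errors.std()

def flag_anomalies(errors, threshold):
    """Boolean mask: True where a frame's error exceeds the threshold."""
    return np.asarray(errors) > threshold

# Toy data: three 4 x 4 frames; the second is reconstructed poorly.
frames = np.zeros((3, 4, 4))
reconstructions = frames.copy()
reconstructions[1] += 0.5

errors = reconstruction_errors(frames, reconstructions)
threshold = fit_threshold([0.0, 0.01, 0.02])   # hypothetical errors on normal frames
flags = flag_anomalies(errors, threshold)
print(errors, threshold, flags)
```

A larger k lowers the false-alarm rate at the cost of missing subtler anomalies, so the value is best tuned on validation data.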
G. Postprocessing & Visualization
The detected anomalies are visualized by marking the frames where
the anomalies occurred. These frames are then compiled into a video,
showing when the system detects abnormal behaviour.
H. Output (Final Result)
The final output is a video or series of frames showing the detected
anomalies. A report summarizing the anomalies found is also generated.
Table 4.1.2 (d): Confusion Matrix
                 Predicted Anomaly  Predicted Normal
Actual Anomaly   45                 5
Actual Normal    7                  43
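The standard evaluation metrics follow directly from the four counts in this confusion matrix. The sketch below treats "anomaly" as the positive class, so TP = 45, FN = 5, FP = 7, and TN = 43; the function name `confusion_metrics` is a hypothetical helper.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute standard metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts taken from Table 4.1.2 (d), with "anomaly" as the positive class.
metrics = confusion_metrics(tp=45, fn=5, fp=7, tn=43)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these counts the results are accuracy 0.8800, precision 0.8654, recall 0.9000, specificity 0.8600, and F1 score 0.8824.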
Fig. 4.1: Anomaly Detection Using CNN with Autoencoder
4.2 Experimental Analysis

The Video Anomaly Detection for Smart Surveillance system, based on
autoencoders, demonstrates its strength in learning from unlabelled video
datasets and detecting anomalies in them. Qualitatively, the model
identifies abnormal behaviour in real time without requiring labelled data,
which makes it suitable for dynamic environments such as public
transportation hubs, office premises, and industrial zones. This
unsupervised learning approach suits anomaly detection because it adapts to
diverse environments and scales well to datasets such as UCSD Ped2 and
ShanghaiTech.
Fig.4.2: LSTM – Autoencoder Model
The model also supports reconstruction error analysis: it compresses the
video frames and then reconstructs them, and a high reconstruction error
indicates an anomaly. A low false-positive rate is one of the significant
qualitative advantages of this model, since unnecessary alerts are very
disruptive in a surveillance system. In addition, the model tolerates
changes in illumination, crowd density, and scene complexity. It has
limitations, however, with subtle and context-dependent anomalies, such as
slightly unusual human behaviour: with a fixed reconstruction error
threshold, it may miss them in scenes that are highly complex or noisy.
Robust detection of general anomalies therefore does not guarantee good
performance in environments with rare but subtle anomalies that deviate
only slightly from the learned normal behaviour.
Table 4.2: Dataset Statistics for Training and Testing Videos
Dataset        No. of Videos  No. of Videos  Total Frames  Total Frames
               (Training)     (Testing)      (Training)    (Testing)
ShanghaiTech   330            107            274,515       42,883
UCSD Ped2      16             12             2,550         2,010
CUHK Avenue    16             21             15,328        15,324
This table summarizes the number of videos and frames used for training
and testing across three datasets: ShanghaiTech, UCSD Ped2, and CUHK
Avenue. It gives the total number of frames analysed in the training and
testing phases, which indicates the size and balance of each dataset.
4.3 Design Specifications
4.3.1 Software Requirements
Programming Language: Python
Python is chosen due to its strong ecosystem of libraries for machine
learning and image processing, as well as its ease of readability and
flexibility for rapid prototyping.
Machine Learning Libraries: TensorFlow/Keras
TensorFlow and Keras are used for building, training, and deploying
neural network models, especially deep learning architectures such as
Convolutional Neural Networks (CNNs) and autoencoders. These libraries
support complex neural network operations and allow for efficient model
optimization.
Video Processing and Computer Vision: OpenCV
OpenCV is essential for handling video inputs, including extracting
frames, resizing, normalizing, and preprocessing images from video
footage. These steps are crucial for preparing data for the neural network
models and ensuring consistency in input dimensions and formats.
Development Environment: Jupyter Notebook
Jupyter Notebook serves as the primary environment for developing
and experimenting with code, enabling easy debugging, visualization of
results, and iterative testing of model adjustments, which is critical in
machine learning workflows.
Data Management Capabilities
Managing large video datasets like the UCSD Pedestrian Dataset
requires tools capable of handling high volumes of data. These datasets are
used to train and validate the model by providing varied examples of
normal and abnormal events.
CHAPTER 5
RESULTS AND DISCUSSIONS
5.1 Results
Fig. 5.1 (a): Training & Validation Loss Vs Epochs
Fig. 5.1 (b): Training & Validation Accuracy Vs Epochs
Fig. 5.1 (c): Accuracy
Fig. 5.1 (d): Precision
Fig. 5.1 (e): Recall
Fig. 5.1 (f): Specificity
Fig. 5.1 (g): F1 Score
5.2 Accuracy
The anomaly detection system detects abnormal events in surveillance video
with high accuracy. Its performance is evaluated using precision and recall
in addition to accuracy, and it is expected to detect anomalies
efficiently at a low false-positive rate. The model applies advanced
machine learning techniques, in place of traditional approaches, for
effective and reliable detection of anomalies. Table 5.2 reports the
accuracy obtained with different batch sizes, which rises from 85.2% at a
batch size of 32 to 91.3% at a batch size of 256.
Table 5.2: Accuracy of Proposed Model Across Different Batch Sizes
Batch Size   Accuracy (%)
32           85.2
64           87.5
128          89.8
256          91.3
5.3 Suggestions and Recommendations
Hyperparameter Fine-Tuning:
Learning rates, layer configurations, and activation functions can be
tuned to improve performance, particularly in complex environments.
Multimodal Input Data:
Adding audio or other sensor data alongside the visual data could further
improve the robustness of an integrated anomaly detection system.
Data Augmentation:
More data augmentation methods can be applied during training so that the
model generalizes to varied scenarios, such as low-light or crowded scenes.
Minimize False Alarms:
Implement advanced post-processing techniques, such as noise filtering
and dynamic threshold adaptation, to reduce false alarms and enhance the
reliability of the anomaly detection system.
5.4 Conclusion
The presented anomaly detection system surpasses traditional models in
accuracy and reliability across dynamic surveillance environments. It uses
advanced machine learning techniques, combining autoencoders with CNN-LSTM
networks, to detect anomalies effectively in real time with little
latency. As a result, the system can operate largely independently, which
minimizes the need for continuous human supervision and reduces resource
use.
5.5 Future Enhancements
Adaptive Learning Over Multiple Environments:
Provide domain adaptation techniques that allow the model to generalize effectively
across settings such as urban areas, remote facilities, and crowded public places. This
reduces the need for retraining in new environments and improves the consistency and
accuracy of detection across these contexts.
Improved Object Interaction Discovery with Graph Neural Networks:
Use Graph Neural Networks (GNNs) to analyze relational data on objects such as people
and vehicles. This would allow the system to examine complex interactions, for instance
unusually close proximity or strange patterns of behavior, and improve the discovery of
suspicious interactions and social anomalies in surveillance footage.
Context-aware Temporal Anomaly Detection:
Use a model that combines Temporal Convolutional Networks (TCNs) and Transformer
models to account for both short-range and long-range dependencies in video sequences.
Anomalies in motion or behavior over time can be very subtle, and capturing them is vital
for identifying complex, context-specific events in dynamic settings.
REFERENCES
[1] A. M.R., M. Makker and A. Ashok, "Anomaly Detection in Surveillance
Videos," 2019 26th International Conference on High Performance Computing,
Data and Analytics Workshop (HiPCW), Hyderabad, India, 2019, pp. 93-98, doi:
10.1109/HiPCW.2019.00031.
[2] A. B. Nassif, M. A. Talib, Q. Nasir and F. M. Dakalbab, "Machine Learning for
Anomaly Detection: A Systematic Review," in IEEE Access, vol. 9, pp. 78658-
78700, 2021, doi: 10.1109/ACCESS.2021.3083060.
[3] Zhang, L., Li, S., Luo, X. et al. Video anomaly detection with both normal and
anomaly memory modules. Vis Comput (2024). https://doi.org/10.1007/s00371-024-
03584-z
[4] Rezaiezadeh Roukerd, F., Rajabi, M.M. Anomaly detection in groundwater
monitoring data using LSTM-Autoencoder neural networks. Environ Monit
Assess 196, 692 (2024). https://doi.org/10.1007/s10661-024-12848-z.
[5] D. Kwon, K. Natarajan, S. C. Suh, H. Kim and J. Kim, "An Empirical Study on
Network Anomaly Detection Using Convolutional Neural Networks," 2018 IEEE
38th International Conference on Distributed Computing Systems (ICDCS), Vienna,
Austria, 2018, pp. 1595-1598, doi: 10.1109/ICDCS.2018.00178.
[6] M. Ganesh, A. Kumar and V. Pattabiraman, "Autoencoder Based Network Anomaly
Detection," 2020 IEEE International Conference on Technology, Engineering,
Management for Societal impact using Marketing, Entrepreneurship and Talent
(TEMSMET), Bengaluru, India, 2020, pp. 1-6, doi:
10.1109/TEMSMET51618.2020.9557464.
[7] H. Dhole, M. Sutaone and V. Vyas, "Anomaly Detection using Convolutional
Spatiotemporal Autoencoder," 2019 10th International Conference on Computing,
Communication and Networking Technologies (ICCCNT), Kanpur, India, 2019, pp.
1-5, doi: 10.1109/ICCCNT45670.2019.8944523.
[8] T.-Y. Wu, Z. Lee, Y. Huang, C.-M. Chen and Y.-C. Chen, "Security Analysis of
Wu et al.'s Authentication Protocol for Distributed Cloud Computing," 2019 IEEE
International Conference on Consumer Electronics - Taiwan (ICCE-TW), Yilan,
Taiwan, 2019, pp. 1-2, doi: 10.1109/ICCE-TW46550.2019.8991710.
[9] R. Vinayakumar, K. P. Soman and P. Poornachandran, "Long short-term memory
based operation log anomaly detection," 2017 International Conference on Advances
in Computing, Communications and Informatics (ICACCI), Udupi, India, 2017, pp.
236-242, doi: 10.1109/ICACCI.2017.8125846.
[10] T. Ergen and S. S. Kozat, "Neural networks based online learning," 2017 25th Signal
Processing and Communications Applications Conference (SIU), Antalya, Turkey,
2017, pp. 1-4, doi: 10.1109/SIU.2017.7960218.
[11] H. Yuqing, L. Shanshan and Z. Jian, "Multi-channel key frame extraction for video
surveillance system," 2022 2nd International Conference on Networking,
Communications and Information Technology (NetCIT), Manchester, United
Kingdom, 2022, pp. 83-85, doi: 10.1109/NetCIT57419.2022.00028.
[12] X. Qi, Z. Hu and G. Ji, "Retraining Generative Adversarial Autoencoder for Video
Anomaly Detection," in 2023 Eleventh International Conference on Advanced Cloud
and Big Data (CBD), Danzhou, China, 2023, pp. 63-68, doi:
10.1109/CBD63341.2023.00020.
[13] S. K. Dani, C. Thakur, N. Nagvanshi and G. Singh, "Anomaly Detection using PCA
in Time Series Data," 2024 IEEE International Conference on Interdisciplinary
Approaches in Technology and Management for Social Innovation (IATMSI),
Gwalior, India, 2024, pp. 1-6, doi: 10.1109/IATMSI60426.2024.10502929.
[14] Mishra, S., Jabin, S. Anomaly detection in surveillance videos using deep
autoencoder. Int. j. inf. tecnol. 16, 1111–1122 (2024).
https://doi.org/10.1007/s41870-023-01659-z.
[15] Gnouma, M., Ejbali, R., Zaied, M. (2023). Abnormal Event Detection Method Based
on Spatiotemporal CNN Hashing Model. In: Abraham, A., Pllana, S., Casalino, G.,
Ma, K., Bajaj, A. (eds) Intelligent Systems Design and Applications. ISDA 2022.
Lecture Notes in Networks and Systems, vol 717. Springer, Cham.
https://doi.org/10.1007/978-3-031-35510-3_16.