Chapters
INTRODUCTION
Demand is growing rapidly for automated surveillance systems that can
quickly flag such anomalies. Such a system will enable security personnel
to respond quickly and reliably.
The UCSD dataset, a collection of surveillance videos containing both
normal and anomalous behaviours, forms the basis for training and testing
the model. The model learns the patterns in that dataset in a way that is
intended to generalize to new footage, so that it can identify abnormal
patterns with few false positives. It also provides visual feedback on
detected anomalies through bounding boxes, giving personnel enough time to
assess a potential threat.
generalization across environments. Integration with additional sensors,
such as thermal cameras and audio, in multi-modal systems may give a more
holistic view of threats in low-light or crowded areas. Reinforcement
learning could, in principle, refine the system further by letting it
improve from real-world feedback. Cloud and edge computing are expected to
play an important role by enabling scalable real-time processing. Further
work on making AI models more interpretable will help explain the
decision-making process. Such innovations will make automated surveillance
proactive rather than reactive, preventing incidents before they occur
with minimal human oversight.
CHAPTER 2
LITERATURE SURVEY
localization accuracy. Meanwhile, Dhole et al. [7] proposed a convolutional
spatiotemporal autoencoder for feature extraction in video sequences. Their
adjustments to the convolutional filter size and their choice of pooling
strategy helped achieve better temporal feature extraction, which is
essential for anomaly detection.
As the field advances, it is becoming increasingly clear that, although
many methods have explored modelling of temporal relations, most of them
rely on parallel branches that introduce more parameters and therefore
increase computational cost. This is seen in [9], where Vinayakumar et al.
proposed a model that combines CNNs with LSTMs to better capture temporal
dynamics. Their hyperparameter choices, such as the number of
convolutional layers and the sequence length, highlight the appeal of
fine-tuned models that are both more accurate and less computationally
expensive.
CHAPTER 3
RESEARCH METHODOLOGY
Real-time Anomaly Detection: Apply an autoencoder model that detects
anomalies in real time with very few false positives, sends timely alerts,
and supports efficient monitoring.
System Evaluation and Optimization: Test the performance of the model
across different environments, tune its parameters for the best precision,
and evaluate the results using accuracy, precision, recall, and F1 score.
The system is intended to automate and improve the efficiency of anomaly
detection in public spaces, offices, and industrial zones.
Resource Demands: Deployment requires powerful hardware, large amounts
of memory, and high-speed GPUs, which can be a heavy load. These demands
are most acute in real-time applications.
Speed vs. Accuracy: Achieving fast, real-time detection together with
high accuracy is one of the hardest requirements to meet. Processing video
frames reliably does not necessarily mean processing them quickly.
Error: The system may fail to detect a threat when people are present
(a false negative), or may flag normal activities as threats (a false
positive). This occurs mostly in crowded environments or complex
scenarios, where such errors are hard to minimize.
Varied Scenes: The system can be confused by changes in lighting,
different viewing angles, or varying crowd sizes. Adapting to such
diversity while remaining reliable is hard.
Compatibility: The system has to be installed within existing security
infrastructure. It should integrate smoothly and fit the present layout
without much disruption.
Cost and Scaling: Covering large areas is expensive, and scaling up to
multiple locations without consuming a lot of resources is a major
challenge.
maintainability and usability so that the system functions well in
real-time environments.
Design of the Machine Learning Model:
Follow best practices for ethical machine learning, including fairness,
transparency, and accountability, during both training and deployment of
the anomaly detection models.
Video Compression and Transmission:
Ensure efficient video compression so that data can be transmitted in
real time without losing the quality needed for video processing and
monitoring.
CHAPTER 4
DESIGN AND METHODOLOGY
Visualization Module: Highlights detected anomalies using
bounding boxes and presents clear alerts.
4.1.2 Methodology
The methodology for this project involves several critical stages, as
represented in the block diagram. The entire process can be broken down
into the following detailed steps, starting from the input video dataset to the
final output of anomaly detection:
A. Input Video Dataset (UCSD Dataset)
The input to the system consists of videos from the UCSD dataset.
This dataset includes surveillance videos recorded in various environments,
containing both normal and anomalous behaviours.
Table 4.1.2 (a): Details of UCSD Pedestrian Dataset (Ped1 and Ped2)
Dataset     No. of Videos  No. of Videos  Resolution  Anomalous Events
            (Training)     (Testing)
UCSD Ped1   34             36             238 × 158   People walking outside designated
                                                      paths, bikes, cars, etc.
UCSD Ped2   16             12             360 × 240   Similar anomalies, such as
                                                      non-pedestrian objects on walkways.
This table summarizes the UCSD Ped1 and Ped2 datasets, detailing
the number of training and testing videos, resolutions, and types of
anomalous events. Anomalies include people walking outside designated
paths, vehicles, and non-pedestrian objects on walkways.
B. Convert Videos to Frames (Frame Extraction)
The videos are split into individual frames. Each frame is treated as a
separate image for further processing.
Reason: Video anomaly detection works at the frame level, as processing
each frame individually allows the model to detect sudden irregularities.
Table 4.1.2 (b): AUC Comparison Across Datasets for Different Models
Normalization: Pixel values are normalized to a range between 0 and 1
to speed up convergence during model training.
\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{4.1} \]
where x is the original pixel value.
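As an illustration of Equation (4.1), the following is a minimal sketch using NumPy. The function name `min_max_normalize` and the handling of a constant frame are assumptions made for this example, not part of the original system.

```python
import numpy as np

def min_max_normalize(frame: np.ndarray) -> np.ndarray:
    """Scale pixel values to the range [0, 1] following Equation (4.1)."""
    frame = frame.astype(np.float32)
    lo, hi = frame.min(), frame.max()
    if hi == lo:
        # Constant frame: the formula would divide by zero.
        # Returning zeros is an assumed convention for this sketch.
        return np.zeros_like(frame)
    return (frame - lo) / (hi - lo)

# Tiny 2x2 "frame" with 8-bit pixel values
frame = np.array([[0, 64], [128, 255]], dtype=np.uint8)
normalized = min_max_normalize(frame)
print(normalized)
```

For 8-bit frames whose values span the full range 0 to 255, this is equivalent to dividing every pixel by 255.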
Data Augmentation: Various transformations such as flipping, rotating,
and cropping are applied to the frames to create variations in the training
data. This helps the model generalize better.
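The transformations named above can be sketched with plain NumPy operations. The function name `augment`, the 180-degree rotation, and the crop size are illustrative assumptions; the report does not state which angles or crop sizes were used.

```python
import numpy as np

def augment(frame: np.ndarray, crop: int, rng: np.random.Generator) -> list:
    """Return flipped, rotated, and randomly cropped variants of one frame."""
    flipped = np.fliplr(frame)        # horizontal flip
    rotated = np.rot90(frame, k=2)    # 180-degree rotation keeps the frame shape
    h, w = frame.shape[:2]
    # Pick the top-left corner so that the crop stays inside the frame
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    cropped = frame[top:top + crop, left:left + crop]
    return [flipped, rotated, cropped]

rng = np.random.default_rng(0)
frame = np.arange(36, dtype=np.float32).reshape(6, 6)
flipped, rotated, cropped = augment(frame, crop=4, rng=rng)
print(flipped.shape, rotated.shape, cropped.shape)
```

In practice a cropped frame would be resized back to the network's input size before training.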
D. Feature Extraction
Convolutional Neural Networks (CNN): CNNs are applied to the
frames to extract spatial features. The layers of CNNs perform convolution
operations to detect patterns such as edges, corners, and textures in the
images.
Layers:
• Convolution Layer: Detects low-level features using filters.
• Activation Function (ReLU): Introduces non-linearity into the
model.
• Pooling Layer: Reduces the spatial dimensions of the feature maps.
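The three layer types listed above can be illustrated on a toy image with a hand-written NumPy sketch. This is not the trained network: the vertical-edge filter is fixed by hand here, whereas a CNN learns its filters, and the helper names are assumptions for this example.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2-D image with a 2-D kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.float32)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Activation: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    x = x[:h, :w]
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# 6x6 image with a vertical edge: left half dark, right half bright
image = np.zeros((6, 6), dtype=np.float32)
image[:, 3:] = 1.0
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3, dtype=np.float32)

features = max_pool2d(relu(conv2d(image, edge_kernel)))
print(features)  # 2x2 feature map; each entry is 3.0
```

The convolution responds only where the window straddles the edge, ReLU keeps those positive responses, and pooling halves each spatial dimension of the 4 × 4 response map.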
E. Model Training
Autoencoder: An autoencoder is used to learn a compressed
representation of the normal data. It consists of two main parts:
• Encoder: Compresses the input into a latent space representation.
• Decoder: Reconstructs the input from the compressed data.
\[ L = \lVert A - \bar{A} \rVert \tag{4.4} \]
where A is the input and $\bar{A}$ is the reconstructed output. Training
minimizes the loss L, which measures the reconstruction error.
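The encoder/decoder idea and the loss of Equation (4.4) can be illustrated with a toy linear autoencoder. This is a simplified stand-in, not the convolutional autoencoder used in this work: the optimal linear encoder is obtained in closed form from a singular value decomposition (equivalent to PCA), and the 4-dimensional synthetic data replaces real frames. All names and values are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" samples lie close to a single line in 4-D space
direction = np.array([1.0, 2.0, -1.0, 0.5])
direction /= np.linalg.norm(direction)
normal = rng.normal(size=(200, 1)) * direction + 0.01 * rng.normal(size=(200, 4))

# Fit the linear autoencoder: keep the top principal direction
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
basis = vt[:1]  # shape (1, 4): latent dimension of 1

def encode(a):
    """Encoder: compress the input into the latent space."""
    return (a - mean) @ basis.T

def decode(z):
    """Decoder: reconstruct the input from the latent code."""
    return z @ basis + mean

def reconstruction_error(a):
    """L = ||A - A_bar||, as in Equation (4.4)."""
    return float(np.linalg.norm(a - decode(encode(a))))

normal_sample = 1.5 * direction                 # follows the learned pattern
off_pattern = np.array([2.0, -1.0, 0.0, 0.0])   # orthogonal to `direction`
anomalous_sample = off_pattern / np.linalg.norm(off_pattern)

print(reconstruction_error(normal_sample))      # small
print(reconstruction_error(anomalous_sample))   # close to 1
```

Because the model has only seen data along one direction, anything off that direction cannot be represented in the latent space and reconstructs poorly, which is the property the anomaly detector relies on.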
Table 4.1.2 (c): Average Reconstruction Error Comparison for Normal and
Anomalous Events
Dataset        Average Reconstruction  Average Reconstruction
               Error (Normal)          Error (Anomalous)
ShanghaiTech   0.0032                  0.0467
This table shows the average reconstruction error for normal and
anomalous events across three datasets: ShanghaiTech, UCSD Ped2, and
CUHK Avenue. Anomalous events consistently have a higher
reconstruction error, indicating the model's effectiveness in distinguishing
between normal and abnormal behaviour.
LSTM (Long Short-Term Memory): LSTM networks are used to
capture temporal patterns between frames. This is crucial for video data as
anomalies may occur across consecutive frames.
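\[ a_t = \sigma(\omega_a \cdot [g_{t-1}, x_t] + p_a) \tag{4.5} \]
\[ b_t = \sigma(\omega_b \cdot [g_{t-1}, x_t] + p_b) \tag{4.6} \]
\[ \tilde{C}_t = \tanh(\omega_C \cdot [g_{t-1}, x_t] + p_C) \tag{4.7} \]
\[ C_t = a_t * C_{t-1} + b_t * \tilde{C}_t \tag{4.8} \]
where $a_t$ is the forget gate, $b_t$ is the input gate, $\tilde{C}_t$ is
the candidate cell state, and $C_t$ is the updated cell state. Here
$g_{t-1}$ is the previous hidden state, $x_t$ is the current input, and
$\omega$ and $p$ are the weights and biases of the corresponding gate.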
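A single cell-state update following Equations (4.5) to (4.8) can be sketched in NumPy as below. The sketch assumes the usual LSTM reading of the symbols (previous hidden state concatenated with the current input, and the input gate scaling the candidate in the update); the dictionary layout, sizes, and random weights are illustrative. The output gate and hidden-state update of a full LSTM are not covered by these equations and are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_state(x_t, g_prev, c_prev, w, p):
    """One cell-state update following Equations (4.5)-(4.8).

    w and p are dicts of weight matrices and bias vectors keyed by
    'a' (forget gate), 'b' (input gate), and 'c' (candidate).
    """
    concat = np.concatenate([g_prev, x_t])        # previous hidden state and current input
    a_t = sigmoid(w["a"] @ concat + p["a"])       # forget gate (4.5)
    b_t = sigmoid(w["b"] @ concat + p["b"])       # input gate (4.6)
    c_tilde = np.tanh(w["c"] @ concat + p["c"])   # candidate cell state (4.7)
    return a_t * c_prev + b_t * c_tilde           # new cell state (4.8)

hidden, inputs = 3, 2
rng = np.random.default_rng(0)
w = {k: rng.normal(size=(hidden, hidden + inputs)) for k in "abc"}
p = {k: np.zeros(hidden) for k in "abc"}

c_t = lstm_cell_state(rng.normal(size=inputs), np.zeros(hidden), np.ones(hidden), w, p)
print(c_t)
```

The forget gate decides how much of the previous cell state survives, while the input gate decides how much of the new candidate is written, which is what lets the network carry information across consecutive frames.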
F. Anomaly Detection
Anomalies are detected based on the reconstruction error. Frames
with high reconstruction errors are flagged as anomalies since the model is
trained on normal data.
Threshold Setting: A threshold is set to classify frames as anomalous or
normal.
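One common way to set such a threshold is sketched below: take the mean plus k standard deviations of the reconstruction errors measured on normal validation frames. This rule, the value k = 3, and the error values are assumptions for illustration; the report does not specify how its threshold was chosen.

```python
import numpy as np

def fit_threshold(normal_errors, k=3.0):
    """Threshold = mean + k * std of errors on normal frames (assumed rule)."""
    errors = np.asarray(normal_errors, dtype=float)
    return float(errors.mean() + k * errors.std())

def flag_anomalies(errors, threshold):
    """Boolean mask: True where a frame is classified as anomalous."""
    return np.asarray(errors, dtype=float) > threshold

# Illustrative per-frame reconstruction errors on normal validation frames
normal_errors = [0.0030, 0.0034, 0.0031, 0.0033, 0.0032]
threshold = fit_threshold(normal_errors)

# Errors for three new frames; the middle one reconstructs poorly
flags = flag_anomalies([0.0031, 0.0467, 0.0033], threshold)
print(threshold, flags)
```

A larger k lowers the false-positive rate at the cost of missing subtler anomalies, so k is typically tuned on held-out data.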
G. Postprocessing & Visualization
The detected anomalies are visualized by marking the frames where
the anomalies occurred. These frames are then compiled into a video,
showing when the system detects abnormal behaviour.
H. Output (Final Result)
The final output is a video or series of frames showing the detected
anomalies. A report summarizing the anomalies found is also generated.
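A simple form of such a report can be produced by grouping consecutive flagged frames into events. The sketch below is one possible approach; the function name `summarize_events` and the per-frame flags are hypothetical.

```python
def summarize_events(flags):
    """Group consecutive anomalous frames into (start_frame, end_frame) events."""
    events, start = [], None
    for index, is_anomalous in enumerate(flags):
        if is_anomalous and start is None:
            start = index                      # an event begins
        elif not is_anomalous and start is not None:
            events.append((start, index - 1))  # the event ended on the previous frame
            start = None
    if start is not None:
        events.append((start, len(flags) - 1))  # event runs to the last frame
    return events

# Hypothetical per-frame decisions from the detector
flags = [False, True, True, False, False, True, False, True, True, True]
events = summarize_events(flags)
for start, end in events:
    print(f"Anomaly from frame {start} to frame {end} ({end - start + 1} frames)")
```

Dividing the frame indices by the video frame rate would convert each event into a time range.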
Table 4.1.2 (d): Confusion Matrix
Fig. 4.1: Anomaly Detection Using CNN with Autoencoder
The model also performs reconstruction error analysis: it compresses the
video frames and then reconstructs them, and a high reconstruction error
indicates an anomaly. A low false-positive rate is one of the significant
qualitative advantages of this model, since unnecessary alerts are very
disruptive in a surveillance system. The model also tolerates changes in
illumination, crowd density, and scene complexity. However, it is limited
in detecting subtle, context-dependent anomalies, such as slightly unusual
human behaviours: because the reconstruction error threshold is fixed, it
may miss them in scenes that are highly complex or noisy. Moreover, robust
detection of general anomalies does not guarantee good performance in
environments full of rare but subtle anomalies that do not deviate
strongly from the learned normal behaviours.
Table 4.2: Dataset Statistics for Training and Testing Videos
This table summarizes the number of videos and frames used for training
and testing across three datasets: ShanghaiTech, UCSD Ped2, and CUHK
Avenue. It gives the total number of frames analysed in the training and
testing phases to show the size and balance of each dataset.
4.3 Design Specifications
4.3.1 Software Requirements
Programming Language: Python
Python is chosen due to its strong ecosystem of libraries for machine
learning and image processing, as well as its ease of readability and
flexibility for rapid prototyping.
Machine Learning Libraries: TensorFlow/Keras
TensorFlow or Keras are required for building, training, and
deploying neural network models, especially deep learning architectures
like Convolutional Neural Networks (CNNs) and Autoencoders. These
libraries support complex neural network operations and allow for efficient
model optimization.
Video Processing and Computer Vision: OpenCV
OpenCV is essential for handling video inputs, including extracting
frames, resizing, normalizing, and preprocessing images from video
footage. These steps are crucial for preparing data for the neural network
models and ensuring consistency in input dimensions and formats.
Development Environment: Jupyter Notebook
Jupyter Notebook serves as the primary environment for developing
and experimenting with code, enabling easy debugging, visualization of
results, and iterative testing of model adjustments, which is critical in
machine learning workflows.
Data Management Capabilities
Managing large video datasets like the UCSD Pedestrian Dataset
requires tools capable of handling high volumes of data. These datasets are
used to train and validate the model by providing varied examples of
normal and abnormal events.
CHAPTER 5
RESULTS AND DISCUSSIONS
5.1 Results
Fig. 5.1 (c): Accuracy
Fig. 5.1 (e): Recall
Fig. 5.1 (g): F1 Score
5.2 Accuracy
The anomaly detection system is designed to detect abnormal events in
surveillance video with high accuracy. Its performance is evaluated using
precision and recall, and it is expected to detect anomalies with a low
false-positive rate. The model applies advanced machine learning
techniques, which outperform traditional approaches, for effective and
reliable detection of anomalies.
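The evaluation metrics can be computed directly from confusion-matrix counts, as in the sketch below. The counts are hypothetical values chosen for illustration and are not results from this work.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 1000 frames, 50 of them truly anomalous
metrics = classification_metrics(tp=40, fp=10, tn=940, fn=10)
print(metrics)

# A detector that never raises an alert on the same data
always_normal = classification_metrics(tp=0, fp=0, tn=950, fn=50)
print(always_normal)
```

The second call shows why precision and recall are reported alongside accuracy: when anomalies are rare, a detector that never raises an alert still reaches 95% accuracy here while its recall is zero.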
5.3 Suggestions and Recommendations
Hyperparameter Fine-Tuning:
Tune the learning rate, layer configurations, and activation functions
to improve performance, particularly in complex environments.
Multimodal Input Data:
Adding audio or other sensor data alongside the visual data could make
the anomaly detection system more robust.
Data Augmentation:
Apply additional data augmentation methods so that the model generalizes
to varied scenarios, such as low-light or crowded scenes.
Minimize False Alarms:
Implement advanced post-processing techniques, such as noise filtering
and dynamic threshold adaptation, to reduce false alarms and improve the
reliability of the anomaly detection system.
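One possible form of this post-processing is sketched below: a median filter removes single-frame spikes in the anomaly score, and a frame is flagged only when its score exceeds a multiple of the median of the preceding frames. The filter type, window sizes, factor, and scores are all assumptions for illustration.

```python
import numpy as np

def median_filter(scores, window=3):
    """Noise filtering: replace each score by the median of its neighbourhood."""
    scores = np.asarray(scores, dtype=float)
    pad = window // 2
    padded = np.pad(scores, pad, mode="edge")  # repeat edge values at both ends
    return np.array([np.median(padded[i:i + window]) for i in range(len(scores))])

def adaptive_flags(scores, history=5, factor=2.0):
    """Dynamic threshold: flag frame t if its score exceeds
    factor * median of up to `history` preceding scores."""
    scores = np.asarray(scores, dtype=float)
    flags = np.zeros(len(scores), dtype=bool)
    for t in range(1, len(scores)):  # frame 0 has no history and is never flagged
        baseline = np.median(scores[max(0, t - history):t])
        flags[t] = scores[t] > factor * baseline
    return flags

# Hypothetical scores: a one-frame spike at index 3, a sustained event at 8-10
scores = [0.10, 0.11, 0.10, 0.50, 0.10, 0.11, 0.10, 0.10, 0.45, 0.48, 0.47, 0.10]

raw_flags = adaptive_flags(scores)
filtered_flags = adaptive_flags(median_filter(scores))
print(np.flatnonzero(raw_flags))
print(np.flatnonzero(filtered_flags))
```

Without filtering, the isolated spike at frame 3 raises an alarm; after filtering, only the sustained event is flagged. The trade-off is that genuine anomalies lasting a single frame are also suppressed.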
5.4 Conclusion
The presented anomaly detection system surpasses traditional models
through marked improvements in accuracy and reliability when deployed in
dynamic surveillance environments. It combines autoencoders with CNN-LSTM
networks to detect anomalies effectively in real time with little latency.
As a result, the system can function independently, which minimizes the
need for continuous human supervision and reduces resource use.
5.5 Future Enhancements
Adaptive Learning Over Multiple Environments:
Apply domain adaptation techniques that allow the model to generalize
effectively across settings such as urban areas, remote facilities, and crowded public
places. This reduces the need to retrain in new environments and improves the
consistency and accuracy of detection across these contexts.
Improved Object Interaction Discovery with Graph Neural Networks:
Use GNNs to analyse relational data on objects such as people and vehicles. This
would allow the system to examine complex interactions, for instance unusually close
proximity or strange patterns of behaviour, and improve the discovery of suspicious
interactions and social anomalies in surveillance footage.
Context-aware Temporal Anomaly Detection:
Use a model that combines TCNs and Transformers to capture both the short-range
and the long-range dependencies in video sequences. Changes in motion or behaviour
over time can be very small, and capturing them is vital for identifying complex,
context-specific events in dynamic settings.
REFERENCES
[1] A. M.R., M. Makker and A. Ashok, "Anomaly Detection in Surveillance
Videos," 2019 26th International Conference on High Performance Computing,
Data and Analytics Workshop (HiPCW), Hyderabad, India, 2019, pp. 93-98, doi:
10.1109/HiPCW.2019.00031.
[2] A. B. Nassif, M. A. Talib, Q. Nasir and F. M. Dakalbab, "Machine Learning for
Anomaly Detection: A Systematic Review," in IEEE Access, vol. 9, pp. 78658-
78700, 2021, doi: 10.1109/ACCESS.2021.3083060.
[3] Zhang, L., Li, S., Luo, X. et al. Video anomaly detection with both normal and
anomaly memory modules. Vis Comput (2024). https://doi.org/10.1007/s00371-024-
03584-z
[4] Rezaiezadeh Roukerd, F., Rajabi, M.M. Anomaly detection in groundwater
monitoring data using LSTM-Autoencoder neural networks. Environ Monit
Assess 196, 692 (2024). https://doi.org/10.1007/s10661-024-12848-z.
[5] D. Kwon, K. Natarajan, S. C. Suh, H. Kim and J. Kim, "An Empirical Study on
Network Anomaly Detection Using Convolutional Neural Networks," 2018 IEEE
38th International Conference on Distributed Computing Systems (ICDCS), Vienna,
Austria, 2018, pp. 1595-1598, doi: 10.1109/ICDCS.2018.00178.
[6] M. Ganesh, A. Kumar and V. Pattabiraman, "Autoencoder Based Network Anomaly
Detection," 2020 IEEE International Conference on Technology, Engineering,
Management for Societal impact using Marketing, Entrepreneurship and Talent
(TEMSMET), Bengaluru, India, 2020, pp. 1-6, doi:
10.1109/TEMSMET51618.2020.9557464.
[7] H. Dhole, M. Sutaone and V. Vyas, "Anomaly Detection using Convolutional
Spatiotemporal Autoencoder," 2019 10th International Conference on Computing,
Communication and Networking Technologies (ICCCNT), Kanpur, India, 2019, pp.
1-5, doi: 10.1109/ICCCNT45670.2019.8944523.
[8] T. -Y. WU, Z. Lee, Y. Huang, C. -M. Chen and Y. -C. Chen, "Security Analysis of
Wu et al.'s Authentication Protocol for Distributed Cloud Computing," 2019 IEEE
International Conference on Consumer Electronics - Taiwan (ICCE-TW), Yilan,
Taiwan, 2019, pp. 1-2, doi: 10.1109/ICCE-TW46550.2019.8991710.
[9] R. Vinayakumar, K. P. Soman and P. Poornachandran, "Long short-term memory
based operation log anomaly detection," 2017 International Conference on Advances
in Computing, Communications and Informatics (ICACCI), Udupi, India, 2017, pp.
236-242, doi: 10.1109/ICACCI.2017.8125846.
[10] T. Ergen and S. S. Kozat, "Neural networks based online learning," 2017 25th Signal
Processing and Communications Applications Conference (SIU), Antalya, Turkey,
2017, pp. 1-4, doi: 10.1109/SIU.2017.7960218.
[11] H. Yuqing, L. Shanshan and Z. Jian, "Multi-channel key frame extraction for video
surveillance system," 2022 2nd International Conference on Networking,
Communications and Information Technology (NetCIT), Manchester, United
Kingdom, 2022, pp. 83-85, doi: 10.1109/NetCIT57419.2022.00028.
[12] X. Qi, Z. Hu and G. Ji, "Retraining Generative Adversarial Autoencoder for Video
Anomaly Detection," in 2023 Eleventh International Conference on Advanced Cloud
and Big Data (CBD), Danzhou, China, 2023, pp. 63-68, doi:
10.1109/CBD63341.2023.00020.
[13] S. K. Dani, C. Thakur, N. Nagvanshi and G. Singh, "Anomaly Detection using PCA
in Time Series Data," 2024 IEEE International Conference on Interdisciplinary
Approaches in Technology and Management for Social Innovation (IATMSI),
Gwalior, India, 2024, pp. 1-6, doi: 10.1109/IATMSI60426.2024.10502929.
[14] Mishra, S., Jabin, S. Anomaly detection in surveillance videos using deep
autoencoder. Int. j. inf. tecnol. 16, 1111–1122 (2024).
https://doi.org/10.1007/s41870-023-01659-z.
[15] Gnouma, M., Ejbali, R., Zaied, M. (2023). Abnormal Event Detection Method Based
on Spatiotemporal CNN Hashing Model. In: Abraham, A., Pllana, S., Casalino, G.,
Ma, K., Bajaj, A. (eds) Intelligent Systems Design and Applications. ISDA 2022.
Lecture Notes in Networks and Systems, vol 717. Springer, Cham.
https://doi.org/10.1007/978-3-031-35510-3_16.