This project presents the Sensors Data Handler, which aims to build a model capable of assessing and predicting, in real-time, possible failures and abnormalities in industrial equipment based on data collected from sensors. This document provides an overview of the project and is organized into chapters as follows:
- Chapter 2: Data Reading
- Chapter 3: Data Preprocessing, Filtering, and Cleaning
- Chapter 4: Exploratory Analysis
- Chapter 5: Machine Learning Model
- Chapter 6: Real-world Scenario with Apache Kafka
This chapter covers the data reading stage and techniques used in the development process.
The data was provided in CSV format, with one file of records for each sensor monitoring the equipment.
Two alternatives were considered for aggregating the individual files into a single DataFrame: dynamic aggregation at runtime and pre-building a complete dataset. A script was created to construct the complete file when it is not found, as sketched below.
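A minimal sketch of the pre-building approach, assuming pandas, that each sensor's CSV holds a timestamp column plus a single reading column, and a hypothetical `data/sensor_*.csv` layout (paths and column names are assumptions):

```python
import glob
import os

import pandas as pd

FULL_PATH = "full.csv"              # assumed name of the aggregated file
SENSOR_GLOB = "data/sensor_*.csv"   # hypothetical per-sensor file layout

if not os.path.exists(FULL_PATH):
    frames = []
    for path in sorted(glob.glob(SENSOR_GLOB)):
        name = os.path.splitext(os.path.basename(path))[0]
        # Read one sensor's records, indexed by timestamp
        frame = pd.read_csv(path, parse_dates=["timestamp"], index_col="timestamp")
        frame.columns = [name]      # label the reading column with the sensor name
        frames.append(frame)
    # Align all sensors on the timestamp index and persist the complete dataset
    pd.concat(frames, axis=1).to_csv(FULL_PATH)
```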
This stage involves methods such as `head()`, `tail()`, `summarize()`, etc., to gain insights into the data. Duplicate rows were removed, and rows in which every value was null were cautiously eliminated. The DataFrame was reindexed with the timestamp column as the index, transforming it into a time series, and then sorted by that index for greater confidence when working with the data. A sketch of these steps follows.
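A minimal pandas sketch of these cleaning steps, assuming the timestamp column is named `timestamp` (an assumption):

```python
import pandas as pd

df = pd.read_csv("full.csv")

df = df.drop_duplicates()                          # remove duplicate rows
df = df.dropna(how="all")                          # drop rows where every value is null
df["timestamp"] = pd.to_datetime(df["timestamp"])  # assumed column name
df = df.set_index("timestamp").sort_index()        # turn into a sorted time series
```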
Initial checks included examining the proportion of normal and abnormal states of the equipment within the recorded values to understand the predominant behavior.
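Assuming the equipment state is recorded in the `status` column mentioned later as the target, this check can be as simple as:

```python
# Share of each equipment state among the recorded values
print(df["status"].value_counts(normalize=True))
```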
Given the quantity of available data, an analytical outlier identification method was applied, and it proved accurate for this case.
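The text does not name the exact method; as an illustration only, a common analytical criterion is the interquartile range (IQR) rule:

```python
# Illustrative IQR rule; the project's actual criterion may differ
num = df.select_dtypes("number")                   # numeric sensor columns only
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
is_outlier = num.lt(q1 - 1.5 * iqr) | num.gt(q3 + 1.5 * iqr)
print(is_outlier.sum())                            # outlier count per sensor
```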
This step is essential for selecting the machine learning model. The correlation between sensor variables and the target (status) was analyzed. The results indicate predominantly independent sensors, leading to a multivariate analysis with independent features.
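A sketch of this check, assuming pandas and that the `status` labels are encoded numerically for the correlation computation:

```python
# Correlation of each sensor with the (numerically encoded) target
encoded = df.copy()
encoded["status"] = encoded["status"].astype("category").cat.codes
corr = encoded.corr(numeric_only=True)
print(corr["status"].sort_values())
```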
For detailed information, refer to the corresponding chapters in this document.
The Apache Kafka simulation within the Sensors Data Handler project enables processing in an environment where sensor data is generated and consumed in real-time. The implementation uses Apache Kafka, an open-source stream processing platform developed by the Apache Software Foundation.
The primary concept involves processing sensor data at each timestamp, formatting it, and writing it to the Kafka queue (producer). Subsequently, the consumer listens to this queue, reads, processes, and writes new records to the PostgreSQL database. The machine learning model script is then triggered, consuming from the table and generating real-time updated analyses.
Kafka "topics" store the sequence of received events and can be replicated across different partitions of the broker. For this project, a single topic (SensorDataStream
) with two partitions was created.
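One way to create such a topic, assuming the kafka-python package and a broker on localhost:9092 (the project may instead use the Kafka CLI or its setup script):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="SensorDataStream", num_partitions=2, replication_factor=1)
])
```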
Sensor data is read from the local file `full.csv` and inserted row by row into the Kafka queue with a 1-second delay to simulate real conditions.
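A minimal producer sketch under these assumptions (kafka-python, JSON-serialized rows, broker on localhost:9092):

```python
import json
import time

import pandas as pd
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

df = pd.read_csv("full.csv")
for _, row in df.iterrows():
    producer.send("SensorDataStream", row.to_dict())  # enqueue one record
    time.sleep(1)  # 1-second delay to simulate real conditions
producer.flush()
```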
While the producer enqueues messages, the consumer processes them one by one: each entry is read, processed, formatted, and inserted into the database.
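A matching consumer sketch, assuming kafka-python, psycopg2, and a hypothetical `sensor_data` table (the project defines its own schema):

```python
import json

import psycopg2
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "SensorDataStream",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

# Hypothetical connection parameters for the Dockerized PostgreSQL instance
conn = psycopg2.connect(host="localhost", dbname="sensors",
                        user="postgres", password="postgres")
cur = conn.cursor()

for message in consumer:
    record = message.value
    # Hypothetical table and columns; adapt to the actual schema
    cur.execute(
        "INSERT INTO sensor_data (ts, payload) VALUES (%s, %s)",
        (record.get("timestamp"), json.dumps(record)),
    )
    conn.commit()
```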
PostgreSQL, running in Docker, was chosen as the database. The implementation is kept simple, focusing on the values emitted by the sensors.
As values are generated in real-time, the Naive Bayes model is re-executed each time new data is inserted into the database, so its accuracy improves as records accumulate.
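A sketch of this retraining step, assuming scikit-learn's GaussianNB, SQLAlchemy for the database read, and a hypothetical `sensor_data` table whose columns mirror the CSV, including the `status` target (all names are assumptions):

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sqlalchemy import create_engine

# Hypothetical connection string; adjust to the actual PostgreSQL setup
engine = create_engine("postgresql://postgres:postgres@localhost:5432/sensors")
df = pd.read_sql("SELECT * FROM sensor_data", engine)

X = df.drop(columns=["status"]).select_dtypes("number")  # sensor features (assumed layout)
y = df["status"]                                         # equipment state target
model = GaussianNB().fit(X, y)                           # refit on all accumulated data
```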
To run the project on your machine, follow these steps:
- Set up a virtual environment:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Install necessary dependencies in WSL:
sudo apt update
sudo apt upgrade
sudo apt install dos2unix
sudo apt install gnome-terminal
- Execute the setup.sh script:
chmod +x kafka/setup.sh
chmod +x setup.sh
./setup.sh
- More details in Docs/documentation_en_US.pdf or Docs/documentacao_pt_br.pdf