Preprocessing, EDA and Forecast of a Short Time Series
Short Time Series Analysis
Trapped insects forecast based on meteorological data
This was a group project (with Leonardo Maria Marra), part of the Information Systems and Business Intelligence exam at the University of Naples Federico II.
Objective
The goal is to predict the number of insects caught, or whether any are caught at all, using meteorological data together with the catches recorded in previous time intervals.
Data loading
The data are time series pertaining to insect catches in specific areas of the territory and the related meteorological data.
The data was provided in .xlsx format and, after being appropriately transformed*, was loaded in .csv format.
* The transformation is explored in the Preprocessing section.
Preprocessing
At this stage the data is properly prepared for the next steps.
The data was initially split into separate files: for each of the 5 areas, one file containing the meteorological data and another containing the catch data.
Catches Data
DateTime | Total captured | New captures | Reviewed | Event |
---|---|---|---|---|
17.07.2024 | 1 | 1 | Yes | / |
18.07.2024 | 1 | 1 | Yes | / |
19.07.2024 | 1 | 0 | Yes | / |
20.07.2024 | 1 | 0 | Yes | / |
20.07.2024 | / | / | Yes | Cleaning |
... | ... | ... | ... | ... |
Meteorological Data
DateTime | Temperature | Humidity |
---|---|---|
17.07.2024 00:00:00 | 20,98 | 73,22 |
17.07.2024 01:00:00 | 23,74 | 59,92 |
17.07.2024 02:00:00 | 21,48 | 66,22 |
17.07.2024 03:00:00 | 19,62 | 72,62 |
17.07.2024 04:00:00 | 18,26 | 77,29 |
... | ... | ... |
Note the difference in sampling rate between the two datasets: catches are recorded daily, while the meteorological readings are hourly.
The transformation mentioned above consists of:
- Removal of irrelevant columns and/or records concerning insect capture (the Reviewed column and the Cleaning event records);
- Exclusion of redundant columns concerning weather information;
- Calculation of daily weather data and association with catch data.
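The steps listed above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the inline sample rows, the `;` separator, the `Humidity` column spelling, the function name `build_daily_dataset`, and the use of the daily mean as the aggregation are all assumptions.

```python
import io
import pandas as pd

# Hypothetical inline samples shaped like the tables above (values are illustrative).
catches_csv = """DateTime;Total captured;New captures;Reviewed;Event
17.07.2024;1;1;Yes;/
18.07.2024;1;1;Yes;/
19.07.2024;1;0;Yes;/
20.07.2024;/;/;Yes;Cleaning
"""
weather_csv = """DateTime;Temperature;Humidity
17.07.2024 00:00:00;20,98;73,22
17.07.2024 01:00:00;23,74;59,92
18.07.2024 00:00:00;21,48;66,22
18.07.2024 01:00:00;19,62;72,62
"""


def build_daily_dataset(catches: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
    # Keep only regular catch records and drop the columns not used later.
    keep = ["DateTime", "Total captured", "New captures"]
    catches = catches.loc[catches["Event"] == "/", keep].copy()
    catches["DateTime"] = pd.to_datetime(catches["DateTime"], format="%d.%m.%Y")
    for col in ["Total captured", "New captures"]:
        catches[col] = pd.to_numeric(catches[col])

    # Aggregate hourly readings to one row per day (assumption: daily mean).
    weather = weather.copy()
    weather["DateTime"] = pd.to_datetime(
        weather["DateTime"], format="%d.%m.%Y %H:%M:%S"
    )
    daily_weather = (
        weather.set_index("DateTime")[["Temperature", "Humidity"]]
        .resample("D")
        .mean()
        .reset_index()
    )

    # Associate each day's catches with that day's weather.
    return catches.merge(daily_weather, on="DateTime", how="inner")


catches = pd.read_csv(io.StringIO(catches_csv), sep=";")
# decimal="," handles the comma decimal separator used in the weather export.
weather = pd.read_csv(io.StringIO(weather_csv), sep=";", decimal=",")
daily = build_daily_dataset(catches, weather)
print(daily)
```

An inner join keeps only the days present in both sources. A left join on the catch data would be the alternative if days without weather readings should be kept.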
This results in datasets having the following structure:
DateTime | Total captured | New captures | Temperature | Humidity |
---|---|---|---|---|
2024-08-17 | 0 | 0 | 28.244167 | 61.428750 |
2024-08-18 | 0 | 0 | 26.890000 | 64.881250 |
2024-08-19 | 0 | 0 | 25.890417 | 64.844167 |
2024-08-20 | 1 | 1 | 21.650417 | 83.205417 |
2024-08-21 | 1 | 0 | 23.003750 | 87.472083 |
... | ... | ... | ... | ... |
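Since the objective relies on catches from previous time intervals, a dataset with this structure can be extended with lagged columns. The sketch below is only one possible way to do it: the helper name `add_lag_features`, the choice of lags, and the lag column names are hypothetical, and the sample rows are the five shown in the table above.

```python
import pandas as pd


def add_lag_features(df, target="New captures", lags=(1, 2)):
    # Sort chronologically so that shift(k) really means "k days earlier".
    out = df.sort_values("DateTime").copy()
    for lag in lags:
        out[f"{target} lag {lag}"] = out[target].shift(lag)
    # The first max(lags) rows have no history, so they are dropped.
    return out.dropna().reset_index(drop=True)


sample = pd.DataFrame(
    {
        "DateTime": pd.to_datetime(
            ["2024-08-17", "2024-08-18", "2024-08-19", "2024-08-20", "2024-08-21"]
        ),
        "New captures": [0, 0, 0, 1, 0],
        "Temperature": [28.244167, 26.890000, 25.890417, 21.650417, 23.003750],
    }
)
lagged = add_lag_features(sample)
print(lagged)
```

With a series this short, every extra lag costs one training row, so the number of lags is a trade-off between history and sample size.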
Exploratory Data Analysis
This section shows some of the plots produced to explore, understand and extract visual insights from the provided data.
Interactive plots are available on the hosted dashboard.
Exploring variable relationships
This graph shows the relationship between Temperature, Humidity and the number of Catches:
Most catches seem to occur within specific ranges:
- 25°C < Temperature < 29°C
- 50% < Humidity < 73%
These intervals may represent the insects' preferred weather conditions.
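A visual impression like this can be checked numerically by computing the share of catches that fall inside both ranges. The snippet is a hedged sketch: the helper `share_in_range` is hypothetical, and the toy rows are made up purely to show the calculation; they are not project data.

```python
import pandas as pd


def share_in_range(df, temp=(25.0, 29.0), hum=(50.0, 73.0), target="New captures"):
    # Strict inequalities, matching the "<" bounds listed above.
    in_range = df["Temperature"].between(*temp, inclusive="neither") & df[
        "Humidity"
    ].between(*hum, inclusive="neither")
    total = df[target].sum()
    # Avoid dividing by zero when no catches were recorded at all.
    return float(df.loc[in_range, target].sum() / total) if total else float("nan")


# Made-up rows, used only to demonstrate the function.
toy = pd.DataFrame(
    {
        "Temperature": [26.0, 27.5, 22.0, 30.0],
        "Humidity": [60.0, 70.0, 85.0, 55.0],
        "New captures": [2, 1, 1, 0],
    }
)
share = share_in_range(toy)
print(f"{share:.0%} of catches fall inside both ranges")
```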
Results
Multiple models were trained on this data. The hosted dashboard includes plots of each model's forecast for each of the time series.
The following table sums up the results in terms of the evaluation metric used (RMSE):
Model | Cicalino 1 | Cicalino 2 | Cicalino Merged | Imola 1 | Imola 2 | Imola 3 | Imola Merged |
---|---|---|---|---|---|---|---|
Decision Tree | 0.77 | 0.33 | 0.89 | 2.05 | 0.63 | 1.00 | 2.57 |
ARIMAX | 0.00 | 0.33 | 0.32 | 2.05 | 0.00 | 1.00 | 2.76 |
LSTM | 0.00 | 0.33 | 0.32 | 2.05 | 0.00 | 0.00 | 2.76 |
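For reference, RMSE is the square root of the mean squared difference between observed and forecast values. The function below is a minimal NumPy sketch of that definition, not necessarily the implementation used in the notebook, and the example values are invented.

```python
import numpy as np


def rmse(y_true, y_pred):
    # Root of the mean squared error over the evaluation window.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Invented values: two exact predictions, one off by 1, one off by 2.
observed = [0, 1, 0, 2]
forecast = [0, 1, 1, 0]
score = rmse(observed, forecast)
print(round(score, 2))
```

RMSE is expressed in the same unit as the target (catches per day), and a value of 0.00 in the table means the forecasts matched the observed values to within two decimal places on the evaluation window.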
Links
- Google Colab notebook - Italian
- Streamlit dashboard - Hosted on Streamlit Community Cloud
- Github repository - Streamlit dashboard