This script processes raw 5G network performance data files into a clean and structured dataset ready for clustering and time-series forecasting tasks.
Its purpose is to consolidate and clean raw network measurement data from multiple CSV files: merging them, extracting essential features, computing derived metrics, and saving a cleaned dataset for use in machine learning pipelines.
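The merge step can be sketched as follows. This is a minimal illustration, not the script's confirmed implementation; the helper name is made up, and it relies on the assumption (stated in the notes below) that every raw CSV shares one schema:

```python
from pathlib import Path

import pandas as pd


def combine_raw_csvs(raw_dir: str) -> pd.DataFrame:
    """Read every CSV under raw_dir and stack them into one DataFrame."""
    frames = [pd.read_csv(p) for p in sorted(Path(raw_dir).glob("*.csv"))]
    if not frames:
        raise FileNotFoundError(f"no CSV files found in {raw_dir!r}")
    # ignore_index renumbers rows so the combined frame has a clean 0..N index
    return pd.concat(frames, ignore_index=True)
```

The combined frame can then be written out with `to_csv("combined_raw_data.csv", index=False)` before cleaning begins.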
- Source: All CSV files must be located in the following OneDrive folder:
Important:
- Before running the script, download the raw `.csv` files from the OneDrive folder.
- Place them in the local path: `<Local Directory>/5G Zone Prediction System/RawData`
```
COS4007
├── dataset/
│   ├── raw_data/                <-- Raw CSV files from OneDrive go here
│   ├── cleaned_data.csv
│   ├── combined_raw_data.csv
│   ├── data_cleanup.py
│   ├── data_summary.py
│   ├── explore_data.py
│   ├── feature_engineered_data.csv
│   ├── feature_exploration.py
│   ├── features.py
│   └── raw_data_combine.py
├── gui/
│   └── main.py
├── model/
│   ├── clustering.py
│   └── time_series.py
└── results/
    ├── clustering/              <-- CSVs of metrics and trained model
    ├── data/                    <-- Images of exploration of clean data
    ├── eda/                     <-- Data exploration
    ├── feature_eda/             <-- Feature data exploration
    └── time_series/             <-- CSVs of metrics and trained model
```
- File: `5G Zone Prediction System/ProcessedData/clean_data.csv`
- Contains cleaned and feature-enriched records:
  - Time features, GPS coordinates, speed, server latencies, transfer sizes, bitrate metrics
  - Derived metrics: `total_throughput`, `total_bandwidth`, `average_latency`
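The derived metrics might be computed roughly as below. The exact formulas are assumptions inferred from the column names listed elsewhere in this README, not the script's confirmed logic:

```python
import pandas as pd


def add_derived_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Append the three derived metrics to a cleaned measurement frame."""
    df = df.copy()
    # Assumed: throughput is the sum of upload and download bitrates.
    df["total_throughput"] = (
        df["upload_bitrate_mbits/sec"] + df["download_bitrate_rx_mbits/sec"]
    )
    # Assumed: bandwidth aggregates the transfer sizes in both directions.
    df["total_bandwidth"] = (
        df["upload_transfer_size_mbytes"] + df["download_transfer_size_rx_mbytes"]
    )
    # Assumed: average latency across the four measurement servers.
    df["average_latency"] = df[["svr1", "svr2", "svr3", "svr4"]].mean(axis=1)
    return df
```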
- Ensure you have downloaded and placed the raw data in `./5G Zone Prediction System/RawData`
- Execute: `python data_cleanup.py`
- This script assumes all CSVs share the same schema and have no conflicting header definitions.
- `Convert_time` is the key datetime field for temporal analysis and forecasting.
- The cleaned dataset is used by:
  - the KMeans training pipeline
This script trains a KMeans clustering model to classify geographical zones based on 5G network performance metrics.
- File: `<Local Directory>/5G Zone Prediction System/ModelTraining/Clustering/Train-Clustering.py`
Its purpose is to group data points representing physical locations into performance zones using unsupervised KMeans clustering.
Features used: `latitude`, `longitude`, `average_latency`, `total_throughput`, `total_bandwidth`
- Load the cleaned dataset (`clean_data.csv`)
- Apply MinMax scaling to selected features
- Train multiple KMeans models (K = 1 to 10)
- Evaluate using silhouette scores
- Save the best performing model and scaler
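The training loop above could look roughly like this. Function and variable names are illustrative; note that silhouette scores are only defined for two or more clusters, so this sketch starts at K = 2 even though the README lists K = 1 to 10:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

FEATURES = ["latitude", "longitude", "average_latency",
            "total_throughput", "total_bandwidth"]


def train_best_kmeans(df: pd.DataFrame, features=FEATURES, k_max: int = 10):
    """Scale the features, try K = 2..k_max, keep the best silhouette score."""
    scaler = MinMaxScaler()
    X = scaler.fit_transform(df[features])
    best_model, best_score = None, -1.0
    for k in range(2, k_max + 1):  # silhouette is undefined for one cluster
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        score = silhouette_score(X, model.labels_)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, scaler, best_score
```

Both the winning model and the fitted scaler would then be persisted with `joblib.dump`, since new inputs must be scaled identically before prediction.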
- Trained KMeans model and scaler, saved with `joblib`
- CSV summary of clusters with performance labels
- Ensure the input file `clean_data.csv` is placed correctly
- The model is later used by the GUI to label new inputs based on location and network metrics
This script trains an ARIMA model for time-series forecasting of 5G network performance (total throughput) using historical data.
- File: `<Local Directory>/5G Zone Prediction System/ModelTraining/TimeSeries/Train-TimeSeries.py`
Its purpose is to forecast hourly future throughput values using historical throughput trends and time-based features.
- Target: `total_throughput`
- Exogenous: `hour` of the day, `day_of_week`
- Load and parse `clean_data.csv`
- Perform feature engineering similar to that used for `clean_data_Training.csv`, then resample to hourly data
- Generate time features
- Split into train and test sets
- Fit ARIMA model with external regressors
- Evaluate with RMSE and MAE
- Save the trained model using `pickle`
- `arima_model.pkl`: trained ARIMA model
- Plots: predicted vs. actual throughput
- Ensure timestamps are in the format `YYYY-MM-DD HH:MM:SS`
- Data before `2022-07-20 13:00:00` is used for training
This script is the main entry point for the 5G Zone Prediction System, featuring a GUI for interacting with trained clustering and time series models.
- File: `<Local Directory>/5G Zone Prediction System/main.py`
To provide a user-friendly interface to:
- Look up the performance zone of a location using KMeans clustering
- Forecast future throughput using an ARIMA model
- Visualize performance heatmaps from CSV datasets
- KMeans Clustering Model (for zone labeling)
- ARIMA Time Series Model (for throughput forecasting)
All models are pre-trained and loaded from the `TrainedModel/` directory.
- Enter latitude and longitude
- Find closest matching zone and its performance label
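The lookup could be a simple nearest-neighbour search over `zone_cluster_map.csv`. This is a sketch; the column name `performance_label` is an assumption, and plain Euclidean distance on degrees is a simplification that only holds for a small geographic area:

```python
import pandas as pd


def closest_zone(zone_map: pd.DataFrame, lat: float, lon: float) -> pd.Series:
    """Return the zone-map row nearest to (lat, lon).

    Uses squared Euclidean distance in degrees, a reasonable shortcut when
    all points lie in one small region.
    """
    d2 = (zone_map["latitude"] - lat) ** 2 + (zone_map["longitude"] - lon) ** 2
    return zone_map.loc[d2.idxmin()]
```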
- Input start and end time (e.g., `2025-06-01 14:00`)
- Output hourly throughput predictions using ARIMA
- Load CSV with raw performance metrics
- Predict zones for all entries
- Display and save heatmap image (`performance_map.png`)
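The heatmap step might be sketched as below, assuming each row has already been assigned a numeric `zone` by the clustering model; the plotting style and function name are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def save_performance_map(df: pd.DataFrame,
                         out_path: str = "performance_map.png") -> None:
    """Scatter-plot points on a lat/lon grid, coloured by predicted zone."""
    fig, ax = plt.subplots(figsize=(8, 6))
    sc = ax.scatter(df["longitude"], df["latitude"],
                    c=df["zone"], cmap="viridis", s=12)
    fig.colorbar(sc, ax=ax, label="zone")
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    ax.set_title("5G performance zones")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```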
Ensure the following trained assets are available:
```
TrainedModel/
├── Clustering/
│   ├── cluster_label_kmeans.pkl
│   ├── cluster_label_scaler.pkl
│   ├── clustering_output.csv
│   └── zone_cluster_map.csv
└── TimeSeries/
    └── arima_model.pkl
```
- Execute: `python main.py`
The Python GUI will launch with three main functions.
- Valid latitude/longitude bounds are checked dynamically using `zone_cluster_map.csv`
- The output `performance_map.png` will be saved in the script's directory
- The CSV required for the heatmap must include:
  - `latitude`, `longitude`
  - `upload_bitrate_mbits/sec`, `download_bitrate_rx_mbits/sec`
  - `upload_transfer_size_mbytes`, `download_transfer_size_rx_mbytes`
  - `svr1` to `svr4`