This script processes raw 5G network performance data files into a clean and structured dataset ready for clustering and time-series forecasting tasks.
Its purpose is to consolidate and clean raw network measurement data from multiple CSV files: merging them, extracting essential features, computing derived metrics, and saving a cleaned dataset for use in machine learning pipelines.
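The merge step can be sketched as follows. This is a minimal illustration, not the script's confirmed implementation; the helper name is made up, and it relies on the assumption (stated in the notes below) that every raw CSV shares one schema:

```python
from pathlib import Path

import pandas as pd


def combine_raw_csvs(raw_dir: str) -> pd.DataFrame:
    """Read every CSV under raw_dir and stack them into one DataFrame."""
    frames = [pd.read_csv(p) for p in sorted(Path(raw_dir).glob("*.csv"))]
    if not frames:
        raise FileNotFoundError(f"no CSV files found in {raw_dir!r}")
    # ignore_index renumbers rows so the combined frame has a clean 0..N index
    return pd.concat(frames, ignore_index=True)
```

The combined frame can then be written out with `to_csv("combined_raw_data.csv", index=False)` before cleaning begins.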
- Source: All CSV files must be located in the following OneDrive folder:
Important:
- Before running the script, download the raw `.csv` files from the OneDrive folder.
- Place them in the local path: `<Local Directory>/5G Zone Prediction System/RawData`
```
COS4007
├── dataset/
│   ├── raw_data/                <-- Raw CSV files from OneDrive go here
│   ├── cleaned_data.csv
│   ├── combined_raw_data.csv
│   ├── data_cleanup.py
│   ├── data_summary.py
│   ├── explore_data.py
│   ├── feature_engineered_data.csv
│   ├── feature_exploration.py
│   ├── features.py
│   └── raw_data_combine.py
├── gui/
│   └── main.py
├── model/
│   ├── clustering.py
│   └── time_series.py
└── results/
    ├── clustering/              <-- CSVs of metrics and trained model
    ├── data/                    <-- Images of exploration of clean data
    ├── eda/                     <-- Data exploration
    ├── feature_eda/             <-- Feature data exploration
    └── time_series/             <-- CSVs of metrics and trained model
```
- File: `5G Zone Prediction System/ProcessedData/clean_data.csv`
- Contains cleaned and feature-enriched records:
  - Time features, GPS coordinates, speed, server latencies, transfer sizes, bitrate metrics
  - Derived metrics: `total_throughput`, `total_bandwidth`, `average_latency`
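The derived metrics might be computed roughly as below. The exact formulas are assumptions inferred from the column names listed elsewhere in this README, not the script's confirmed logic:

```python
import pandas as pd


def add_derived_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Append the three derived metrics to a cleaned measurement frame."""
    df = df.copy()
    # Assumed: throughput is the sum of upload and download bitrates.
    df["total_throughput"] = (
        df["upload_bitrate_mbits/sec"] + df["download_bitrate_rx_mbits/sec"]
    )
    # Assumed: bandwidth aggregates the transfer sizes in both directions.
    df["total_bandwidth"] = (
        df["upload_transfer_size_mbytes"] + df["download_transfer_size_rx_mbytes"]
    )
    # Assumed: average latency across the four measurement servers.
    df["average_latency"] = df[["svr1", "svr2", "svr3", "svr4"]].mean(axis=1)
    return df
```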
- Ensure you have downloaded and placed the raw data in `./5G Zone Prediction System/RawData`
- Execute: `python data_cleanup.py`
- This script assumes all CSVs share the same schema and have no conflicting header definitions.
- `Convert_time` is the key datetime field for temporal analysis and forecasting.
- The cleaned dataset is used by:
  - the KMeans training pipeline
This script trains a KMeans clustering model to classify geographical zones based on 5G network performance metrics.
- File: `<Local Directory>/5G Zone Prediction System/ModelTraining/Clustering/Train-Clustering.py`
Its purpose is to group data points representing physical locations into performance zones using unsupervised KMeans clustering.
Features used: `latitude`, `longitude`, `average_latency`, `total_throughput`, `total_bandwidth`
- Load the cleaned dataset (`clean_data.csv`)
- Apply MinMax scaling to selected features
- Train multiple KMeans models (K = 1 to 10)
- Evaluate using silhouette scores
- Save the best performing model and scaler
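The training loop above could look roughly like this. Function and variable names are illustrative; note that silhouette scores are only defined for two or more clusters, so this sketch starts at K = 2 even though the README lists K = 1 to 10:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

FEATURES = ["latitude", "longitude", "average_latency",
            "total_throughput", "total_bandwidth"]


def train_best_kmeans(df: pd.DataFrame, features=FEATURES, k_max: int = 10):
    """Scale the features, try K = 2..k_max, keep the best silhouette score."""
    scaler = MinMaxScaler()
    X = scaler.fit_transform(df[features])
    best_model, best_score = None, -1.0
    for k in range(2, k_max + 1):  # silhouette is undefined for one cluster
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        score = silhouette_score(X, model.labels_)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, scaler, best_score
```

Both the winning model and the fitted scaler would then be persisted with `joblib.dump`, since new inputs must be scaled identically before prediction.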
- Trained KMeans model and scaler, saved with `joblib`
- CSV summary of clusters with performance labels
- Ensure the input file `clean_data.csv` is placed correctly
- The model is later used by the GUI to label new inputs based on location and network metrics
This script trains an ARIMA model for time-series forecasting of 5G network performance (total throughput) using historical data.
- File: `<Local Directory>/5G Zone Prediction System/ModelTraining/TimeSeries/Train-TimeSeries.py`
Its purpose is to forecast hourly future throughput values using historical throughput trends and time-based features.
- Target: `total_throughput`
- Exogenous: `hour` of the day, `day_of_week`
- Load and parse `clean_data.csv`
- Perform feature engineering similar to that used for `clean_data_Training.csv`, then resample to hourly data
- Generate time features
- Split into train and test sets
- Fit ARIMA model with external regressors
- Evaluate with RMSE and MAE
- Save the trained model using `pickle`
- `arima_model.pkl`: trained ARIMA model
- Plots: predicted vs. actual throughput
- Ensure timestamps are in the format `YYYY-MM-DD HH:MM:SS`
- Data before `2022-07-20 13:00:00` is used for training
This script is the main entry point for the 5G Zone Prediction System, featuring a GUI for interacting with trained clustering and time series models.
- File: `<Local Directory>/5G Zone Prediction System/main.py`
To provide a user-friendly interface to:
- Look up the performance zone of a location using KMeans clustering
- Forecast future throughput using an ARIMA model
- Visualize performance heatmaps from CSV datasets
- KMeans Clustering Model (for zone labeling)
- ARIMA Time Series Model (for throughput forecasting)
All models are pre-trained and loaded from the `TrainedModel/` directory.
- Enter latitude and longitude
- Find closest matching zone and its performance label
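The lookup could be a simple nearest-neighbour search over `zone_cluster_map.csv`. This is a sketch; the column name `performance_label` is an assumption, and plain Euclidean distance on degrees is a simplification that only holds for a small geographic area:

```python
import pandas as pd


def closest_zone(zone_map: pd.DataFrame, lat: float, lon: float) -> pd.Series:
    """Return the zone-map row nearest to (lat, lon).

    Uses squared Euclidean distance in degrees, a reasonable shortcut when
    all points lie in one small region.
    """
    d2 = (zone_map["latitude"] - lat) ** 2 + (zone_map["longitude"] - lon) ** 2
    return zone_map.loc[d2.idxmin()]
```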
- Input start and end time (e.g., `2025-06-01 14:00`)
- Output hourly throughput predictions using ARIMA
- Load CSV with raw performance metrics
- Predict zones for all entries
- Display and save heatmap image (`performance_map.png`)
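The heatmap step might be sketched as below, assuming each row has already been assigned a numeric `zone` by the clustering model; the plotting style and function name are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def save_performance_map(df: pd.DataFrame,
                         out_path: str = "performance_map.png") -> None:
    """Scatter-plot points on a lat/lon grid, coloured by predicted zone."""
    fig, ax = plt.subplots(figsize=(8, 6))
    sc = ax.scatter(df["longitude"], df["latitude"],
                    c=df["zone"], cmap="viridis", s=12)
    fig.colorbar(sc, ax=ax, label="zone")
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    ax.set_title("5G performance zones")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```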
Ensure the following trained assets are available:
```
TrainedModel/
├── Clustering/
│   ├── cluster_label_kmeans.pkl
│   ├── cluster_label_scaler.pkl
│   ├── clustering_output.csv
│   └── zone_cluster_map.csv
└── TimeSeries/
    └── arima_model.pkl
```
- Execute: `python main.py`
The Python GUI will launch with three main functions.
- Valid latitude/longitude bounds are checked dynamically using `zone_cluster_map.csv`
- The output `performance_map.png` will be saved in the script's directory
- The CSV required for the heatmap must include:
  - `latitude`, `longitude`
  - `upload_bitrate_mbits/sec`, `download_bitrate_rx_mbits/sec`
  - `upload_transfer_size_mbytes`, `download_transfer_size_rx_mbytes`
  - `svr1` to `svr4`