This project is a fully cloud-based data pipeline built using Microsoft's Azure ecosystem. It ingests COVID-19 data from Our World in Data, processes it through various layers of transformation, and stores the curated data in an Azure SQL Database for visualization in Power BI.
Python API Our World in Data
↓
Azure Data Lake Gen2 (parquet)
↓
Synapse Serverless - VIEW bronze.covid19data
↓
Synapse Serverless - VIEW silver.tables (SQL transformations)
↓
Azure Data Factory - Loop + Copy Activity
↓
Azure SQL Database (gold.tables)
↓
Power BI - Direct connection to Azure SQL Database
Below is a screenshot of the ADF pipeline that orchestrates the data movement and transformation process:
This pipeline performs the following steps:
- Lookup Views: Reads metadata about views to be processed.
- Set Variable: Dynamically assigns view names.
- ForEach Loop: Iterates over each view and runs the copy activity.
- Copy Activity: Transfers data from Silver Layer to Gold Layer in Azure SQL.
✅ All activities are executed dynamically and succeeded during testing.
The main motivation for this project was to gain hands-on experience building a real-world, end-to-end data pipeline in the cloud. I wanted to simulate an actual enterprise-grade data flow using modern tools, with special attention to performance, cost-efficiency, and maintainability.
COVID-19 data was chosen because it's publicly available, updated daily, and offers a rich variety of metrics across different countries and dates—perfect for testing time-series transformations, incremental loads, and business intelligence use cases.
Our World in Data offers a comprehensive, reliable, and well-documented API for COVID-19 data. It's maintained by a reputable academic institution and includes detailed information about cases, deaths, testing, vaccinations, and more, covering all countries globally. This makes it an ideal data source for building a globally applicable analytics solution.
I chose the Microsoft Azure ecosystem for the following reasons:
- Hands-on Experience: Working with Azure services such as Synapse, Data Factory, and Data Lake allowed me to practice skills in high demand in the industry.
- Cost-Effectiveness: Services like Synapse Serverless and Azure SQL Database offer great performance at a low cost, making them suitable for individual projects.
- End-to-End Solution: Azure offers all components needed to build a complete pipeline—from ingestion and storage to transformation and visualization—within the same environment.
- Real-World Relevance: The design and tools mirror those used in professional data engineering teams, making the project experience as close as possible to enterprise use cases.
- Python – For extracting data from the API and writing to Azure Data Lake Gen2
- Azure Data Lake Gen2 – Raw and curated data storage in Parquet format
- Azure Synapse Analytics (Serverless) – SQL views for bronze and silver layer transformations
- Azure Data Factory – Orchestrating data movement and copy operations
- Azure SQL Database – Gold layer tables for serving data to Power BI
- Power BI – Reporting and data visualization
.
├── er_diagram/
├── powerbi/
├── sql/
│ ├── Bronze/bronze_covid19data.sql
│ ├── Silver/
│ ├── Dimensions/
│ ├── Facts/
│ └── Gold/
│ ├── Dimensions/
│ ├── Facts/
├── src/
│ ├── azure_upload.py
├── main.py
└── README.md
Power BI connects directly to the Azure SQL Database and visualizes trends such as annual cases, deaths, and vaccination progress in Canada over time.
Total Confirmed Cases & Deaths
Vaccination Progress (%)
Annual Case Trends
Covid-19 Tests
Hospitalization and ICU Rates Over Time
The interactive dashboard allows the user to filter and drill down into data per year, month, or date group, which facilitates deeper insights into the evolution of the pandemic in Canada.
To support the dashboard visuals, a dimensional data model was designed using the star schema approach. This consists of fact and dimension tables linked by primary and foreign keys, enabling optimized queries and performance in Power BI.
This project helped reinforce important concepts in data modeling, ETL, and interactive dashboard creation. By connecting Power BI to an Azure SQL Database, the entire flow from raw data ingestion to actionable insights was demonstrated.
Potential future improvements include:
Adding international comparisons (e.g., USA, BR).
Automating data refreshes via Azure Data Factory.
🏆 This end-to-end solution is a practical example of using modern cloud-based BI tools to deal with real-world public health challenges.