Data Engineering Zoomcamp 2025 Project

Overview

This repository hosts the source code and documentation for the 2025 Data Engineering Zoomcamp project.

The README content is also accessible at https://github.com/urbanclimatefr/de-zoomcamp-2025-project-attempt2

Key changes from first project attempt

  1. Infrastructure as Code with Terraform was adopted.

  2. A 5-minute batch processing pipeline was implemented with Kestra. Feedback on the first attempt suggested adopting a streaming pipeline on top of the real-time API, but building a true streaming pipeline was not feasible within the time constraints, so a frequent batch schedule was used instead.

  3. Both temperature and humidity data are fetched from the API, and the transformation is done with Python pandas inside Kestra, because transforming the JSON data with Kestra's native Python tooling proved difficult.

  4. A dbt transformation was added to calculate the Hong Kong Heat Index from temperature and humidity.

  5. A third report page was added for the Hong Kong Heat Index.

  6. An explanation of the clustering and partitioning strategy for the final destination table in BigQuery was added (see the sketch after this list).

  7. The overall architecture and tech stack, and what each component does, are elaborated in more detail.
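
As an illustration of point 6: a common strategy for a table like this is to partition by the reading timestamp and cluster by station, so the report's Station and Date filters prune the data BigQuery scans. Below is a minimal sketch using the BigQuery Python client; the project, dataset, table, and column names are hypothetical, not taken from the actual project.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table; all names are illustrative only.
table_id = "my-project.weather.weather_readings"

schema = [
    bigquery.SchemaField("reading_ts", "TIMESTAMP"),
    bigquery.SchemaField("station", "STRING"),
    bigquery.SchemaField("temperature_c", "FLOAT"),
    bigquery.SchemaField("humidity_pct", "FLOAT"),
]

table = bigquery.Table(table_id, schema=schema)

# Partition by day on the reading timestamp so date-filtered report
# queries only scan the partitions they need.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="reading_ts",
)

# Cluster by station, since the report also filters by station.
table.clustering_fields = ["station"]

client.create_table(table)
```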


Goal

The objective of this project is to develop an end-to-end batch data pipeline covering ingestion, processing, transformation, persistence, and visualization. Using data from the Hong Kong Observatory, users can view near-real-time temperature and Hong Kong Heat Index data (updated every 5 minutes) through a Looker Studio report.


Data Source

Hong Kong Observatory's API Web Service, which offers various APIs for collecting real-time weather data.
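
For illustration, a current weather report can be pulled with a single HTTP call. This is a minimal sketch assuming the Observatory's Open Data API and its rhrread (current weather report) dataset; consult the HKO API documentation for the authoritative endpoint and response schema.

```python
import requests

# HKO Open Data weather API (assumed endpoint; see the official docs).
URL = "https://data.weather.gov.hk/weatherAPI/opendata/weather.php"

resp = requests.get(URL, params={"dataType": "rhrread", "lang": "en"}, timeout=30)
resp.raise_for_status()
report = resp.json()

# The response carries per-station temperature and humidity readings
# as lists of {place, value, unit} records.
print(report["temperature"]["data"][:3])
print(report["humidity"]["data"])
```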


Data Collection

Rather than using pre-existing historical data, this project gathers live data and processes it with a batch pipeline that runs every 5 minutes. The processed data is then displayed in dashboards created with Google Cloud's Looker Studio.
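
Within the pipeline, the raw JSON is flattened into tabular rows with pandas before loading (key change 3 above). Here is a minimal sketch of that step, continuing from the fetch example in the previous section and assuming its response shape; the column names are illustrative.

```python
import pandas as pd

# `report` is the parsed JSON from the fetch sketch above.
temps = pd.DataFrame(report["temperature"]["data"]).rename(
    columns={"place": "station", "value": "temperature_c"}
)
# recordTime is assumed to be an ISO 8601 timestamp in the response.
temps["reading_ts"] = pd.Timestamp(report["temperature"]["recordTime"])

hum = pd.DataFrame(report["humidity"]["data"]).rename(
    columns={"place": "station", "value": "humidity_pct"}
)

# Left-join humidity onto temperature; stations without a humidity
# reading keep NaN.
df = temps.merge(hum[["station", "humidity_pct"]], on="station", how="left")
print(df[["station", "temperature_c", "humidity_pct", "reading_ts"]].head())
```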


Data Visualization

The culmination of this project is a Weather Report built in Looker Studio.

The initial page of the report displays summary statistics, allowing users to filter by Station and/or Date.

The first dashboard on this page indicates the number of records collected for the chosen station and date.

The second dashboard presents the lowest, average, and highest temperatures for the selected station and date.
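
For illustration, the same aggregation can be expressed in pandas over the flattened rows from the earlier sketch (in the project, Looker Studio computes this against the BigQuery table):

```python
# Record count and min / mean / max temperature per station and date,
# mirroring the first two dashboards.
df["date"] = df["reading_ts"].dt.date
summary = (
    df.groupby(["station", "date"])["temperature_c"]
      .agg(records="count", lowest="min", average="mean", highest="max")
      .reset_index()
)
print(summary.head())
```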



The second page of the report illustrates the time series temperature data for the selected station and date range.



The third page of the report shows the Hong Kong Heat Index at various places in Hong Kong.

Flowchart


[Diagram: data flow]


Overall Architecture

  • Data Ingestion Layer: handles raw data collection from various sources such as databases, APIs, streaming services, or files like CSV.
      ◦ Tools: Kestra.
      ◦ Purpose: ensures raw data is securely and efficiently brought into the data platform.

  • Data Transformation Layer: where raw data is cleaned, standardized, and structured to be useful for analysis.
      ◦ Tools: dbt (Data Build Tool).
      ◦ Purpose: implements business logic (e.g., converting formats, deduplication, or aggregations); see the sketch after this list.

  • Data Warehouse/Storage Layer: the central repository for structured and validated data.
      ◦ Tools: BigQuery (cloud-based).
      ◦ Purpose: stores clean, high-quality datasets optimized for querying and analysis.

  • Reporting and Analytics Layer: processes data from the warehouse for visualization, reporting, and advanced analytics.
      ◦ Tools: Looker Studio.
      ◦ Purpose: provides actionable insights based on prepared data.
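
The central piece of business logic in this project is the Hong Kong Heat Index calculation (key change 4), which the project implements in dbt SQL. As an illustration only, the sketch below computes a heat index in Python, using the US NWS Rothfusz regression as a stand-in because the exact Hong Kong Heat Index formula is not reproduced in this README.

```python
def heat_index_celsius(temp_c: float, humidity_pct: float) -> float:
    """Stand-in heat index (US NWS Rothfusz regression), NOT the
    official Hong Kong Heat Index formula."""
    t = temp_c * 9 / 5 + 32  # the regression is defined in Fahrenheit
    rh = humidity_pct
    hi_f = (
        -42.379
        + 2.04901523 * t
        + 10.14333127 * rh
        - 0.22475541 * t * rh
        - 6.83783e-3 * t * t
        - 5.481717e-2 * rh * rh
        + 1.22874e-3 * t * t * rh
        + 8.5282e-4 * t * rh * rh
        - 1.99e-6 * t * t * rh * rh
    )
    return (hi_f - 32) * 5 / 9


# e.g. 32 degrees C at 75% relative humidity feels considerably hotter
print(round(heat_index_celsius(32.0, 75.0), 1))
```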

Tech Stack

| Component | Role/Function |
| --- | --- |
| Terraform | Provisions the GCP infrastructure (Cloud Storage, BigQuery) as code. |
| Kestra | Orchestrates the batch ingestion pipeline. |
| BigQuery | Executes SQL-based transformations and hosts the data warehouse. |
| dbt (Data Build Tool) | Orchestrates and automates SQL transformations in BigQuery. |
| Google Cloud Storage | Temporarily stores raw data for ingestion. |
| Looker Studio | Visualizes prepared data for reporting and business intelligence. |

Data Pipelines


See data pipelines

Prerequisites

Before executing the pipeline, the necessary infrastructure must be provisioned.

Infrastructure as code with Terraform

This involves creating the GCP Cloud Storage and BigQuery infrastructure as code.

Terraform

Docker Container Creation

This involves building a local Kestra image and running Kestra and Postgres containers.

Infrastructure Setup with Terraform

Kestra


Lessons Learned

  • Early decisions on data visualization help in defining the scope and the type of processing required for the pipeline. Starting from the desired end state and working backward is an effective strategy to maintain focus.

  • Documentation can be as time-consuming as the development process itself.
