Demystifying large language models by pretraining a Transformer decoder architecture (GPT-2) from scratch on a clean, public-domain dataset: Project Gutenberg.
This project is an educational, hands-on walkthrough of:
- How the GPT architecture works under the hood
- How to train a language model using only decoder blocks
- How attention mechanisms empower modern LLMs
We pretrain the language model on Project Gutenberg, a large corpus of public-domain books.
- Text data is cleaned, tokenized, and chunked into fixed-length sequences
- A Byte-Pair Encoding (BPE) tokenizer or simple character-level tokenization is applied (custom or HuggingFace-supported), as sketched below
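To make this step concrete, here is a minimal sketch that applies the HuggingFace GPT-2 BPE tokenizer and chunks the token stream into fixed-length input/target pairs. The file path and block size are illustrative placeholders, not the repository's actual configuration.

```python
# Minimal sketch: BPE-tokenize cleaned text and chunk it into fixed-length
# next-token-prediction pairs. The file path and block size are placeholders.
import torch
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")          # standard GPT-2 BPE vocabulary

with open("data/gutenberg_clean.txt", encoding="utf-8") as f:  # hypothetical path
    ids = tokenizer.encode(f.read())

block_size = 1024                                              # GPT-2 context length
n_chunks = (len(ids) - 1) // block_size
data = torch.tensor(ids[: n_chunks * block_size + 1])

# Inputs are each chunk's tokens; targets are the same tokens shifted by one.
inputs = data[:-1].view(n_chunks, block_size)
targets = data[1:].view(n_chunks, block_size)
```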
The main building blocks, several of which are sketched in code after this list, are:
- Dataset Downloader (Project Gutenberg) and Pre-processor
- Tokenizer (BPE-level)
- Positional Encoding
- Multi-Head Self-Attention and FlashAttention
- GPT-style Model (stacked decoder blocks)
- Distributed GPU Training (Lightning Fabric)
- Checkpoint Saving & Logging
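The downloader and pre-processor can be as simple as fetching plain-text files and stripping the Project Gutenberg license boilerplate. The sketch below assumes gutenberg.org's public plain-text URL pattern and the standard `*** START ... ***` / `*** END ... ***` markers; the book IDs and output path are hypothetical, chosen only to match the tokenization sketch above.

```python
# Minimal sketch: download a few Project Gutenberg books as plain text and
# strip the license header/footer. Book IDs, the URL pattern, and the output
# path are illustrative assumptions, not the repository's exact pipeline.
from pathlib import Path
import requests

BOOK_IDS = [1342, 84, 2701]  # e.g. Pride and Prejudice, Frankenstein, Moby-Dick

def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the body between the '*** START ...' and '*** END ...' markers."""
    start = text.find("*** START")
    end = text.find("*** END")
    if start != -1 and end != -1:
        text = text[text.find("\n", start) + 1 : end]
    return text.strip()

Path("data").mkdir(exist_ok=True)
with open("data/gutenberg_clean.txt", "w", encoding="utf-8") as out:
    for book_id in BOOK_IDS:
        url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
        body = strip_gutenberg_boilerplate(requests.get(url, timeout=30).text)
        out.write(body + "\n\n")
```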
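For the model itself, the sketch below shows the core pieces in one place: learned token and position embeddings (as in GPT-2), causal multi-head self-attention built on `torch.nn.functional.scaled_dot_product_attention` (which dispatches to fused FlashAttention-style kernels when the hardware and dtype allow it), and a stack of pre-LayerNorm decoder blocks. All names and sizes are illustrative, not the repository's exact implementation.

```python
# Minimal sketch of a GPT-style decoder-only model. Sizes, names, and defaults
# are illustrative, not the repository's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask, using PyTorch's fused
    scaled_dot_product_attention (FlashAttention-style kernels when available)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim) so attention runs per head.
        q, k, v = [t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2) for t in (q, k, v)]
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal mask applied internally
        return self.proj(y.transpose(1, 2).contiguous().view(B, T, C))

class Block(nn.Module):
    """Pre-LayerNorm decoder block: attention and MLP, each with a residual connection."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size: int, block_size: int, d_model: int = 768, n_heads: int = 12, n_layers: int = 12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # token embeddings
        self.pos_emb = nn.Embedding(block_size, d_model)  # learned positional embeddings, as in GPT-2
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)         # inject positional information
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))                 # (batch, seq_len, vocab_size) logits

# Smoke test with tiny dimensions.
model = GPT(vocab_size=50257, block_size=1024, d_model=128, n_heads=4, n_layers=2)
logits = model(torch.randint(0, 50257, (2, 64)))          # -> torch.Size([2, 64, 50257])
```

Pre-LayerNorm blocks are used in the sketch because they tend to train more stably at depth than the original post-LN arrangement.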
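Distributed training, checkpointing, and logging are handled through Lightning Fabric. The sketch below is a minimal loop under assumed hyperparameters, with a stand-in model and random tokens so it runs on its own; in practice you would wire in the GPT model and the tokenized Gutenberg chunks from the sketches above.

```python
# Minimal sketch of a Lightning Fabric training loop with checkpoint saving
# and rank-zero logging. Model, data, and hyperparameters are stand-ins.
from pathlib import Path
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from lightning.fabric import Fabric

fabric = Fabric(accelerator="auto", devices="auto", strategy="ddp", precision="bf16-mixed")
fabric.launch()                                            # spawns one process per device

vocab_size, block_size = 50257, 1024
model = nn.Sequential(nn.Embedding(vocab_size, 256), nn.Linear(256, vocab_size))  # stand-in for the GPT model
inputs = torch.randint(0, vocab_size, (64, block_size))    # stand-in for the tokenized chunks
targets = torch.randint(0, vocab_size, (64, block_size))
dataloader = DataLoader(TensorDataset(inputs, targets), batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model, optimizer = fabric.setup(model, optimizer)          # moves to device, wraps for DDP
dataloader = fabric.setup_dataloaders(dataloader)          # adds a distributed sampler

Path("checkpoints").mkdir(exist_ok=True)
for step, (x, y) in enumerate(dataloader):
    logits = model(x)                                      # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad()
    fabric.backward(loss)                                  # replaces loss.backward()
    optimizer.step()

    if step % 10 == 0:
        fabric.print(f"step {step}: loss {loss.item():.4f}")   # prints on rank 0 only
    if step % 100 == 0:
        fabric.save("checkpoints/gpt.ckpt", {"model": model, "optimizer": optimizer, "step": step})
```

Launching the script with plain `python` is enough: `fabric.launch()` starts one process per visible device, and `fabric.save` writes a single consolidated checkpoint from rank 0.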
MIT License © 2025