Demystifying large language models by pretraining a Transformer decoder architecture (GPT-2) from scratch on a clean, public-domain dataset: Project Gutenberg.
This project is an educational, hands-on walkthrough of:
- How the GPT architecture works under the hood
- How to train a language model using only decoder blocks
- How attention mechanisms empower modern LLMs
We pretrain the language model on Project Gutenberg, a large corpus of public-domain books.
- Text data is cleaned, tokenized, and chunked into fixed-length sequences
- A Byte-Pair Encoding (BPE) tokenizer or simple character-level tokenization is applied (custom or HuggingFace-supported), as sketched below
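To make this step concrete, here is a minimal sketch that applies the HuggingFace GPT-2 BPE tokenizer and chunks the token stream into fixed-length input/target pairs. The file path and block size are illustrative placeholders, not the repository's actual configuration.

```python
# Minimal sketch: BPE-tokenize cleaned text and chunk it into fixed-length
# next-token-prediction pairs. The file path and block size are placeholders.
import torch
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")          # standard GPT-2 BPE vocabulary

with open("data/gutenberg_clean.txt", encoding="utf-8") as f:  # hypothetical path
    ids = tokenizer.encode(f.read())

block_size = 1024                                              # GPT-2 context length
n_chunks = (len(ids) - 1) // block_size
data = torch.tensor(ids[: n_chunks * block_size + 1])

# Inputs are each chunk's tokens; targets are the same tokens shifted by one.
inputs = data[:-1].view(n_chunks, block_size)
targets = data[1:].view(n_chunks, block_size)
```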
The main building blocks, several of which are sketched in code after this list, are:
- Dataset Downloader (Project Gutenberg) and Pre-processor
- Tokenizer (BPE-level)
- Positional Encoding
- Multi-Head Self-Attention and FlashAttention
- GPT-style Model (stacked decoder blocks)
- Distributed GPU Training (Lightning Fabric)
- Checkpoint Saving & Logging
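The downloader and pre-processor can be as simple as fetching plain-text files and stripping the Project Gutenberg license boilerplate. The sketch below assumes gutenberg.org's public plain-text URL pattern and the standard `*** START ... ***` / `*** END ... ***` markers; the book IDs and output path are hypothetical, chosen only to match the tokenization sketch above.

```python
# Minimal sketch: download a few Project Gutenberg books as plain text and
# strip the license header/footer. Book IDs, the URL pattern, and the output
# path are illustrative assumptions, not the repository's exact pipeline.
from pathlib import Path
import requests

BOOK_IDS = [1342, 84, 2701]  # e.g. Pride and Prejudice, Frankenstein, Moby-Dick

def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the body between the '*** START ...' and '*** END ...' markers."""
    start = text.find("*** START")
    end = text.find("*** END")
    if start != -1 and end != -1:
        text = text[text.find("\n", start) + 1 : end]
    return text.strip()

Path("data").mkdir(exist_ok=True)
with open("data/gutenberg_clean.txt", "w", encoding="utf-8") as out:
    for book_id in BOOK_IDS:
        url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
        body = strip_gutenberg_boilerplate(requests.get(url, timeout=30).text)
        out.write(body + "\n\n")
```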
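For the model itself, the sketch below shows the core pieces in one place: learned token and position embeddings (as in GPT-2), causal multi-head self-attention built on `torch.nn.functional.scaled_dot_product_attention` (which dispatches to fused FlashAttention-style kernels when the hardware and dtype allow it), and a stack of pre-LayerNorm decoder blocks. All names and sizes are illustrative, not the repository's exact implementation.

```python
# Minimal sketch of a GPT-style decoder-only model. Sizes, names, and defaults
# are illustrative, not the repository's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask, using PyTorch's fused
    scaled_dot_product_attention (FlashAttention-style kernels when available)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim) so attention runs per head.
        q, k, v = [t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2) for t in (q, k, v)]
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal mask applied internally
        return self.proj(y.transpose(1, 2).contiguous().view(B, T, C))

class Block(nn.Module):
    """Pre-LayerNorm decoder block: attention and MLP, each with a residual connection."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size: int, block_size: int, d_model: int = 768, n_heads: int = 12, n_layers: int = 12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # token embeddings
        self.pos_emb = nn.Embedding(block_size, d_model)  # learned positional embeddings, as in GPT-2
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)         # inject positional information
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))                 # (batch, seq_len, vocab_size) logits

# Smoke test with tiny dimensions.
model = GPT(vocab_size=50257, block_size=1024, d_model=128, n_heads=4, n_layers=2)
logits = model(torch.randint(0, 50257, (2, 64)))          # -> torch.Size([2, 64, 50257])
```

Pre-LayerNorm blocks are used in the sketch because they tend to train more stably at depth than the original post-LN arrangement.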
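Distributed training, checkpointing, and logging are handled through Lightning Fabric. The sketch below is a minimal loop under assumed hyperparameters, with a stand-in model and random tokens so it runs on its own; in practice you would wire in the GPT model and the tokenized Gutenberg chunks from the sketches above.

```python
# Minimal sketch of a Lightning Fabric training loop with checkpoint saving
# and rank-zero logging. Model, data, and hyperparameters are stand-ins.
from pathlib import Path
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from lightning.fabric import Fabric

fabric = Fabric(accelerator="auto", devices="auto", strategy="ddp", precision="bf16-mixed")
fabric.launch()                                            # spawns one process per device

vocab_size, block_size = 50257, 1024
model = nn.Sequential(nn.Embedding(vocab_size, 256), nn.Linear(256, vocab_size))  # stand-in for the GPT model
inputs = torch.randint(0, vocab_size, (64, block_size))    # stand-in for the tokenized chunks
targets = torch.randint(0, vocab_size, (64, block_size))
dataloader = DataLoader(TensorDataset(inputs, targets), batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model, optimizer = fabric.setup(model, optimizer)          # moves to device, wraps for DDP
dataloader = fabric.setup_dataloaders(dataloader)          # adds a distributed sampler

Path("checkpoints").mkdir(exist_ok=True)
for step, (x, y) in enumerate(dataloader):
    logits = model(x)                                      # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad()
    fabric.backward(loss)                                  # replaces loss.backward()
    optimizer.step()

    if step % 10 == 0:
        fabric.print(f"step {step}: loss {loss.item():.4f}")   # prints on rank 0 only
    if step % 100 == 0:
        fabric.save("checkpoints/gpt.ckpt", {"model": model, "optimizer": optimizer, "step": step})
```

Launching the script with plain `python` is enough: `fabric.launch()` starts one process per visible device, and `fabric.save` writes a single consolidated checkpoint from rank 0.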
MIT License © 2025