Building-LLMs-from-Scratch

🧠 What This Project Is About

Demystifying large language models by pretraining a Transformer Decoder architecture (GPT-2) from scratch on a clean, public domain dataset — Project Gutenberg.

This project serves as an educational and hands-on walkthrough to understand:

  • How the GPT architecture works under the hood
  • How to train language models using only a decoder block
  • How attention mechanisms empower modern LLMs (see the sketch after this list)
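
The heart of a decoder-only model is causal (masked) self-attention: each token may only attend to the tokens before it. Below is a minimal, illustrative PyTorch sketch (not this repository's exact code) that uses `torch.nn.functional.scaled_dot_product_attention` with `is_causal=True`, which PyTorch routes to Flash Attention kernels when the hardware and dtypes allow it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Minimal multi-head causal self-attention (illustrative sketch only)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # joint query/key/value projection
        self.proj = nn.Linear(d_model, d_model)     # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (B, n_heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        # is_causal=True masks future positions; dispatches to Flash Attention when available
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)

x = torch.randn(2, 16, 256)              # (batch, sequence length, embedding dim)
print(CausalSelfAttention()(x).shape)    # torch.Size([2, 16, 256])
```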

📚 Dataset

We use Project Gutenberg, a large corpus of public-domain books, for pretraining the language model.

  • Text data is cleaned, tokenized, and chunked into fixed-length sequences
  • A Byte-Pair Encoding (BPE) tokenizer (custom or HuggingFace-supported) or simple character-level tokenization is applied; a minimal sketch follows this list
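
As a concrete illustration, the snippet below sketches the tokenize-and-chunk step using the pretrained GPT-2 BPE tokenizer from HuggingFace `transformers`. The file name and block size are placeholders; the repository's own pre-processor may differ.

```python
import torch
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def chunk_text(text: str, block_size: int = 1024) -> torch.Tensor:
    """Tokenize raw book text and split it into fixed-length training sequences."""
    ids = tokenizer.encode(text)                       # BPE token ids
    n_chunks = len(ids) // block_size                  # drop the ragged tail
    ids = torch.tensor(ids[: n_chunks * block_size])
    return ids.view(n_chunks, block_size)              # shape: (n_chunks, block_size)

# "gutenberg_book.txt" is a hypothetical cleaned text file produced by the downloader
chunks = chunk_text(open("gutenberg_book.txt", encoding="utf-8").read())
print(chunks.shape)
```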

🏗️ What's Implemented

  • Dataset Downloader (Project Gutenberg) and Pre-processor
  • Tokenizer (BPE-level)
  • Positional Encoding
  • Multi-Head Self Attention and Flash Attention
  • GPT-style Model (stacked decoder blocks)
  • Distributed GPU Training with Lightning Fabric (see the training-loop sketch after this list)
  • Checkpoint saving & logging
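
To give a feel for the training side, here is a condensed sketch of a Lightning Fabric loop with checkpoint saving. The tiny embedding-plus-linear "model" and the random token batches are stand-ins for the real GPT model and Gutenberg dataloader; only the Fabric wiring (`setup`, `backward`, `save`) mirrors the library's API.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from lightning.fabric import Fabric

fabric = Fabric(accelerator="auto", devices=1)  # single device for this sketch
fabric.launch()

vocab_size, block_size = 50257, 128
model = nn.Sequential(nn.Embedding(vocab_size, 256), nn.Linear(256, vocab_size))  # stand-in for the GPT
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model, optimizer = fabric.setup(model, optimizer)

tokens = torch.randint(0, vocab_size, (64, block_size + 1))   # fake tokenized chunks
dataset = TensorDataset(tokens[:, :-1], tokens[:, 1:])        # next-token prediction pairs
loader = fabric.setup_dataloaders(DataLoader(dataset, batch_size=8))

for step, (inputs, targets) in enumerate(loader):
    logits = model(inputs)                                    # (B, T, vocab_size)
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    fabric.backward(loss)                                     # replaces loss.backward()
    optimizer.step()
    if step % 4 == 0:
        fabric.print(f"step {step}: loss {loss.item():.3f}")

fabric.save("checkpoint.pt", {"model": model, "optimizer": optimizer})
```

For multi-GPU pretraining the same loop runs unchanged; only the Fabric constructor changes, e.g. `Fabric(accelerator="cuda", devices=4, strategy="ddp", precision="bf16-mixed")`.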

📬 License

MIT License © 2025
