🤗 Transformers Template Project

A comprehensive template for training and evaluating deep learning models using PyTorch and the Hugging Face ecosystem. This template provides a well-structured foundation for NLP and multimodal projects with support for custom models, datasets, and training configurations.

📋 Table of Contents

  • ✨ Features
  • 🏗️ Project Structure
  • 🚀 Installation
  • 🎯 Quick Start
  • ⚙️ Configuration
  • 📖 Usage
  • 🔧 Development
  • 🤝 Contributing
  • 🙏 Acknowledgments

✨ Features

  • πŸ—οΈ Modular Architecture: Clean separation of models, datasets, training, and utilities
  • πŸ€— Hugging Face Integration: Built-in support for Transformers, Datasets, and Accelerate
  • ⚑ Distributed Training: Multi-GPU and multi-node training with Accelerate
  • πŸ“Š Comprehensive Logging: Built-in experiment tracking and visualization
  • πŸ”§ Flexible Configuration: YAML-based configuration system
  • πŸ“¦ Easy Deployment: Support for both conda and pip environments
  • πŸ§ͺ Testing Framework: Structured testing and evaluation pipeline
  • πŸ““ Jupyter Support: Interactive development with notebook examples

πŸ—οΈ Project Structure

huggingface-template/
β”œβ”€β”€ config/                     # Configuration files
β”‚   β”œβ”€β”€ accelerate_config.yaml  # Accelerate configuration
β”‚   └── training_args.yaml      # Training arguments
β”œβ”€β”€ data/                       # Data directories
β”‚   β”œβ”€β”€ raw/                    # Raw data
β”‚   β”œβ”€β”€ processed/              # Processed data
β”‚   β”œβ”€β”€ interim/                # Intermediate data
β”‚   └── external/               # External data sources
β”œβ”€β”€ datasets/                   # Dataset implementations
β”‚   └── example_dataset.py      # Example dataset class
β”œβ”€β”€ models/                     # Model implementations
β”‚   β”œβ”€β”€ pretrained_model/       # Custom pretrained models
β”‚   β”‚   β”œβ”€β”€ pretrained_model.py
β”‚   β”‚   └── pretrained_model_config.py
β”‚   └── other/                  # Other model architectures
β”œβ”€β”€ training/                   # Training scripts
β”‚   β”œβ”€β”€ pretrained_model/       # Training scripts for pretrained models
β”‚   β”‚   └── train.py
β”‚   └── other/                  # Other training scripts
β”œβ”€β”€ processing/                 # Data processing utilities
β”‚   └── my_processor.py         # Custom processor implementation
β”œβ”€β”€ utils/                      # Utility functions
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── training_args.py        # Training argument utilities
β”œβ”€β”€ visualization/              # Visualization utilities
β”‚   └── visualization.py
β”œβ”€β”€ notebooks/                  # Jupyter notebooks
β”œβ”€β”€ docs/                       # Documentation
β”œβ”€β”€ main.py                     # Main entry point
β”œβ”€β”€ environment.yml             # Conda environment
β”œβ”€β”€ pyproject.toml             # Python project configuration
└── README.md                  # This file

🚀 Installation

Option 1: Using uv (Recommended)

# Clone the repository
git clone https://github.com/charlieJ107/huggingface-template.git
cd huggingface-template

# Install using uv (recommended)
uv sync
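
uv sync creates a local virtual environment (.venv) for the project; you can run commands through it without activating it manually:

# Run a command inside the uv-managed environment
uv run python main.py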

Option 2: Using Conda

# Clone the repository
git clone https://github.com/charlieJ107/huggingface-template.git
cd huggingface-template

# Create and activate conda environment
conda env create -f environment.yml
conda activate my-project

Option 3: Using a Devcontainer

You can also use a devcontainer to set up your environment; see the .devcontainer directory for details.

🎯 Quick Start

1. Basic Usage

# Run the main script
python main.py

2. Training a Model

# Train with default configuration
python training/pretrained_model/train.py

# Train with custom configuration
python training/pretrained_model/train.py --config config/custom_training_args.yaml

3. Using Jupyter Notebooks

# Start Jupyter
jupyter notebook

# Navigate to notebooks/ directory for examples

βš™οΈ Configuration

Training Arguments

Edit config/training_args.yaml to customize training parameters:

# Key training parameters
output_dir: "./outputs"
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
eval_strategy: "epoch"
save_strategy: "epoch"
logging_steps: 100
num_train_epochs: 3
learning_rate: 5e-5
warmup_steps: 500
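
The training scripts below read this file through the load_training_args helper in utils/training_args.py. The repository's actual implementation may differ; here is a minimal sketch, assuming the helper simply parses the YAML into a dict of keyword arguments:

# Illustrative sketch of utils/training_args.py (not the repo's exact code)
import yaml

def load_training_args(path: str) -> dict:
    """Parse a YAML config file into TrainingArguments keyword arguments."""
    with open(path, "r") as f:
        return yaml.safe_load(f)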

Accelerate Configuration

Configure distributed training in config/accelerate_config.yaml:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 2
gpu_ids: [0, 1]
mixed_precision: fp16

You can also generate this configuration interactively with the accelerate CLI:

accelerate config
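
Once a configuration file exists, launch any training script under Accelerate:

# Launch distributed training with the saved configuration
accelerate launch --config_file config/accelerate_config.yaml training/pretrained_model/train.py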

📖 Usage

Custom Models

  1. Create Model Configuration:

    # models/your_model/your_model_config.py
    from transformers import PretrainedConfig
    
    class YourModelConfig(PretrainedConfig):
        model_type = "your_model"
        
        def __init__(self, vocab_size=30522, **kwargs):
            super().__init__(**kwargs)
            self.vocab_size = vocab_size
  2. Implement Model:

    # models/your_model/your_model.py
    from transformers import PreTrainedModel
    from .your_model_config import YourModelConfig
    
    class YourModel(PreTrainedModel):
        config_class = YourModelConfig
        
        def __init__(self, config):
            super().__init__(config)
            # Your model implementation
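
Optionally, you can register the config and model so they resolve through the Transformers Auto classes; AutoConfig.register and AutoModel.register are the standard APIs for this. The import paths below assume the layout from the steps above:

# Register the custom classes with the Auto machinery (optional)
from transformers import AutoConfig, AutoModel

from models.your_model.your_model import YourModel
from models.your_model.your_model_config import YourModelConfig

AutoConfig.register("your_model", YourModelConfig)
AutoModel.register(YourModelConfig, YourModel)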

Custom Datasets

# datasets/your_dataset.py
import json

from torch.utils.data import Dataset

class YourDataset(Dataset):
    def __init__(self, data_path):
        # Load your data, e.g. one JSON record per line
        with open(data_path, "r") as f:
            self.data = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return a processed sample
        return self.data[idx]
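
Outside of the Trainer API, a dataset like this plugs into a standard PyTorch DataLoader; the file path below is a placeholder:

# Batch samples with a plain PyTorch DataLoader
from torch.utils.data import DataLoader

dataset = YourDataset("data/processed/train.jsonl")  # placeholder path
loader = DataLoader(dataset, batch_size=8, shuffle=True)
for batch in loader:
    ...  # forward pass, loss, etc.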

Training Scripts

# training/your_model/train.py
from transformers import TrainingArguments, Trainer

from datasets.your_dataset import YourDataset  # local datasets/ package (run from the repo root)
from models.your_model.your_model import YourModel
from utils import load_training_args

# Load configuration
args = load_training_args("config/training_args.yaml")
training_args = TrainingArguments(**args)

# Initialize model, dataset, trainer
model = YourModel.from_pretrained("your-model-name")
train_dataset = YourDataset("data/train")
eval_dataset = YourDataset("data/eval")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Start training
trainer.train()
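
After training, evaluation and persistence use the standard Trainer methods (the output path is illustrative):

# Evaluate on eval_dataset and save the final weights + config
metrics = trainer.evaluate()
print(metrics)
trainer.save_model("outputs/final_model")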

🔧 Development

Adding New Components

  1. Models: Add to models/ directory with config and implementation
  2. Datasets: Add to datasets/ directory with custom Dataset classes
  3. Training: Add training scripts to training/ directory
  4. Processing: Add data processors to processing/ directory

Code Style

This project follows PEP 8 coding standards. Use tools like black and flake8 for code formatting and linting.

Testing

# Run tests
python -m pytest tests/

# Run specific test
python -m pytest tests/test_models.py
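
A tests/ directory is not shown in the project structure above, so you create it yourself; a minimal test file might look like this (names are illustrative):

# tests/test_models.py (illustrative)
from models.your_model.your_model_config import YourModelConfig

def test_config_defaults():
    config = YourModelConfig()
    assert config.vocab_size == 30522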

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ™ Acknowledgments

  • Hugging Face for the amazing Transformers library
  • PyTorch for the deep learning framework
  • The open-source community for inspiration and contributions

Happy coding! 🚀
