A comprehensive template for training and evaluating deep learning models using PyTorch and the Hugging Face ecosystem. This template provides a well-structured foundation for NLP and multimodal projects with support for custom models, datasets, and training configurations.
- Modular Architecture: Clean separation of models, datasets, training, and utilities
- Hugging Face Integration: Built-in support for Transformers, Datasets, and Accelerate
- Distributed Training: Multi-GPU and multi-node training with Accelerate
- Comprehensive Logging: Built-in experiment tracking and visualization
- Flexible Configuration: YAML-based configuration system
- Easy Deployment: Support for both conda and pip environments
- Testing Framework: Structured testing and evaluation pipeline
- Jupyter Support: Interactive development with notebook examples
huggingface-template/
├── config/                      # Configuration files
│   ├── accelerate_config.yaml   # Accelerate configuration
│   └── training_args.yaml       # Training arguments
├── data/                        # Data directories
│   ├── raw/                     # Raw data
│   ├── processed/               # Processed data
│   ├── interim/                 # Intermediate data
│   └── external/                # External data sources
├── datasets/                    # Dataset implementations
│   └── example_dataset.py       # Example dataset class
├── models/                      # Model implementations
│   ├── pretrained_model/        # Custom pretrained models
│   │   ├── pretrained_model.py
│   │   └── pretrained_model_config.py
│   └── other/                   # Other model architectures
├── training/                    # Training scripts
│   ├── pretrained_model/        # Training scripts for pretrained models
│   │   └── train.py
│   └── other/                   # Other training scripts
├── processing/                  # Data processing utilities
│   └── my_processor.py          # Custom processor implementation
├── utils/                       # Utility functions
│   ├── __init__.py
│   └── training_args.py         # Training argument utilities
├── visualization/               # Visualization utilities
│   └── visualization.py
├── notebooks/                   # Jupyter notebooks
├── docs/                        # Documentation
├── main.py                      # Main entry point
├── environment.yml              # Conda environment
├── pyproject.toml               # Python project configuration
└── README.md                    # This file
# Clone the repository
git clone https://github.com/charlieJ107/huggingface-template.git
cd huggingface-template
# Install using uv (recommended)
uv sync
# Clone the repository
git clone https://github.com/charlieJ107/huggingface-template.git
cd huggingface-template
# Create and activate conda environment
conda env create -f environment.yml
conda activate my-project
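After either setup, a quick Python check confirms that the core dependencies are importable (a minimal sketch; exact versions depend on your lockfile or environment file):

```python
# Sanity check for the installed environment
import torch
import transformers

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"Transformers {transformers.__version__}")
```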
You can also use a devcontainer to set up your environment; see the .devcontainer directory for details.
# Run the main script
python main.py
# Train with default configuration
python training/pretrained_model/train.py
# Train with custom configuration
python training/pretrained_model/train.py --config config/custom_training_args.yaml
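The `--config` flag is handled inside the training script itself. A minimal sketch of how such a script might wire the flag to `TrainingArguments` (the argparse wiring here is illustrative; only the `load_training_args` helper comes from the template's `utils/`):

```python
# Illustrative sketch of --config handling; the template's train.py may differ
import argparse

from transformers import TrainingArguments

from utils import load_training_args  # helper from utils/training_args.py


def parse_args():
    parser = argparse.ArgumentParser(description="Train a model from a YAML config")
    parser.add_argument(
        "--config",
        default="config/training_args.yaml",
        help="Path to a YAML file with TrainingArguments fields",
    )
    return parser.parse_args()


if __name__ == "__main__":
    cli_args = parse_args()
    training_args = TrainingArguments(**load_training_args(cli_args.config))
```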
# Start Jupyter
jupyter notebook
# Navigate to notebooks/ directory for examples
Edit `config/training_args.yaml` to customize training parameters:
# Key training parameters
output_dir: "./outputs"
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
eval_strategy: "epoch"
save_strategy: "epoch"
logging_steps: 100
num_train_epochs: 3
learning_rate: 5e-5
warmup_steps: 500
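These values are loaded from the YAML file and passed to `TrainingArguments`. A minimal sketch of that loading step, assuming a plain `yaml.safe_load` (the template's `utils/training_args.py` may implement it differently):

```python
# Minimal sketch of turning config/training_args.yaml into TrainingArguments
# (illustrative; utils/training_args.py may handle this differently)
import yaml
from transformers import TrainingArguments

with open("config/training_args.yaml") as f:
    args_dict = yaml.safe_load(f)

# Keep numeric fields numeric: PyYAML reads a bare "5e-5" as a string, so cast if needed
args_dict["learning_rate"] = float(args_dict["learning_rate"])

training_args = TrainingArguments(**args_dict)
print(training_args.output_dir)
```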
Configure distributed training in `config/accelerate_config.yaml`:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 2
gpu_ids: [0, 1]
mixed_precision: fp16
You can also generate this configuration interactively with the Accelerate CLI:
accelerate config
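Trainer-based scripts pick up this setup when launched through the Accelerate CLI; if you write a custom training loop instead, the same configuration can drive it through the `Accelerator` API. A minimal, illustrative sketch (the tiny model and random data are placeholders, not part of the template):

```python
# Minimal custom-loop sketch using Accelerate; when launched with
# `accelerate launch --config_file config/accelerate_config.yaml your_script.py`,
# Accelerate handles device placement and process setup.
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
    batch_size=8,
)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)
    optimizer.step()
```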
- Create Model Configuration:

# models/your_model/your_model_config.py
from transformers import PretrainedConfig

class YourModelConfig(PretrainedConfig):
    model_type = "your_model"

    def __init__(self, vocab_size=30522, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
- Implement Model:

# models/your_model/your_model.py
from transformers import PreTrainedModel

from .your_model_config import YourModelConfig

class YourModel(PreTrainedModel):
    config_class = YourModelConfig

    def __init__(self, config):
        super().__init__(config)
        # Your model implementation
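Optionally, the custom config and model can be registered with the Auto classes so that `AutoConfig` and `AutoModel` resolve the `your_model` type; a short sketch using the standard Transformers registration API (the import paths assume the layout above):

```python
# Optional: register the custom model with the Auto classes
from transformers import AutoConfig, AutoModel

from models.your_model.your_model import YourModel
from models.your_model.your_model_config import YourModelConfig

AutoConfig.register("your_model", YourModelConfig)
AutoModel.register(YourModelConfig, YourModel)

config = YourModelConfig(vocab_size=30522)
model = AutoModel.from_config(config)
```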
# datasets/your_dataset.py
from torch.utils.data import Dataset

class YourDataset(Dataset):
    def __init__(self, data_path):
        # Load your data into an indexable container
        self.data = []

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return a processed sample
        return self.data[idx]
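In practice, `__getitem__` usually tokenizes the raw example so the `Trainer` can consume it directly. A hedged sketch of one possible implementation (the JSON Lines format, column names, and tokenizer checkpoint are assumptions, not part of the template):

```python
# Illustrative tokenizing dataset; file format, column names, and tokenizer
# checkpoint are assumptions chosen for the example
import json

from torch.utils.data import Dataset
from transformers import AutoTokenizer


class JsonlTextDataset(Dataset):
    def __init__(self, data_path, tokenizer_name="bert-base-uncased", max_length=128):
        with open(data_path) as f:
            self.data = [json.loads(line) for line in f]
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        record = self.data[idx]
        encoded = self.tokenizer(
            record["text"],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        item = {k: v.squeeze(0) for k, v in encoded.items()}
        item["labels"] = record["label"]
        return item
```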
# training/your_model/train.py
from transformers import TrainingArguments, Trainer

from datasets.your_dataset import YourDataset
from models.your_model.your_model import YourModel
from utils import load_training_args

# Load configuration
args = load_training_args("config/training_args.yaml")
training_args = TrainingArguments(**args)

# Initialize model, datasets, and trainer
model = YourModel.from_pretrained("your-model-name")
train_dataset = YourDataset("data/train")
eval_dataset = YourDataset("data/eval")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Start training
trainer.train()
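If you want metrics beyond the evaluation loss, `Trainer` also accepts a `compute_metrics` callback. A small optional sketch, assuming a classification-style model whose logits can be argmax-ed:

```python
# Optional: report accuracy during evaluation (assumes a classification head)
import numpy as np


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
```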
- Models: Add to the `models/` directory with a config class and model implementation
- Datasets: Add custom `Dataset` classes to the `datasets/` directory
- Training: Add training scripts to the `training/` directory
- Processing: Add data processors to the `processing/` directory
This project follows PEP 8 coding standards. Use tools like `black` and `flake8` for code formatting and linting.
# Run tests
python -m pytest tests/
# Run specific test
python -m pytest tests/test_models.py
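The template does not prescribe test contents; as an illustration, a config round-trip test for the custom model above might look like this (the file name and assertions are assumptions, not shipped with the template):

```python
# tests/test_models.py -- illustrative example
from models.your_model.your_model_config import YourModelConfig


def test_config_roundtrip(tmp_path):
    config = YourModelConfig(vocab_size=1234)
    config.save_pretrained(tmp_path)

    loaded = YourModelConfig.from_pretrained(tmp_path)
    assert loaded.vocab_size == 1234
    assert loaded.model_type == "your_model"
```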
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Hugging Face for the amazing Transformers library
- PyTorch for the deep learning framework
- The open-source community for inspiration and contributions
Happy coding!