An end-to-end machine learning pipeline built on AWS SageMaker Pipelines, designed to support parallel model development and batch scoring on distributed, containerized infrastructure.
This project demonstrates the use of SageMaker Pipelines to operationalize a machine learning workflow that includes:
- Feature engineering
- Model training with XGBoost
- Model evaluation based on MSE threshold
- Conditional model registration
- Offline batch scoring using SageMaker Batch Transform
Ideal for MLOps teams looking to streamline experimentation, ensure consistency in deployment workflows, and scale processing across compute instances.
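The gating logic behind these stages can be sketched in plain Python. This is only an illustration of the control flow; the actual project expresses each step as a SageMaker Pipeline step, and the threshold and metric values below are hypothetical placeholders:

```python
# Minimal sketch of the pipeline's evaluate-then-register gate in plain
# Python. The real pipeline uses SageMaker Pipeline steps; the threshold
# and sample values here are illustrative placeholders.

def mean_squared_error(y_true, y_pred):
    """MSE between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def run_pipeline(y_val, y_pred, mse_threshold=6.0):
    """Evaluate, then conditionally register and batch-score the model."""
    mse = mean_squared_error(y_val, y_pred)
    if mse < mse_threshold:
        # In the real pipeline: Register Model + Batch Transform steps run
        return {"mse": mse, "registered": True}
    # Models above the threshold are not promoted or used for scoring
    return {"mse": mse, "registered": False}

result = run_pipeline([3.0, 5.0, 2.5], [2.5, 5.0, 3.0], mse_threshold=6.0)
print(result)  # registered, since MSE is well below the threshold
```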
Pipeline stages:

| Stage | Description |
|---|---|
| Processing | Executes preprocessing.py to clean and split the data |
| Training | Trains an XGBoost model on the training set |
| Evaluation | Evaluates the model against the validation set using MSE |
| Register Model | Registers the model if MSE < threshold |
| Batch Transform | Scores batch data using the newly trained model |
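The handoff from Evaluation to Register Model typically flows through a JSON metrics report that the condition check reads back. A minimal stdlib sketch is below; the `evaluation.json` file name and `regression_metrics.mse.value` schema are assumptions modeled on common SageMaker examples, not taken from this repo's code:

```python
import json
import os
import tempfile

# Sketch of the evaluation-report handoff between the Evaluation and
# Register Model stages. The report schema is an assumption modeled on
# common SageMaker examples, not this repo's code.

def write_evaluation_report(path, mse):
    """Evaluation step: persist metrics for downstream condition checks."""
    report = {"regression_metrics": {"mse": {"value": mse}}}
    with open(path, "w") as f:
        json.dump(report, f)

def should_register(path, threshold):
    """Condition check: read the report and compare MSE to the threshold,
    mirroring what a SageMaker ConditionStep does via JsonGet."""
    with open(path) as f:
        report = json.load(f)
    return report["regression_metrics"]["mse"]["value"] < threshold

report_path = os.path.join(tempfile.mkdtemp(), "evaluation.json")
write_evaluation_report(report_path, mse=4.2)
print(should_register(report_path, threshold=6.0))  # True: model is registered
```

Keeping the metric in a file rather than in memory is what lets the evaluation and registration stages run as separate containers.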
With the lessons learned from this experiment, we successfully implemented parallel model development and scoring pipelines for four models, supporting both Purchase and Refinance scenarios in production.
Quick start:

1. Clone the repo: `git clone https://github.com/krishnamami/Distributed_ML_Sagemaker_Pipelines.git`
2. Install dependencies: `pip install -r requirements.txt`
3. Run the pipeline: `python sage_maker_pipeline.py`
Author: Krishna Goud
Head of Data Engineering & MLOps | Rocket LA