
MLOps Project: U.S. Visa Status Prediction

This project is a complete MLOps implementation for a U.S. Visa Status Prediction system. It includes a full CI/CD pipeline for automated training, deployment, and monitoring of a machine learning model. The model predicts whether a visa application will be approved or denied based on various applicant attributes.

Table of Contents

  • Project Overview
  • Workflow Architecture
  • Technology Stack
  • Project Structure
  • ML Pipeline Stages
  • CI/CD Pipeline
  • How to Run
  • API Endpoints

Project Overview

The primary goal of this project is to build an end-to-end machine learning system that can predict the outcome of U.S. visa applications. It demonstrates best practices in MLOps, including:

  • Modular Code Structure: The project is organized into distinct components for each step of the ML lifecycle.
  • Automated Pipelines: Both training and prediction are handled by automated pipelines.
  • CI/CD Integration: Continuous Integration and Continuous Deployment are set up using GitHub Actions to automatically train and deploy the model on every push to the main branch.
  • Cloud Integration: The system uses MongoDB for data storage and AWS S3/Cloudflare R2 for storing model artifacts.
  • Model Monitoring: The pipeline includes a data validation step to detect data drift between training and new data.

Workflow Architecture

The project is composed of two main pipelines: the Training Pipeline and the Prediction Pipeline. The CI/CD workflow automates the execution of the training pipeline and the deployment of the prediction service.

Technology Stack

  • Programming Language: Python 3.8+
  • Web Framework: Flask
  • ML Libraries: Scikit-learn, Pandas, NumPy, CatBoost
  • Database: MongoDB
  • Cloud Storage: AWS S3, Cloudflare R2
  • CI/CD: GitHub Actions
  • Containerization: Docker
  • Data Validation: Evidently AI

Project Structure

The project follows a modular structure to ensure scalability and maintainability.

├── .github/workflows/         # CI/CD pipeline configuration
│   └── cicd.yml
├── config/                    # Configuration files
│   ├── model.yaml
│   └── schema.yaml
├── notebook/                  # Jupyter notebooks for experimentation
│   ├── EDA_US_Visa.ipynb
│   └── ...
├── static/                    # Static files for web UI
│   └── css/style.css
├── templates/                 # HTML templates for Flask app
│   └── index.html
├── usa_visa/                  # Core source code package
│   ├── __init__.py
│   ├── components/            # Modules for each pipeline stage
│   │   ├── __init__.py
│   │   ├── data_ingestion.py
│   │   ├── data_validation.py
│   │   ├── data_transformation.py
│   │   ├── model_trainer.py
│   │   ├── model_evaluation.py
│   │   └── model_pusher.py
│   ├── configuration/         # Configuration for external services
│   │   ├── __init__.py
│   │   ├── aws_connection.py
│   │   └── mongo_db_connection.py
│   ├── constants/             # Project-wide constants
│   │   └── __init__.py
│   ├── data_access/           # Data access layer for MongoDB
│   │   ├── __init__.py
│   │   └── usvisa_data.py
│   ├── entity/                # Entity definitions (data classes)
│   │   ├── __init__.py
│   │   ├── artifact_entity.py
│   │   └── config_entity.py
│   ├── exception/             # Custom exception handling
│   │   └── __init__.py
│   ├── logger/                # Custom logging
│   │   └── __init__.py
│   ├── pipeline/              # Pipeline orchestration
│   │   ├── __init__.py
│   │   ├── training_pipeline.py
│   │   └── prediction_pipeline.py
│   └── utils/                 # Utility functions
│       ├── __init__.py
│       └── main_utils.py
├── .gitignore
├── app.py                     # Main Flask application
├── demo.py                    # Script to run training pipeline
├── Dockerfile
├── LICENSE
├── requirements.txt
└── setup.py                   # Setup script for the package

ML Pipeline Stages

The training pipeline is a sequence of components, each responsible for a specific task.

1. Data Ingestion

This stage is responsible for fetching the dataset from the source (MongoDB), splitting it into training and testing sets, and storing them as artifacts for the next stages.
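A minimal sketch of this stage, assuming pymongo and pandas; the database and collection names below are illustrative, not taken from the repository:

import os
import pandas as pd
from pymongo import MongoClient
from sklearn.model_selection import train_test_split

def ingest_data(artifact_dir: str = "artifact", test_size: float = 0.2) -> None:
    # Connect to MongoDB using the URI from the environment.
    client = MongoClient(os.environ["MONGO_DB_URL"])
    collection = client["usvisa_db"]["visa_applications"]  # illustrative names

    # Load the collection into a DataFrame and drop Mongo's internal _id.
    df = pd.DataFrame(list(collection.find())).drop(columns=["_id"], errors="ignore")

    # Split and persist the sets as artifacts for the downstream stages.
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=42)
    os.makedirs(artifact_dir, exist_ok=True)
    train_df.to_csv(os.path.join(artifact_dir, "train.csv"), index=False)
    test_df.to_csv(os.path.join(artifact_dir, "test.csv"), index=False)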

2. Data Validation

This stage validates the ingested data to ensure its quality and integrity. It performs two key checks:

  • Schema Validation: Verifies that the data conforms to a predefined schema (schema.yaml), checking for correct column names, data types, and value ranges.
  • Data Drift Detection: Compares the statistical properties of the new training data with a baseline (or previous batch) to detect any significant changes (drift). This is crucial for maintaining model performance over time.
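
A minimal sketch of the drift check, assuming the Report API from Evidently 0.4-era releases (newer versions have changed the interface, and the repository's exact report configuration may differ):

import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

def detect_drift(reference_df: pd.DataFrame, current_df: pd.DataFrame) -> bool:
    # Compare the new batch against the baseline distribution.
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference_df, current_data=current_df)

    # The preset's first metric summarizes dataset-level drift as a boolean.
    result = report.as_dict()
    return result["metrics"][0]["result"]["dataset_drift"]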

3. Data Transformation

This stage preprocesses the validated data to make it suitable for model training, applying transformations such as one-hot encoding for categorical features and scaling for numerical features. The fitted transformation pipeline object is saved so that the exact same preprocessing can be applied at prediction time.
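
A minimal sketch of such a preprocessing pipeline using scikit-learn's ColumnTransformer; in the actual project the column lists are driven by the config files rather than hard-coded:

import joblib
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(categorical_cols: list, numerical_cols: list) -> ColumnTransformer:
    # One-hot encode categoricals and scale numericals; in the real pipeline
    # the column lists come from schema.yaml.
    return ColumnTransformer([
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("numerical", StandardScaler(), numerical_cols),
    ])

def fit_and_save(preprocessor: ColumnTransformer, train_df, path: str = "artifact/preprocessor.pkl") -> None:
    # The fitted object is persisted so prediction applies identical preprocessing.
    preprocessor.fit(train_df)
    joblib.dump(preprocessor, path)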

4. Model Trainer

This stage takes the transformed data and trains a machine learning model. It uses the CatBoost Classifier algorithm. After training, the model is saved as a .pkl file.
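
A minimal sketch of this stage; the hyperparameters shown are placeholders for the values configured in model.yaml:

import joblib
from catboost import CatBoostClassifier

def train_model(X_train, y_train, model_path: str = "artifact/model.pkl") -> CatBoostClassifier:
    # Hyperparameter values here are illustrative, not the project's tuned ones.
    model = CatBoostClassifier(iterations=500, depth=6, learning_rate=0.1, verbose=False)
    model.fit(X_train, y_train)
    joblib.dump(model, model_path)  # saved as the .pkl training artifact
    return model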

5. Model Evaluation

This stage evaluates the newly trained model's performance against the model currently in production. If no production model exists, the new model is accepted by default. If a production model exists, the new model is accepted only if its performance (e.g., F1-score) is better by a predefined margin.
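
A minimal sketch of the acceptance logic, assuming binary-encoded labels and an F1-score comparison; the actual metric and margin are configuration choices:

from sklearn.metrics import f1_score

def is_accepted(new_model, production_model, X_test, y_test, margin: float = 0.02) -> bool:
    # With no production model yet, the newly trained model is accepted.
    if production_model is None:
        return True

    # Assumes labels are already binary-encoded (e.g. 1 = Certified).
    new_f1 = f1_score(y_test, new_model.predict(X_test))
    prod_f1 = f1_score(y_test, production_model.predict(X_test))

    # Accept only if the new model beats production by the configured margin.
    return (new_f1 - prod_f1) > margin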

6. Model Pusher

If the model evaluation stage accepts the new model, this stage pushes the model artifact to a production-ready cloud storage location (AWS S3 or Cloudflare R2). This makes the model available for the prediction service.
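
A minimal sketch using boto3, which covers both targets because Cloudflare R2 exposes an S3-compatible API; the bucket name is an assumption:

import os
import boto3

def push_model(local_path: str = "artifact/model.pkl",
               bucket: str = "usvisa-models",  # illustrative bucket name
               key: str = "model.pkl") -> None:
    # With no endpoint override, boto3 targets AWS S3; pointing endpoint_url
    # at the R2 endpoint uses Cloudflare's S3-compatible API instead.
    s3 = boto3.client("s3", endpoint_url=os.getenv("CLOUDFLARE_ENDPOINT_URL"))
    s3.upload_file(local_path, bucket, key)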


CI/CD Pipeline

The project is configured with a CI/CD pipeline using GitHub Actions (.github/workflows/cicd.yml). This pipeline automates the entire process from code commit to deployment.

Workflow:

  1. Trigger: The workflow is triggered on every push to the main branch.
  2. Setup: It checks out the repository and provisions the runner with the required Python version.
  3. Install Dependencies: It installs all the project dependencies from requirements.txt.
  4. Run Training Pipeline: It executes the entire ML training pipeline by running the demo.py script. This ensures the code is working and produces a new model if necessary.
  5. Build Docker Image: It builds a Docker image of the Flask application.
  6. Push to Docker Hub: It pushes the newly built image to Docker Hub, making it ready for deployment.

This automated process ensures that every change is tested and a new, potentially better model is deployed seamlessly.
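
For orientation, a condensed sketch of what .github/workflows/cicd.yml might contain; the action versions, image name, and secret names are assumptions, not the repository's actual workflow:

name: CI/CD

on:
  push:
    branches: [main]

jobs:
  train-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.8"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run training pipeline
        run: python demo.py
        env:
          MONGO_DB_URL: ${{ secrets.MONGO_DB_URL }}  # secret names assumed
      - name: Build and push Docker image
        run: |
          docker build -t your-dockerhub-user/mlops-us-visa-status:latest .
          echo "${{ secrets.DOCKERHUB_TOKEN }}" | docker login -u your-dockerhub-user --password-stdin
          docker push your-dockerhub-user/mlops-us-visa-status:latest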

How to Run

Prerequisites

  • Python 3.8 or higher
  • MongoDB account and connection URI
  • AWS account with S3 bucket (or Cloudflare account with R2 bucket)
  • Docker installed locally (for building the image)

Setup

  1. Clone the repository:

    git clone https://github.com/your-username/mlops-us-visa-status.git
    cd mlops-us-visa-status
  2. Create a virtual environment and activate it:

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  3. Install the required packages:

    pip install -r requirements.txt
  4. Set up Environment Variables: Create a .env file in the root directory and add your credentials. The application will automatically load them.

    MONGO_DB_URL="your_mongodb_uri"
    AWS_ACCESS_KEY_ID="your_aws_access_key"
    AWS_SECRET_ACCESS_KEY="your_aws_secret_key"
    # Or for Cloudflare R2
    CLOUDFLARE_ENDPOINT_URL="your_r2_endpoint"
    CLOUDFLARE_ACCESS_KEY_ID="your_r2_access_key"
    CLOUDFLARE_SECRET_ACCESS_KEY="your_r2_secret_key"
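
A minimal sketch of how the application might read these values, assuming the python-dotenv package:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root into the process environment
mongo_uri = os.getenv("MONGO_DB_URL")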
    

Running the Training Pipeline

To run the training pipeline manually, execute the demo.py script:

python demo.py

This will run all the stages from data ingestion to model pusher and save the artifacts in the artifact/ directory.

Running the Prediction Server

To start the Flask server for predictions, run the app.py script:

python app.py

The server will start on http://127.0.0.1:8080.

API Endpoints

The Flask application provides the following endpoints:

  • GET /: Renders a simple web UI to input data and get a prediction.
  • POST /predict: The main prediction endpoint. It accepts JSON data with the applicant's information and returns the visa status prediction.
  • GET /train: Triggers a new training pipeline run.
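
A minimal sketch of how app.py might wire these routes, assuming the preprocessor and model artifacts saved by the training pipeline; the artifact paths and label encoding are assumptions:

import joblib
import pandas as pd
from flask import Flask, jsonify, render_template, request

app = Flask(__name__)
preprocessor = joblib.load("artifact/preprocessor.pkl")  # illustrative paths
model = joblib.load("artifact/model.pkl")

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    # Apply the saved transformation pipeline, then the trained model.
    features = pd.DataFrame([request.get_json()])
    prediction = model.predict(preprocessor.transform(features))
    status = "Certified" if prediction[0] == 1 else "Denied"  # assumed encoding
    return jsonify({"case_status": status})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=8080)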

Example curl request for prediction:

curl -X POST http://127.0.0.1:8080/predict \
-H "Content-Type: application/json" \
-d '{
      "continent": "Asia",
      "education_of_employee": "Master'\''s",
      "has_job_experience": "Y",
      "requires_job_training": "N",
      "no_of_employees": 100,
      "company_age": 20,
      "region_of_employment": "West",
      "prevailing_wage": 120000,
      "unit_of_wage": "Year",
      "full_time_position": "Y"
    }'
