Classification model: Stroke Prediction

A model for predicting the risk of stroke in a patient

This project was developed to determine the likelihood of a patient having a stroke. The interactive interface is based on Streamlit, which allows you to easily interact with the model and analyze the results.

Description

1. Project objective: to create an analytical tool for studying a patient's risk of stroke, taking into account various factors.

2. Project tasks:

data analysis: identify the key factors that influence the risk of stroke;
model building: use machine learning and statistical analysis to create a model that can predict stroke risk;
user interface: develop an interactive interface that allows users to enter new data, analyze the results, and make predictions based on the model.

Technologies

The project was implemented using the following technologies:

Python: the main programming language

Libraries

Pandas: for data processing;
Numpy: for numerical calculations;
Scikit-learn: for building and evaluating machine learning models;
Imbalanced-learn: for handling classification problems with imbalanced classes;
Matplotlib and Seaborn: for data visualization;
Streamlit: for creating an interactive interface;
Joblib: for efficient serialization (saving) and loading of Python objects.

Dataset

The dataset used for this project has the following characteristics:

source: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset;
format: .csv;
contains the following key columns: gender, age, hypertension, stroke, etc.

The dataset contains 5110 rows and 12 columns:

id: unique identifier;
gender: the patient's gender ("Male", "Female" or "Other");
age: patient's age;
hypertension: the presence of hypertension (0 - no, 1 - yes);
heart_disease: presence of heart disease (0 - no, 1 - yes);
ever_married: marriage status ("No" or "Yes");
work_type: type of work ("children", "Govt_job", "Never_worked", "Private" or "Self-employed");
Residence_type: type of residence ("Rural" or "Urban");
avg_glucose_level: average blood glucose level;
bmi: body mass index;
smoking_status: smoking status ("formerly smoked", "never smoked", "smokes" or "Unknown");
stroke: whether a stroke has occurred (0 - no, 1 - yes).
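The column list above can be checked when the file is loaded. The following is a minimal sketch, assuming the CSV is read with Pandas; the helper name load_stroke_data and the two sample rows are made up for illustration and are not taken from the project or the dataset.

```python
import io

import pandas as pd

# The 12 columns documented above, in the same order
EXPECTED_COLUMNS = [
    "id", "gender", "age", "hypertension", "heart_disease", "ever_married",
    "work_type", "Residence_type", "avg_glucose_level", "bmi",
    "smoking_status", "stroke",
]


def load_stroke_data(source):
    """Read the CSV (path or file-like object) and verify the documented columns."""
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df[EXPECTED_COLUMNS]


# Two invented rows, for illustration only (not real dataset records)
sample_csv = io.StringIO(
    "id,gender,age,hypertension,heart_disease,ever_married,work_type,"
    "Residence_type,avg_glucose_level,bmi,smoking_status,stroke\n"
    "1,Male,67,0,1,Yes,Private,Urban,200.5,33.1,formerly smoked,1\n"
    "2,Female,35,0,0,No,Self-employed,Rural,85.0,24.0,never smoked,0\n"
)
df = load_stroke_data(sample_csv)
print(df.shape)
print(df["stroke"].value_counts())
```

With the real file, a path to the downloaded .csv would be passed instead of the in-memory sample.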

Correlation analysis

Correlation results (correlation of each feature with the stroke column):

❌ Positive association with stroke risk:

age: 0.245239 - older people have a higher risk;

heart_disease: 0.134905 - people with heart disease have a higher risk of stroke;

hypertension: 0.127891 - people with hypertension have a higher risk of stroke;

avg_glucose_level: 0.131991 - higher glucose levels are associated with a higher risk of stroke;

ever_married_Yes: 0.108299 - married people have a slightly higher risk of stroke;

smoking_status_formerly smoked: 0.064683 - former smokers have a higher risk of stroke;

work_type_Self-employed: 0.062150 - self-employed persons have a higher risk of stroke;

bmi: 0.036075 - higher body mass index is associated with higher risk of stroke;

❗Positive but very weak association:

Residence_type_Urban: 0.015415 - urban residence;

work_type_Private: 0.011927 - work in the private sector;

gender_Male: 0.009081 - male gender;

smoking_status_smokes: 0.008920 - smokers;

work_type_Govt_job: 0.002660 - work in the civil service;

❎ Negative but very weak association:

smoking_status_never smoked: -0.004163 - people who have never smoked;

gender_Female: -0.009081 - female gender;

work_type_Never_worked: -0.014885 - people who have never worked;

Residence_type_Rural: -0.015415 - rural residence;

✅ Negative association:

smoking_status_Unknown: -0.055924 - people with unknown smoking status;

work_type_children: -0.083888 - children have a lower risk of stroke;

ever_married_No: -0.108299 - unmarried people have a lower risk of stroke.
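Feature names such as ever_married_Yes suggest that the categorical columns were one-hot encoded before the correlations were computed. The sketch below shows one way to obtain such a ranking, assuming pandas.get_dummies and the default Pearson correlation; the function name stroke_correlations and the toy rows are hypothetical, and the project's notebooks may do this differently.

```python
import pandas as pd


def stroke_correlations(df, target="stroke"):
    """One-hot encode categorical columns and correlate each feature with the target."""
    features = df.drop(columns=["id"], errors="ignore")  # the identifier carries no signal
    encoded = pd.get_dummies(features, dtype=float)
    corr = encoded.corr()[target].drop(target)
    return corr.sort_values(ascending=False)


# Invented rows, for illustration only (not real dataset records)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "age": [80, 75, 30, 25, 60, 40],
    "ever_married": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "stroke": [1, 1, 0, 0, 0, 0],
})
corr = stroke_correlations(toy)
print(corr)
```

For a category with exactly two values, the two indicator columns are complements of each other, so their coefficients are equal in magnitude and opposite in sign. This is why ever_married_Yes and ever_married_No appear in the lists above as 0.108299 and -0.108299.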

Model Comparison

Model	Dataset	Precision	Recall	F1_score	Accuracy
Logistic Regression	Train	0.142370	0.839196	0.243440	0.746024
Decision Tree	Train	0.443207	1.000000	0.614198	0.938830
Random Forest	Train	0.503856	0.984925	0.666667	0.952043
Gradient Boosting	Train	0.991736	0.603015	0.750000	0.980426
Balanced Random Forest	Train	0.208814	1.000000	0.345486	0.815513
Logistic Regression	Test	0.130282	0.740000	0.221557	0.745597
Decision Tree	Test	0.141593	0.320000	0.196319	0.871820
Random Forest	Test	0.171053	0.260000	0.206349	0.902153
Gradient Boosting	Test	0.222222	0.040000	0.067797	0.946184
Balanced Random Forest	Test	0.153846	0.720000	0.253521	0.792564
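A table with these columns can be produced by scoring each fitted model on both splits. The following is a rough sketch using Scikit-learn only; the synthetic data, the class_weight="balanced" setting, the 80/20 split and the helper name score_row are all assumptions made for illustration, not the project's actual training setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def score_row(name, split, y_true, y_pred):
    """Build one row of the comparison table for the positive class."""
    return {
        "Model": name,
        "Dataset": split,
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
        "F1_score": f1_score(y_true, y_pred, zero_division=0),
        "Accuracy": accuracy_score(y_true, y_pred),
    }


# Synthetic, imbalanced stand-in data (roughly 5% positives); not the stroke dataset
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Two of the model families from the table, with assumed hyperparameters
models = {
    "Logistic Regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(class_weight="balanced", random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append(score_row(name, "Train", y_train, model.predict(X_train)))
    rows.append(score_row(name, "Test", y_test, model.predict(X_test)))

for row in rows:
    print(row)
```

Comparing the Train and Test rows of the same model, as in the table above, makes overfitting visible: a large drop in Recall or F1_score from Train to Test indicates that the model does not generalize well.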

Confusion matrix LogisticRegression

Confusion matrix BalancedRandomForestClassifier

💓Summing up

LogisticRegression and BalancedRandomForestClassifier show the highest sensitivity (recall) on the test dataset. When precision and F1 score are also taken into account, BalancedRandomForestClassifier may be the more suitable choice for this task, since it scores slightly higher on both.

If the primary goal is not to miss patients, i.e. minimize false negatives, then the key metric is Recall, namely the sensitivity for class 1.

Recall is the proportion of correctly identified patients among all patients who actually had a stroke, which is critical if we are more concerned about a sick patient being classified as healthy.

With Recall for the positive class as the main metric, LogisticRegression demonstrates the highest recall for class 1 on the test data (0.74), followed by BalancedRandomForestClassifier (0.72).

This means that 74% and 72% of sick patients, respectively, were correctly identified, which is critical if a diagnostic error can lead to a patient being classified as healthy.

However, it is important to remember that high recall can be accompanied by low precision, that is, a large number of false positives. In my case, the precision for class 1 remains low: 0.13 for LogisticRegression and 0.15 for BalancedRandomForestClassifier.
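The trade-off can be made concrete by computing the metrics from confusion-matrix counts. In the sketch below, the counts are back-calculated so that they reproduce the Test rows of the comparison table, assuming a test split of 1022 rows with 50 stroke cases; they are an inference, not values read from the project's confusion matrices, and the function name classification_metrics is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute positive-class metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of alarms that are real
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of sick patients found
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": round(precision, 6),
        "recall": round(recall, 6),
        "f1": round(f1, 6),
        "accuracy": round(accuracy, 6),
    }


# Assumed counts, chosen to be consistent with the reported test metrics
lr = classification_metrics(tp=37, fp=247, fn=13, tn=725)
brf = classification_metrics(tp=36, fp=198, fn=14, tn=774)

print("LogisticRegression:", lr)
print("BalancedRandomForestClassifier:", brf)
```

Under these assumed counts, LogisticRegression finds 37 of 50 stroke cases but raises 247 false alarms, while BalancedRandomForestClassifier finds 36 of 50 with 198 false alarms. One additional detected patient costs about 49 additional false positives.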

✅ Therefore, if the main goal is not to miss sick patients, then LogisticRegression is the most appropriate model.

Run locally

Clone the repository:

git clone https://github.com/MariiaSam/Stroke-Prediction.git
cd Stroke-Prediction

Set up the virtual environment with Poetry

Install the project dependencies:

poetry install

To activate the virtual environment, run the command (in Poetry 2.0 and later, the shell command is provided by the poetry-plugin-shell plugin):

poetry shell

To add a dependency to a project, run the command:

poetry add <package_name>

To pull in existing dependencies:

poetry install

Usage

Run the Streamlit application with the command:

streamlit run app.py
