MariiaSam/Stroke-Prediction

Classification model: Stroke Prediction

A model for predicting the risk of stroke in a patient

This project was developed to determine the likelihood of a patient having a stroke. The interactive interface is based on Streamlit, which allows you to easily interact with the model and analyze the results.

Description

1. Project goal: to create an analytical tool for studying a patient's risk of stroke, taking various factors into account.

2. Project tasks:

  • data analysis: identify the key factors that influence the risk of stroke;
  • model building: use machine learning and statistical analysis to create a model that can predict stroke risk;
  • user interface: develop an interactive interface that allows users to enter new data, analyze the results, and make predictions based on the model.

Technologies

The project was implemented using the following technologies:

  • Python: the main programming language

Libraries

  • Pandas: for data processing;
  • Numpy: for numerical calculations;
  • Scikit-learn: for building and evaluating machine learning models;
  • Imbalanced-learn: for handling classification problems with imbalanced classes;
  • Matplotlib and Seaborn: for data visualization;
  • Streamlit: for creating an interactive interface;
  • Joblib: for efficient serialization (saving) and loading of Python objects.

Dataset

The dataset used for this project contains 5110 rows and 12 columns:

  • id: unique identifier;
  • gender: the patient's gender ("Male", "Female" or "Other");
  • age: patient's age;
  • hypertension: the presence of hypertension (0 - no, 1 - yes);
  • heart_disease: presence of heart disease (0 - no, 1 - yes);
  • ever_married: marriage status ("No" or "Yes");
  • work_type: type of work ("children", "Govt_job", "Never_worked", "Private" or "Self-employed");
  • Residence_type: type of residence ("Rural" or "Urban");
  • avg_glucose_level: average blood glucose level;
  • bmi: body mass index;
  • smoking_status: smoking status ("formerly smoked", "never smoked", "smokes" or "Unknown");
  • stroke: whether a stroke has occurred (0 - no, 1 - yes).

Correlation analysis

Correlation results:

Positive association with stroke risk:

age: 0.245239 - older people have a higher risk;

heart_disease: 0.134905 - people with heart disease have a higher risk of stroke;

hypertension: 0.127891 - people with hypertension have a higher risk of stroke;

avg_glucose_level: 0.131991 - higher glucose levels are associated with a higher risk of stroke;

ever_married_Yes: 0.108299 - married people have a slightly higher risk of stroke;

smoking_status_formerly smoked: 0.064683 - former smokers have a higher risk of stroke;

work_type_Self-employed: 0.062150 - self-employed persons have a higher risk of stroke;

bmi: 0.036075 - higher body mass index is associated with higher risk of stroke;


Positive but very weak association:

Residence_type_Urban: 0.015415 - urban residence;

work_type_Private: 0.011927 - work in the private sector;

gender_Male: 0.009081 - male gender;

smoking_status_smokes: 0.008920 - smokers;

work_type_Govt_job: 0.002660 - work in the civil service;


Negative but very weak association:

smoking_status_never smoked: -0.004163 - people who have never smoked;

gender_Female: -0.009081 - female gender;

work_type_Never_worked: -0.014885 - people who have never worked;

Residence_type_Rural: -0.015415 - rural residence;


Negative association with stroke risk:

smoking_status_Unknown: -0.055924 - people with unknown smoking status;

work_type_children: -0.083888 - children have a lower risk of stroke;

ever_married_No: -0.108299 - unmarried people have a lower risk of stroke.
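Coefficients of this kind can be obtained by one-hot encoding the categorical columns and then correlating every column with stroke. The following is a minimal sketch, assuming pandas; the small inline table is made-up illustration data rather than the project's dataset, and the function name stroke_correlations is hypothetical.

```python
import pandas as pd


def stroke_correlations(df: pd.DataFrame, target: str = "stroke") -> pd.Series:
    """Pearson correlation of every (one-hot encoded) column with the target."""
    # One-hot encode text columns, e.g. gender -> gender_Female / gender_Male.
    encoded = pd.get_dummies(df, dtype=float)
    # Correlate each column with the target, drop the target itself, sort descending.
    corr = encoded.corr()[target].drop(target)
    return corr.sort_values(ascending=False)


# Made-up rows for illustration only (NOT the project's data).
toy = pd.DataFrame(
    {
        "age": [25, 47, 63, 71, 80, 34],
        "hypertension": [0, 0, 1, 1, 1, 0],
        "gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
        "stroke": [0, 0, 0, 1, 1, 0],
    }
)

print(stroke_correlations(toy))
```

Because the two dummy columns of a binary category are complements of each other, their coefficients are equal in magnitude and opposite in sign, which is why pairs such as gender_Male / gender_Female and ever_married_Yes / ever_married_No mirror each other in the results above.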

Model Comparison

| Model | Dataset | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|---|
| Logistic Regression | Train | 0.142370 | 0.839196 | 0.243440 | 0.746024 |
| Decision Tree | Train | 0.443207 | 1.000000 | 0.614198 | 0.938830 |
| Random Forest | Train | 0.503856 | 0.984925 | 0.666667 | 0.952043 |
| Gradient Boosting | Train | 0.991736 | 0.603015 | 0.750000 | 0.980426 |
| Balanced Random Forest | Train | 0.208814 | 1.000000 | 0.345486 | 0.815513 |
| Logistic Regression | Test | 0.130282 | 0.740000 | 0.221557 | 0.745597 |
| Decision Tree | Test | 0.141593 | 0.320000 | 0.196319 | 0.871820 |
| Random Forest | Test | 0.171053 | 0.260000 | 0.206349 | 0.902153 |
| Gradient Boosting | Test | 0.222222 | 0.040000 | 0.067797 | 0.946184 |
| Balanced Random Forest | Test | 0.153846 | 0.720000 | 0.253521 | 0.792564 |
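A comparison of this shape can be produced by scoring a fitted model on both the training and the test split. The sketch below is an illustration under stated assumptions, not the project's pipeline: it uses synthetic imbalanced data from make_classification, a single LogisticRegression with class_weight="balanced", and a hypothetical helper named evaluate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate(model, X, y):
    """Return the four table metrics, computed for the positive class."""
    pred = model.predict(X)
    return {
        "Precision": precision_score(y, pred, zero_division=0),
        "Recall": recall_score(y, pred, zero_division=0),
        "F1_score": f1_score(y, pred, zero_division=0),
        "Accuracy": accuracy_score(y, pred),
    }


# Synthetic stand-in data with roughly 5% positives (NOT the stroke dataset).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
# Stratify so both splits keep the same share of positives.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" is one common way to handle class imbalance;
# whether the project used it is an assumption.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

splits = {"Train": (X_train, y_train), "Test": (X_test, y_test)}
for name, (X_part, y_part) in splits.items():
    scores = evaluate(model, X_part, y_part)
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Reporting both splits side by side makes overfitting visible: in the table above, Decision Tree and Random Forest reach a recall of about 1.0 on the training data but only 0.32 and 0.26 on the test data.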

[Figure: confusion matrix for LogisticRegression]

[Figure: confusion matrix for BalancedRandomForestClassifier]

💓Summing up

LogisticRegression and BalancedRandomForestClassifier show the highest sensitivity (recall) on the test dataset: 0.74 and 0.72, respectively. When precision and F1 score are also taken into account, BalancedRandomForestClassifier (precision 0.15, F1 0.25) is slightly ahead of LogisticRegression (precision 0.13, F1 0.22) and may be the more suitable choice for this task.

If the primary goal is not to miss patients, i.e. minimize false negatives, then the key metric is Recall, namely the sensitivity for class 1.

Recall is the proportion of correctly identified patients among all patients who actually had a stroke. It is the critical metric when the main concern is a sick patient being classified as healthy.
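In terms of confusion-matrix counts for class 1 (TP = true positives, FN = false negatives, FP = false positives), the metrics discussed in this section are defined as follows. The symbols are standard notation, not names used in the project:

```latex
\[
\text{Recall} = \frac{TP}{TP + FN},
\qquad
\text{Precision} = \frac{TP}{TP + FP},
\qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```

Lowering FN raises recall, while precision depends only on how many of the patients flagged as positive are actually positive.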

With recall for the positive class as the main metric, LogisticRegression achieves the highest recall for class 1 on the test data (0.74), followed closely by BalancedRandomForestClassifier (0.72).

This means that 74% and 72% of sick patients, respectively, were correctly identified, which matters when a diagnostic error means a sick patient is classified as healthy.

However, it is important to remember that high recall can come with low precision, that is, a large number of false positives. In my case, the precision for class 1 remains low: 0.13 for LogisticRegression and 0.15 for BalancedRandomForestClassifier.
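This trade-off can be made concrete with confusion-matrix counts. The counts below are an assumption: they were chosen because they reproduce the Logistic Regression test-row metrics reported in the comparison table, and they are not read from the project's confusion matrix. The helper name class1_metrics is hypothetical.

```python
def class1_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 and accuracy for the positive class (class 1)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Assumed counts, consistent with the reported Logistic Regression test metrics.
metrics = class1_metrics(tp=37, fp=247, fn=13, tn=725)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts, 284 patients are flagged as at risk but only 37 of them are true positives, which is what a precision of about 0.13 means in practice. At the same time, 37 of the 50 positive cases are caught, giving the recall of 0.74.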

Therefore, if the main goal is not to miss sick patients, then LogisticRegression is the most appropriate model.

Run locally

Clone the repository:

git clone https://github.com/MariiaSam/Stroke-Prediction.git
cd Stroke-Prediction

Set up the virtual environment with Poetry

Set up project dependencies:

poetry install

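To activate the virtual environment, run the command below (in Poetry 2.x the shell command is provided by the poetry-plugin-shell plugin, and poetry env activate is the built-in alternative):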

poetry shell

To add a dependency to a project, run the command:

poetry add <package_name>

To reinstall the dependencies already declared for the project (for example, after pulling new changes):

poetry install

Usage

Run the Streamlit application with the command:

streamlit run app.py