jalhane88/football-prediction-analysis

Football Match Outcome Prediction & Market Efficiency Analysis (EPL 2021-2022)

Sections

I. Objective
II. Data
III. Methodology / Pipeline
IV. Results & Discussion
V. Conclusion
VI. Future Work
VII. How to Run

I. Objective

The primary goal of this project was to develop and evaluate machine learning models for predicting the outcome (Home Win, Draw, Away Win) of English Premier League (EPL) football matches using historical data from the 2021-2022 season.

Secondary objectives included:

  1. Analyzing the efficiency and calibration of readily available betting market odds.
  2. Engineering relevant features reflecting team form, season-long performance, and underlying match statistics.
  3. Comparing the performance of different modeling approaches (Logistic Regression, XGBoost).
  4. Investigating whether models combining engineered features with market odds could produce probability estimates better calibrated than the market average, thereby identifying possible value-betting scenarios.

II. Data

The analysis utilized publicly available match data for the English Premier League (EPL) 2021-2022 season.

The dataset was sourced from Football-Data.co.uk.

Key information within the dataset included:

  • Match Details: Date, Home Team, Away Team.
  • Results: Full-Time Home/Away Goals (FTHG, FTAG), Full-Time Result (FTR - H/D/A).
  • Match Statistics: Shots (HS, AS), Shots on Target (HST, AST), Corners (HC, AC), Fouls (HF, AF), Cards (Yellow/Red). The columns used for our analysis and feature engineering (V2 features) were: Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HS, AS, HST, AST, HC, AC.
  • Betting Odds: Closing odds from various bookmakers for the Home Win, Draw, and Away Win outcomes (e.g., Bet365, Pinnacle, and the cross-bookmaker averages AvgH/AvgD/AvgA).
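As a rough illustration of how such a file might be loaded, here is a minimal sketch. The column names follow Football-Data.co.uk's conventions as listed above, and the inline sample stands in for the real `data/epl_2021_2022.csv`; this is not the project's actual loading code.

```python
from io import StringIO
import pandas as pd

# Columns used for the analysis and V2 feature engineering (see Section II).
COLS = ["Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR",
        "HS", "AS", "HST", "AST", "HC", "AC", "AvgH", "AvgD", "AvgA"]

def load_matches(csv_source) -> pd.DataFrame:
    """Load a Football-Data.co.uk CSV, keep analysis columns, parse dates."""
    df = pd.read_csv(csv_source, usecols=COLS)
    # Football-Data dates are day-first (e.g. 13/08/2021).
    df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
    return df.sort_values("Date").reset_index(drop=True)

# Tiny inline sample in lieu of data/epl_2021_2022.csv:
sample = StringIO(
    "Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC,AvgH,AvgD,AvgA\n"
    "13/08/2021,Brentford,Arsenal,2,0,H,12,22,6,4,2,5,4.33,3.60,1.90\n"
)
matches = load_matches(sample)  # Date column parsed as datetime64
```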

III. Methodology / Pipeline

The project followed a standard data science workflow, implemented primarily using Python, Pandas, Scikit-learn, and XGBoost. The main steps were:

  1. Data Loading & Initial Cleaning:

    • Loaded the raw EPL 2021-2022 match data CSV.
    • Converted date columns to the appropriate datetime format.
    • Performed initial inspection of data types and missing values.
    • (Code Reference: initial steps in notebooks/main_workflow.ipynb).
  2. Exploratory Data Analysis (EDA):

    • Analyzed the distribution of match outcomes (Home Win, Draw, Away Win).
    • Calculated implied probabilities from average bookmaker odds (AvgH, AvgD, AvgA).
    • Assessed betting market efficiency by:
      • Visualizing implied probabilities vs. actual outcome frequencies (calibration plots).
      • Calculating Brier scores to quantify probability calibration.
      • Checking for systematic biases (e.g., favorite-longshot bias).
    • Key Finding from EDA: The betting market odds were found to be reasonably well-calibrated on average (Avg Brier Score ~0.19), indicating a relatively efficient market, though not perfectly predictive.
    • (Code Reference: notebooks/eda_analysis.ipynb).
  3. Feature Engineering:

    • Rationale: To provide models with richer information beyond just raw odds, capturing team momentum, season-long strength, and underlying performance metrics.
    • Implementation: Defined reusable functions for feature creation. (Code Reference: src/feature_engineering.py).
    • Features Created:
      • Implied Probabilities: Normalized probabilities derived from average odds.
      • Rolling Features (Momentum): Calculated metrics over the previous 5 matches (N=5) for each team, including: Points Sum, Goal Difference Sum, Average Goals (For/Against), Average Shots (For/Against), Average Shots on Target (SoT) (For/Against), Average Corners (For/Against).
      • Cumulative Features (Season Strength): Calculated season-to-date cumulative metrics for each team prior to the current match, including: Points, Goal Difference, Goals (F/A), Shots (F/A), SoT (F/A), Corners (F/A), Matches Played.
      • Difference Features: Calculated the difference between the home team's and away team's values for all engineered rolling and cumulative features (e.g., Cumul_Points_Diff, Avg_STFor_L5_Diff).
    • (Code Reference for Execution: Called sequentially within notebooks/main_workflow.ipynb).
  4. Modeling & Evaluation:

    • Data Splitting: Implemented a chronological train/test split (80% train, 20% test) based on match date to prevent data leakage and simulate real-world prediction.
    • Models Tested:
      • Logistic Regression: Used as a baseline, tested with different feature sets (form only, form + odds, V2 features). Feature scaling (StandardScaler) was applied.
      • XGBoost: Tested as a more complex, non-linear model with default hyperparameters, using feature sets including odds.
    • Evaluation: Assessed models on the unseen test set using:
      • Accuracy: Overall correct prediction rate.
      • Classification Report: Precision, Recall, F1-score per class (H/D/A).
      • Log Loss: Metric penalizing confident wrong probability predictions.
      • Brier Score: Measured the accuracy (calibration) of predicted probabilities, compared against the market average Brier score.
    • Probability Analysis: Compared the best model's predicted probabilities against bookmaker implied probabilities to identify systematic disagreements and potential value.
    • (Code Reference: notebooks/main_workflow.ipynb).
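The rolling-form and chronological-split steps above can be sketched as follows. This is an illustrative, points-only simplification, not the project's actual src/feature_engineering.py (which covers goals, shots, SoT, and corners as well); the function and column names are hypothetical.

```python
import pandas as pd

def add_rolling_points(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Add last-n-match points for each side, plus their difference.

    Expects Date, HomeTeam, AwayTeam, FTR columns. shift(1) ensures only
    matches strictly before the current one enter the window (no leakage);
    a team's first match of the season therefore gets NaN.
    """
    home = df.assign(Team=df["HomeTeam"], is_home=True,
                     Pts=df["FTR"].map({"H": 3, "D": 1, "A": 0}))
    away = df.assign(Team=df["AwayTeam"], is_home=False,
                     Pts=df["FTR"].map({"A": 3, "D": 1, "H": 0}))
    # One row per team per match, in chronological order.
    long = pd.concat([home, away]).sort_values("Date", kind="stable")
    long["Pts_L5"] = long.groupby("Team")["Pts"].transform(
        lambda s: s.shift(1).rolling(n, min_periods=1).sum())
    out = df.copy()
    out["Home_Pts_L5"] = long.loc[long["is_home"], "Pts_L5"]
    out["Away_Pts_L5"] = long.loc[~long["is_home"], "Pts_L5"]
    out["Pts_L5_Diff"] = out["Home_Pts_L5"] - out["Away_Pts_L5"]
    return out

def chronological_split(df: pd.DataFrame, frac: float = 0.8):
    """80/20 split by position; df must already be sorted by Date."""
    cut = int(len(df) * frac)
    return df.iloc[:cut], df.iloc[cut:]
```

The same shift-then-aggregate pattern extends to the cumulative season-to-date features (replace the rolling sum with an expanding one).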

IV. Results & Discussion

Several models were trained and evaluated on the chronologically held-out test set (final 20% of the 2021-2022 season). Key findings include:

  1. Baseline Performance: Models using only simple rolling form features (without odds) performed poorly, barely exceeding random chance and yielding probability predictions worse than the market average.

  2. Impact of Market Odds: Incorporating bookmaker implied probabilities (Implied_H/D/A) as features provided a significant performance boost, particularly for the Logistic Regression model.

    • The best Logistic Regression model (M2: simple form differences + odds) achieved a test set Accuracy of ~57.3% and a Log Loss of ~0.871.
  3. Probability Calibration: The most notable result was the improvement in probability calibration achieved by the M2 Logistic Regression model.

    • It achieved an Average Brier Score of ~0.173 on the test set.
    • This was lower (better) than the average Brier score estimated for the raw bookmaker odds (~0.19), suggesting the model's probabilities, derived from combining odds and form, were better calibrated overall. The improvement was particularly noticeable for Away win predictions.
  4. Draw Prediction Difficulty: A consistent challenge across models was the prediction of draws.

    • The best performing Logistic Regression models (M2 and V2 features) failed to predict any draws in the test set, focusing instead on distinguishing Home vs. Away wins.
    • Attempts to force draw predictions using class_weight='balanced' in Logistic Regression successfully produced some draw predictions but significantly degraded overall accuracy and probability calibration.
    • The default XGBoost models also struggled, either predicting draws very poorly or not at all.
  5. Impact of Enhanced Features (V2): Adding more detailed features (cumulative stats, rolling averages of shots, SoT, corners, etc.) did not lead to improved performance for either Logistic Regression or default XGBoost on the test set compared to the simpler M2 model.

    • LogReg V2 Accuracy was slightly higher (~59.2%) but Log Loss (~0.890) and Brier Score (~0.175) were slightly worse than LogReg M2.
    • XGBoost V2 performed significantly worse than LogReg M2 on all key metrics (Accuracy ~54.0%, Log Loss ~1.021, Brier Score ~0.199).
  6. Feature Importance: In models that included odds, the Implied_H, Implied_D, and Implied_A features consistently showed the highest importance, confirming the strong predictive power of the betting market. Form and statistical features provided secondary contributions.

  7. Probability Analysis (Value Identification): Analyzing the probability differences between the best model (LogReg M2) and the market revealed potential value:

    • In matches where the model's predicted probability for a Home or Away win significantly exceeded the bookmaker's implied probability (e.g., by >10%), the actual frequency of those outcomes was much closer to the model's higher estimate than to the bookmaker's lower one.
    • This suggests the model, despite its simplicity, was identifying subsets of potentially undervalued Home and Away teams based on the combination of odds and form features.

Discussion Summary: The results highlight the efficiency of the betting market, making it hard to beat with simple models. However, combining market odds with even basic form features within a Logistic Regression framework yielded probabilities potentially better calibrated than the market average. While discrete outcome prediction (especially for draws) remained challenging, the analysis suggests the model's probability outputs could offer valuable insights beyond raw market odds, particularly for identifying potentially mispriced Home/Away win likelihoods. The lack of improvement from more complex V2 features (with default models) suggests either the need for more sophisticated feature engineering/selection, hyperparameter tuning, or that the primary signal is already well-captured by the odds and basic form differences.
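The two calculations underpinning this comparison — stripping the bookmaker overround from the odds and scoring probability calibration — can be sketched as below. The per-class averaging convention in the Brier score is an assumption, chosen because it matches the ~0.17–0.19 magnitudes quoted above; the function names are hypothetical.

```python
import numpy as np

def implied_probs(odds_h: float, odds_d: float, odds_a: float) -> np.ndarray:
    """Normalise inverse decimal odds so they sum to 1 (removes the overround)."""
    inv = np.array([1.0 / odds_h, 1.0 / odds_d, 1.0 / odds_a])
    return inv / inv.sum()

def avg_brier(probs, outcomes) -> float:
    """Brier score averaged over matches and the three classes.

    probs: (n, 3) predicted probabilities for [H, D, A];
    outcomes: (n,) integer labels with 0=H, 1=D, 2=A.
    """
    onehot = np.eye(3)[np.asarray(outcomes)]
    return float(np.mean((np.asarray(probs) - onehot) ** 2))
```

Comparing `avg_brier(model_probs, y_test)` against `avg_brier(market_probs, y_test)` on the same held-out matches is what the calibration claims above rest on.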

V. Conclusion

This project demonstrated the process of building predictive models for EPL match outcomes and analyzing betting market efficiency using data from the 2021-2022 season.

The key conclusions are:

  1. Market Efficiency: Average bookmaker odds provide a strong baseline prediction and are reasonably well-calibrated, confirming the general efficiency of the betting market.
  2. Value of Odds in Modeling: Incorporating implied probabilities derived from market odds significantly enhances model performance compared to using statistical/form features alone.
  3. Calibration Improvement: A relatively simple Logistic Regression model combining market odds with basic rolling form difference features was capable of producing probability estimates (P(H), P(D), P(A)) that demonstrated better average calibration (lower Brier Score) than the raw market odds on the test set.
  4. Potential Value Identification: Analysis suggested this best model could identify subsets of matches where the market might be undervaluing Home or Away win probabilities, although this requires validation on a larger dataset.
  5. Modeling Challenges: Predicting the discrete match outcome accurately remains difficult, particularly for draws, even with enhanced features or more complex models like XGBoost (using default settings). Model complexity did not guarantee improved performance over a well-feature-engineered linear model in this case.

VI. Future Work

Several avenues could be explored to build upon this analysis:

  1. Extended Data & Validation: Test the developed models and feature engineering pipeline on multiple seasons of data (and potentially other leagues) to validate the findings, particularly the probability calibration improvements and value identification potential, on a larger, more robust dataset.
  2. Backtesting Betting Strategies: Implement a simulation framework using historical odds to quantitatively assess the potential profitability of strategies based on the discrepancies between the best model's probabilities and market odds (e.g., Kelly criterion, fixed staking).
  3. Hyperparameter Tuning: Perform systematic hyperparameter optimization (e.g., using GridSearchCV or RandomizedSearchCV) for models like XGBoost, especially with the richer V2 feature set, to potentially unlock better performance.
  4. Advanced Feature Engineering:
    • Explore alternative form metrics (e.g., exponentially weighted moving averages).
    • Incorporate team ratings (e.g., Elo, Glicko).
    • Engineer features based on match context (e.g., days since last match, importance of match).
    • Develop features specifically designed to capture draw likelihood.
  5. Alternative Modeling Approaches:
    • Investigate models specifically designed for ordered outcomes (if applicable) or specialized classification techniques.
    • Explore Poisson-based models that predict expected goals for each team rather than the direct H/D/A outcome.
    • Consider ensemble methods combining predictions from different models.
    • Consider metrics such as expected goals (xG) and its derivatives, and how they might be incorporated into features.
  6. Handling Draws: Research and implement specific techniques targeting the prediction of draws, such as specialized sampling methods or bespoke model architectures, if accurate draw prediction is a key requirement.
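As a hint of what the exponentially weighted form metric mentioned in item 4 might look like: a toy sketch over a hypothetical points series, where the halflife value is arbitrary.

```python
import pandas as pd

# Hypothetical points series for one team, ordered by match date.
pts = pd.Series([3, 0, 1, 3, 3])

# Exponentially weighted form: recent matches weigh more than in a flat
# last-5 rolling sum; shift(1) keeps the feature leakage-free, as in the
# existing rolling features.
ewm_form = pts.shift(1).ewm(halflife=3, min_periods=1).mean()
```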

VII. How to Run

  1. Clone the Repository:
    git clone https://github.com/jalhane88/football-prediction-analysis.git
    cd football-prediction-analysis
  2. Create/Activate Virtual Environment:
    • It is highly recommended to use a virtual environment.
    • Create one (e.g., using venv):
      python -m venv football_env
    • Activate it:
      • Windows: .\football_env\Scripts\activate
      • macOS/Linux: source football_env/bin/activate
  3. Install Dependencies:
    pip install -r requirements.txt
  4. Place Data: Ensure the raw data file (epl_2021_2022.csv or your specific filename) is placed inside the data/ directory.
  5. Run the Workflow:
    • Navigate to the notebooks/ directory if desired.
    • Open and run the cells sequentially in notebooks/main_workflow.ipynb using a Jupyter Notebook environment (like VS Code's notebook interface, Jupyter Lab, or Jupyter Notebook). Ensure the kernel selected corresponds to the football_env virtual environment.
    • The notebook will load data, call feature engineering functions from src/, perform train/test splitting, train models, evaluate them, and perform the final probability analysis.
