
Machine learning project to predict California house prices using regression models on census data. Includes data download, preprocessing, and model training.

ngusadeep/house-price-prediction

House Price Prediction

Introduction

The House Price Prediction project is a supervised machine learning task aimed at estimating the median value of residential homes in California districts. Using data derived from the 1990 U.S. Census, the project applies regression techniques to learn relationships between various housing and demographic features and the target variable: median house price.

Dataset Description

The dataset contains aggregated data for California districts, including:

  • Median Income: The median income of residents in the district.
  • House Age: Median age of the houses in the district.
  • Rooms and Bedrooms: Total number of rooms and bedrooms in the district.
  • Population and Households: Number of people and households in the district.
  • Geographical Coordinates: Latitude and longitude indicating the district’s location.
  • Median House Value: The target variable representing the median price of houses.

Methodology

Data Acquisition and Preparation

The raw dataset is programmatically downloaded and extracted to ensure reproducibility. Initial exploration includes inspecting data types, missing values, and basic statistics.
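A download-and-extract step of this kind can be sketched with the standard library alone. This is a minimal sketch, not the contents of `fetch_data.py`: the function name `download_and_extract`, the archive name `housing.tgz`, and the `.tgz` format are assumptions made for illustration, and the URL is left to the caller.

```python
import tarfile
import urllib.request
from pathlib import Path


def download_and_extract(url: str, target_dir: str) -> Path:
    """Download a .tgz archive from `url` and unpack it into `target_dir`.

    Hypothetical helper: the name, the archive file name, and the
    archive format are assumptions, not the project's actual API.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    archive_path = target / "housing.tgz"
    # Skip the download when the archive is already present, so reruns are cheap.
    if not archive_path.exists():
        urllib.request.urlretrieve(url, archive_path)
    # Unpack every member of the archive next to the archive itself.
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=target)
    return target
```

Keeping the download idempotent (skipping it when the archive already exists) means the notebook can be rerun from the top without fetching the data again.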

Exploratory Data Analysis (EDA)

Feature distributions and their correlations with the target variable are visualized. Geographic plots of latitude against longitude highlight spatial trends in housing prices.
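One way to rank features by their linear correlation with the target is sketched below. The column names (`median_income`, `housing_median_age`, `ocean_proximity`, `median_house_value`) and the toy values are assumptions for illustration; the helper `correlations_with_target` is hypothetical and not part of the project.

```python
import pandas as pd


def correlations_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric column with `target`, sorted descending."""
    # Text columns cannot be correlated directly, so keep numeric ones only.
    numeric = df.select_dtypes(include="number")
    return numeric.corr()[target].drop(target).sort_values(ascending=False)


# Toy data with made-up values, standing in for the real housing DataFrame.
toy = pd.DataFrame({
    "median_income": [1.5, 3.0, 4.5, 6.0, 8.0],
    "housing_median_age": [40, 35, 20, 15, 10],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "NEAR OCEAN"],
    "median_house_value": [90000, 150000, 210000, 300000, 420000],
})

ranking = correlations_with_target(toy, "median_house_value")
print(ranking)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a useful nonlinear dependence.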

Feature Engineering and Preprocessing

New features, such as rooms-per-household or bedrooms-per-room ratios, are derived to enhance predictive power. Categorical features are encoded, and numerical features are scaled or normalized as appropriate.
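The ratio features mentioned above could be derived as in the following sketch. It assumes the DataFrame has per-district columns named `total_rooms`, `total_bedrooms`, and `households`; those names and the helper `add_ratio_features` are assumptions, not confirmed by the project.

```python
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with two derived ratio columns added."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    return out


# Toy districts with made-up counts.
districts = pd.DataFrame({
    "total_rooms": [1000.0, 600.0],
    "total_bedrooms": [200.0, 150.0],
    "households": [250.0, 200.0],
})
enriched = add_ratio_features(districts)
print(enriched[["rooms_per_household", "bedrooms_per_room"]])
```

Ratios like these normalize raw district totals by district size, which makes districts of different sizes comparable.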

Model Development

Multiple regression models are evaluated, including:

  • Linear Regression
  • Decision Tree Regressor
  • Random Forest Regressor

Models are trained on a training set and evaluated on a hold-out test set to measure their generalization performance.
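The train-then-evaluate loop for the three model families can be sketched as follows. The data here is synthetic and generated on the spot, so the printed numbers say nothing about the housing dataset; the split ratio, random seeds, and default hyperparameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the prepared housing features.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# Hold out 20% of the rows; they are never seen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

test_rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # RMSE is the square root of the mean squared error on the hold-out set.
    test_rmse[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

for name, value in test_rmse.items():
    print(f"{name}: test RMSE = {value:.3f}")
```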

Model Evaluation

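Performance metrics such as Mean Squared Error (MSE) and its square root, Root Mean Squared Error (RMSE), quantify prediction error; RMSE is expressed in the same units as the target (house value). Cross-validation is used to tune hyperparameters and to detect overfitting before the final test-set evaluation.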
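As a small worked example, the sketch below computes MSE and RMSE by hand and then uses 5-fold cross-validation to compare a few tree depths. The data is synthetic, and the helper `mse_and_rmse`, the candidate depths, and the fold count are assumptions for illustration rather than the project's settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def mse_and_rmse(y_true, y_pred):
    """Return (MSE, RMSE) for two equal-length sequences of numbers."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = float(np.mean(errors ** 2))
    return mse, float(np.sqrt(mse))


# Squared errors are 100, 400, 400, 100, so MSE = 250 and RMSE = sqrt(250).
mse_value, rmse_value = mse_and_rmse([100, 200, 300, 400], [110, 180, 320, 390])
print(mse_value, rmse_value)

# Synthetic data standing in for the prepared training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=200)

cv_rmse_by_depth = {}
for max_depth in (2, 4, 8):
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    # scikit-learn reports this metric negated so that larger is always better.
    scores = cross_val_score(
        model, X, y, scoring="neg_root_mean_squared_error", cv=5
    )
    cv_rmse_by_depth[max_depth] = float(-scores.mean())

best_depth = min(cv_rmse_by_depth, key=cv_rmse_by_depth.get)
print(cv_rmse_by_depth, "best depth:", best_depth)
```

Because every fold is scored on rows the model did not train on, a depth that merely memorizes the training data shows up as a worse cross-validated RMSE.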

Results

The Random Forest Regressor provided the best balance between bias and variance, achieving an RMSE of approximately [your result here]. Visualization of predictions versus actual prices confirms the model’s ability to capture key trends in the housing market.
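A predicted-versus-actual plot of the kind described can be sketched with matplotlib as below. The values are randomly generated placeholders, not results from this project, and the output file name `predicted_vs_actual.png` is an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: random "actual" values plus noise as "predictions".
rng = np.random.default_rng(0)
actual = rng.uniform(50_000, 500_000, size=200)
predicted = actual + rng.normal(scale=40_000, size=200)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(actual, predicted, s=10, alpha=0.5)
# Points on the dashed diagonal would be perfect predictions.
limits = [actual.min(), actual.max()]
ax.plot(limits, limits, color="black", linestyle="--", label="perfect prediction")
ax.set_xlabel("Actual median house value")
ax.set_ylabel("Predicted median house value")
ax.legend()
fig.savefig("predicted_vs_actual.png", dpi=100)
plt.close(fig)
```

In a notebook the `matplotlib.use("Agg")` line and the `savefig` call can be dropped, since figures are displayed inline.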

Conclusion

This project demonstrates the end-to-end process of building a regression model for predicting house prices: data acquisition, cleaning, feature engineering, model training, evaluation, and interpretation. It provides a foundation for further enhancements such as incorporating additional data sources or more advanced algorithms.

How to Use

  1. Download and extract the data using the provided script:

      from fetch_data import fetch_housing_data
      fetch_housing_data()

  2. Load the dataset into a pandas DataFrame:

      from fetch_data import load_housing_data
      housing = load_housing_data()

  3. Proceed with exploration, preprocessing, modeling, and evaluation.
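A first inspection of the loaded data might look like the sketch below. A small hand-made DataFrame stands in for the result of `load_housing_data()`; its column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Stand-in for `housing = load_housing_data()`; values are made up.
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
    "median_house_value": [452600.0, 358500.0, 341300.0, 269700.0],
})

# Data types: numeric columns versus text (categorical) columns.
print(housing.dtypes)

# Missing values per column.
missing_counts = housing.isna().sum()
print(missing_counts)

# Basic statistics for the numeric columns.
summary = housing.describe()
print(summary)

# Frequency of each category in the text column.
category_counts = housing["ocean_proximity"].value_counts()
print(category_counts)
```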

Dependencies

  • Python 3.7+
  • pandas
  • numpy
  • matplotlib
  • scikit-learn
  • jupyter (optional)

Install dependencies with:

pip install pandas numpy matplotlib scikit-learn jupyter

Project Structure

├── datasets/
│   └── housing/                    # Dataset files after download and extraction
├── fetch_data.py                  # Data download and loading utility functions
├── house_price_prediction.ipynb  # Jupyter notebook with exploration and modeling
└── README.md                     # Project overview and instructions

References

Dataset source and inspiration from: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd Edition) by Aurélien Géron
