The House Price Prediction project is a supervised machine learning task aimed at estimating the median value of residential homes in California districts. Using data derived from the 1990 U.S. Census, the project applies regression techniques to learn relationships between various housing and demographic features and the target variable: median house price.
The dataset contains aggregated data for California districts, including:
- Median Income: The median income of residents in the district.
- House Age: Average age of the houses.
- Rooms and Bedrooms: Average number of rooms and bedrooms per house.
- Population and Households: Number of people and households in the district.
- Geographical Coordinates: Latitude and longitude indicating the district’s location.
- Median House Value: The target variable representing the median price of houses.
The raw dataset is programmatically downloaded and extracted to ensure reproducibility. Initial exploration includes inspecting data types, missing values, and basic statistics.
The project involves visualizing feature distributions and their correlations with the target variable. Geographic plotting highlights spatial trends in housing prices.
New features, such as rooms-per-household or bedrooms-per-room ratios, are derived to enhance predictive power. Categorical features are encoded, and numerical features are scaled or normalized as appropriate.
Multiple regression models are evaluated, including:
- Linear Regression
- Decision Tree Regressor
- Random Forest Regressor
Models are trained on a training set and evaluated on a hold-out test set to measure their generalization performance.
Performance metrics such as Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) quantify the accuracy of predictions. Cross-validation techniques help tune hyperparameters and prevent overfitting.
The Random Forest Regressor provided the best balance between bias and variance, achieving an RMSE of approximately [your result here]. Visualization of predictions versus actual prices confirms the model’s ability to capture key trends in the housing market.
This project successfully demonstrates the process of building a regression model for predicting house prices. It encompasses data acquisition, cleaning, feature engineering, model training, evaluation, and interpretation — providing a solid foundation for further enhancements such as incorporating additional data sources or advanced algorithms.
- Download and extract the data using the provided script:
from fetch_data import fetch_housing_data
fetch_housing_data()
- Load the dataset into a pandas DataFrame:
from fetch_data import load_housing_data
housing = load_housing_data()
- Proceed with exploration, preprocessing, modeling, and evaluation.
- Python 3.7+
- pandas
- numpy
- matplotlib
- scikit-learn
- jupyter (optional)
Install dependencies with:
pip install pandas numpy matplotlib scikit-learn jupyter
├── datasets/
│ └── housing/ # Dataset files after download and extraction
├── fetch_data.py # Data download and loading utility functions
├── house_price_prediction.ipynb # Jupyter notebook with exploration and modeling
├── README.md # Project overview and instructions
Dataset source and inspiration from: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd Edition) by Aurélien Géron