Housing Pricing Prediction

A model to predict resale prices of HDB flats in Singapore.

Data Analysis, Machine Learning, scikit-learn, PyTorch, TensorFlow, Pandas, NumPy, Time Series, XGBoost, CatBoost

Problem

HDB resale flat prices in Singapore are shaped by dozens of geospatial and temporal factors. Existing valuation tools are opaque, fail to account for local amenity density, and cannot forecast future price trends.

Solution

Built a multi-model forecasting pipeline across 16 Jupyter notebooks: engineered 20+ features from MRT proximity, school density, BTO supply, and SORA rates, then benchmarked XGBoost, Random Forest, CatBoost, GNNWR, and LSTM head-to-head.

Achievements

  • 5 ML models benchmarked on the same dataset
  • 20+ engineered geospatial and temporal features
  • OneMapSG API integration for coordinate resolution
  • Best model achieved <8% MAPE on 2024 held-out data

This project develops time series forecasting models for predicting housing prices in Singapore's HDB market using transaction data and geographical information.

Housing Price Prediction Methodology

ML Pipeline

Data Cleaning & Processing
  • Scraped MRT stations, shopping malls, and primary schools with their coordinates and opening dates
  • Cleaned and merged HDB resale flat prices datasets
  • Calculated coordinates for each HDB flat using OneMapSG API
  • Standardized dates and values across multiple datasets
  • Processed SORA (Singapore Overnight Rate Average) dataset for financial context
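
The standardization and SORA steps above could look roughly like the following pandas sketch. The column names (`month`, `town`, `resale_price`, `date`, `sora`), the toy values, and the choice to average SORA per month are all assumptions made for illustration, not details taken from the project notebooks.

```python
import pandas as pd

# Toy resale records; column names and values are illustrative assumptions.
resale = pd.DataFrame({
    "month": ["2023-01", "2023-02", "2023-02"],
    "town": ["BISHAN", "bishan ", "Tampines"],
    "resale_price": ["650,000", "700000", "520000"],
})

# Toy daily SORA observations in percent; values are made up.
sora = pd.DataFrame({
    "date": ["2023-01-03", "2023-01-31", "2023-02-01", "2023-02-28"],
    "sora": [3.0, 3.2, 3.4, 3.6],
})


def standardize_resale(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize month, town casing/whitespace, and numeric price."""
    out = df.copy()
    out["month"] = pd.to_datetime(out["month"], format="%Y-%m").dt.to_period("M")
    out["town"] = out["town"].str.strip().str.upper()
    out["resale_price"] = (
        out["resale_price"].str.replace(",", "", regex=False).astype(float)
    )
    return out


def monthly_sora(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily SORA to one mean value per calendar month (an assumption)."""
    out = df.copy()
    out["month"] = pd.to_datetime(out["date"]).dt.to_period("M")
    return out.groupby("month", as_index=False)["sora"].mean()


# A left join keeps every transaction, even if no SORA value exists for its month.
merged = standardize_resale(resale).merge(monthly_sora(sora), on="month", how="left")
print(merged)
```

Joining on a monthly period rather than a raw string avoids mismatches between date formats across the source datasets.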
Feature Engineering
  • Identified MRT stations within a 1 km radius of each HDB flat
  • Calculated BTO supply within a 4 km radius of each transaction
  • Identified nearby malls within a 1 km radius
  • Measured distance to Central Business District
  • Created POI density vectors using word embeddings
  • Applied scaling for numerical features and one-hot encoding for categorical features
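
The radius-based features above (MRT stations and malls within 1 km, distance to the CBD) can be sketched with a great-circle distance. This is a minimal illustration that assumes the haversine formula; the coordinates, the CBD reference point, and the helper names `haversine_km` and `count_within` are hypothetical and not taken from the project.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_within(flat: tuple, points: list, radius_km: float) -> int:
    """Count points whose distance to the flat is at most radius_km."""
    return sum(
        1 for lat, lon in points
        if haversine_km(flat[0], flat[1], lat, lon) <= radius_km
    )


# Illustrative coordinates only (assumptions, not real station locations).
flat = (1.3500, 103.8500)
mrt_stations = [(1.3550, 103.8500), (1.3500, 103.8700)]
cbd_reference = (1.2840, 103.8510)  # assumed approximate CBD point

mrt_within_1km = count_within(flat, mrt_stations, radius_km=1.0)
dist_to_cbd_km = haversine_km(*flat, *cbd_reference)
print(mrt_within_1km, round(dist_to_cbd_km, 2))
```

For a full dataset, a spatial index such as a ball tree would avoid comparing every flat with every amenity, but the brute-force loop keeps the idea visible.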

Model Building & Comparison

XGBoost
Gradient Boosting

Trained on the working dataset and evaluated on 2024 resale prices.
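
As a rough illustration of this evaluation setup, the sketch below holds out one year of transactions and scores predictions with MAPE. The record layout, the toy prices, and the mean-price baseline standing in for a trained model are assumptions for illustration only.

```python
def split_by_year(records: list, holdout_year: int) -> tuple:
    """Train on years before holdout_year; test on holdout_year itself."""
    train = [r for r in records if r["year"] < holdout_year]
    test = [r for r in records if r["year"] == holdout_year]
    return train, test


def mape(actual: list, predicted: list) -> float:
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Toy transactions (made-up prices).
records = [
    {"year": 2022, "price": 400_000},
    {"year": 2023, "price": 500_000},
    {"year": 2024, "price": 450_000},
    {"year": 2024, "price": 500_000},
]

train, test = split_by_year(records, holdout_year=2024)

# Placeholder model: predict the mean training price for every held-out row.
baseline = sum(r["price"] for r in train) / len(train)
actual = [r["price"] for r in test]
predicted = [baseline] * len(test)

score = mape(actual, predicted)
print(f"MAPE on held-out year: {score:.2f}%")
```

Splitting by year rather than at random keeps future transactions out of the training set, which matters when the goal is forecasting.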

Random Forest
Ensemble Learning

Evaluated using out-of-bag (OOB) error estimates and 10-fold cross-validation.
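
A minimal scikit-learn sketch of these two validation approaches is shown below. The synthetic data, the hyperparameters, and the use of R² as the score are assumptions; the project's actual features and settings are not stated here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the engineered features and resale prices.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 300_000 + 200_000 * X[:, 0] - 50_000 * X[:, 1] + rng.normal(0, 5_000, size=300)

# OOB: each tree is scored on the bootstrap samples it did not see during training.
oob_model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
oob_model.fit(X, y)
oob_r2 = oob_model.oob_score_

# 10-fold cross-validation with a fresh model per fold.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
cv_r2 = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X, y, cv=cv, scoring="r2",
)

print(f"OOB R^2: {oob_r2:.3f}")
print(f"10-fold R^2: {cv_r2.mean():.3f} +/- {cv_r2.std():.3f}")
```

The OOB estimate comes for free from a single fit, while cross-validation costs ten fits but gives a spread across folds.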

CatBoost
Gradient Boosting

Trained on cleaned and normalized data with categorical feature support.

GNNWR
Geospatial Model

Incorporates latitude and longitude via neural networks for geospatial weighting.

LSTM
Deep Learning

Three-layer neural network with dropout regularization for capturing temporal dependencies.
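
One way such a network might be laid out in PyTorch is sketched below. It assumes that "three layers" means three stacked LSTM layers and that the model predicts a single price from the last time step; the class name `PriceLSTM`, the hidden size, the dropout rate, and the input shape are all hypothetical, and the project may equally have used TensorFlow.

```python
import torch
from torch import nn


class PriceLSTM(nn.Module):
    """Hypothetical three-layer stacked LSTM with dropout between layers."""

    def __init__(self, n_features: int, hidden_size: int = 32, dropout: float = 0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=hidden_size,
            num_layers=3,       # three stacked recurrent layers (assumption)
            dropout=dropout,    # applied between LSTM layers, not after the last
            batch_first=True,   # inputs shaped (batch, time, features)
        )
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)
        # Use the final time step's hidden state to predict one value per sequence.
        return self.head(out[:, -1, :]).squeeze(-1)


model = PriceLSTM(n_features=5)
batch = torch.randn(4, 12, 5)  # 4 sequences, 12 time steps, 5 features
predictions = model(batch)
print(predictions.shape)
```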

Figures: XGBoost Model Performance; Random Forest Model Performance; CatBoost Model Performance