This project develops time-series forecasting models that predict resale prices in Singapore's HDB (Housing & Development Board) market from transaction data and geographical information.

ML Pipeline
Data Cleaning & Processing
- Scraped MRT stations, shopping malls, and primary schools with their coordinates and opening dates
- Cleaned and merged the HDB resale flat price datasets
- Geocoded each HDB flat to latitude/longitude coordinates using the OneMapSG API
- Standardized dates and values across multiple datasets
- Processed SORA (Singapore Overnight Rate Average) dataset for financial context
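The date standardization and SORA steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the list of input formats, the `month` field name, and the choice to average daily SORA values per month are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical input formats; the real datasets may use others.
DATE_FORMATS = ("%Y-%m", "%Y-%m-%d", "%d/%m/%Y")


def to_month_key(raw: str) -> str:
    """Normalize a date string to a 'YYYY-MM' key, trying each known format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def monthly_mean_sora(daily_rates):
    """Average (date_string, rate) pairs into one rate per month (assumed aggregation)."""
    buckets = defaultdict(list)
    for raw_date, rate in daily_rates:
        buckets[to_month_key(raw_date)].append(rate)
    return {month: sum(values) / len(values) for month, values in buckets.items()}


def attach_sora(transactions, daily_rates):
    """Return copies of the transaction dicts with a 'sora' field matched on month."""
    sora = monthly_mean_sora(daily_rates)
    merged = []
    for row in transactions:
        month = to_month_key(row["month"])
        # Months with no SORA observations get None rather than a guessed value.
        merged.append({**row, "month": month, "sora": sora.get(month)})
    return merged
```

Keying every dataset on the same `YYYY-MM` string keeps the join simple; a transaction whose month has no rate data is left with `None` so the gap stays visible downstream.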
Feature Engineering
- Identified MRT stations within a 1 km radius of each HDB flat
- Calculated BTO supply within a 4 km radius of each transaction
- Identified malls within a 1 km radius of each HDB flat
- Measured the distance from each flat to the Central Business District (CBD)
- Created point-of-interest (POI) density vectors using word embeddings
- Applied scaling for numerical features and one-hot encoding for categorical features
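The radius-based features above (MRT stations and malls within 1 km, BTO supply within 4 km, distance to the CBD) all reduce to great-circle distances between coordinate pairs. One possible sketch uses the haversine formula; the function names and the CBD reference point are assumptions for illustration, and the project may compute distances differently.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

# Assumed CBD reference point (approximately Raffles Place); hypothetical choice.
CBD = (1.2840, 103.8510)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_within(flat, pois, radius_km):
    """Count points of interest within radius_km of a flat; both are (lat, lon) tuples."""
    return sum(
        haversine_km(flat[0], flat[1], poi[0], poi[1]) <= radius_km
        for poi in pois
    )


def distance_to_cbd_km(flat):
    """Distance from a flat to the assumed CBD reference point."""
    return haversine_km(flat[0], flat[1], CBD[0], CBD[1])
```

For example, `count_within(flat, mrt_stations, 1.0)` would give the MRT count feature. This brute-force loop is fine for a sketch, but a spatial index would likely be needed at the scale of the full transaction history.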
Model Building & Comparison
XGBoost — Gradient Boosting
Trained on the working dataset and evaluated on 2024 resale prices.
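Holding out 2024 as the evaluation period amounts to a time-based split. A minimal sketch follows, assuming each row carries a `month` field in `YYYY-MM` form and that training uses only transactions before 2024; both are assumptions, as is the use of RMSE as the error metric.

```python
import math


def split_by_year(rows, test_year=2024):
    """Split rows into (train, test): train precedes test_year, test is test_year."""
    train = [r for r in rows if int(r["month"][:4]) < test_year]
    test = [r for r in rows if int(r["month"][:4]) == test_year]
    return train, test


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

Splitting on time rather than at random keeps future transactions out of the training set, which matters when the goal is forecasting.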
Random Forest — Ensemble Learning
Fitted with out-of-bag (OOB) error estimation and validated with 10-fold cross-validation.
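The two evaluation schemes differ in which rows they hold out. The sketch below shows only the index bookkeeping, in plain Python and with hypothetical function names. It is not the project's training code, which would more likely rely on a library implementation.

```python
import random


def bootstrap_with_oob(n_rows, rng):
    """Draw a bootstrap sample with replacement and return (in_bag, out_of_bag) indices."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(in_bag)
    # Rows never drawn for this tree are "out of bag" and can score that tree.
    oob = [i for i in range(n_rows) if i not in chosen]
    return in_bag, oob


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs; each row lands in exactly one validation fold."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, val
```

With a large dataset, roughly 37% of rows fall out of bag for any single tree, so OOB scoring gives a validation estimate without a separate holdout. The 10-fold split instead refits the whole model ten times.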
CatBoost — Gradient Boosting
Trained on cleaned and normalized data with categorical feature support.
GNNWR — Geospatial Model
Feeds latitude and longitude into a neural network that learns the geospatial weighting.
LSTM — Deep Learning
Three-layer neural network with dropout regularization, used to capture temporal dependencies.



