This project develops time series forecasting models that predict resale flat prices in Singapore's HDB (Housing & Development Board) market using transaction data and geographical information.
Project Overview
The notebooks are numbered sequentially (00-15): notebooks 00-10 are in the src directory and 11-15 are in the root directory. The workflow runs from data collection and cleaning, through feature engineering, to model building and evaluation.

Data Cleaning & Processing
- Scraped MRT stations, shopping malls, and primary schools with their coordinates and opening dates
- Cleaned and merged HDB resale flat prices datasets
- Geocoded each HDB flat to latitude/longitude coordinates using the OneMapSG API
- Standardized dates and values across multiple datasets
- Processed SORA (Singapore Overnight Rate Average) dataset for financial context
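The date standardization and SORA steps above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the list of date formats, the field names (`month`, `resale_price`, `sora`), the choice of a monthly mean, and the sample values are all assumptions made for the example.

```python
from datetime import date, datetime

# Hypothetical set of date formats; the real datasets may use others.
_FORMATS = ("%Y-%m", "%Y-%m-%d", "%d %b %Y", "%d/%m/%Y")


def to_month_start(raw: str) -> date:
    """Parse a date string in any listed format and snap it to the first of its month."""
    text = raw.strip()
    for fmt in _FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next format
        return date(parsed.year, parsed.month, 1)
    raise ValueError(f"unrecognised date format: {raw!r}")


def attach_monthly_rate(transactions, rates):
    """Add a 'sora' field to each transaction by matching on calendar month.

    `rates` is a list of (date_string, rate) pairs. When a month has several
    rates, their mean is used (an assumption made for this sketch).
    """
    buckets = {}
    for raw_date, rate in rates:
        buckets.setdefault(to_month_start(raw_date), []).append(rate)
    monthly = {month: sum(vals) / len(vals) for month, vals in buckets.items()}

    merged = []
    for row in transactions:
        month = to_month_start(row["month"])
        # Months with no rate get None rather than a guessed value.
        merged.append({**row, "month": month, "sora": monthly.get(month)})
    return merged


# Illustrative values only, not real data.
transactions = [
    {"month": "2023-05", "resale_price": 500000},
    {"month": "2023-06", "resale_price": 520000},
]
rates = [("02 May 2023", 3.5), ("03 May 2023", 3.7), ("01/06/2023", 3.6)]
merged = attach_monthly_rate(transactions, rates)
```

Snapping every date to the first of its month gives the datasets a shared join key even when one is monthly and the other daily.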
Feature Engineering
- Identified MRT stations within a 1 km radius of each HDB flat
- Calculated BTO (Build-To-Order) supply within a 4 km radius of each transaction
- Identified nearby malls within a 1 km radius
- Measured distance to the Central Business District
- Created point-of-interest (POI) density vectors using word embeddings
- Applied scaling for numerical features and one-hot encoding for categorical features
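The radius-based features above could be computed along these lines. This is a sketch under stated assumptions: it uses the haversine great-circle formula (the notebooks may use a projected coordinate system or GeoPandas instead), the `opened` filter is one possible way to use the scraped opening dates, and every coordinate below, including the CBD reference point, is a made-up placeholder.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def count_within(origin, places, radius_km, as_of=None):
    """Count places within radius_km of origin.

    If as_of is given, places that opened after that date are skipped, so a
    transaction is not credited with amenities that did not exist yet.
    """
    lat, lon = origin
    total = 0
    for place in places:
        if as_of is not None and place["opened"] > as_of:
            continue
        if haversine_km(lat, lon, place["lat"], place["lon"]) <= radius_km:
            total += 1
    return total


# Placeholder coordinates and dates, for illustration only.
flat = (1.3000, 103.8000)
cbd_reference = (1.2840, 103.8510)  # assumed CBD reference point
stations = [
    {"lat": 1.3050, "lon": 103.8000, "opened": date(2010, 1, 1)},  # ~0.56 km away
    {"lat": 1.3000, "lon": 103.8200, "opened": date(2010, 1, 1)},  # ~2.2 km away
    {"lat": 1.3030, "lon": 103.8040, "opened": date(2021, 1, 1)},  # ~0.56 km away
]

mrt_within_1km = count_within(flat, stations, 1.0, as_of=date(2020, 1, 1))
dist_to_cbd_km = haversine_km(*flat, *cbd_reference)
```

The same `count_within` helper covers the 1 km mall count, and a summed variant could serve the 4 km BTO supply feature by swapping the increment for a unit count.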
Model Building & Comparison

[Figures: XGBoost Regressor Performance; Random Forest Using OOB and Cross-Validation; CatBoost Gradient Boosting Analysis]
Several machine learning models were implemented and compared:
- XGBoost Regressor: Trained on the working dataset and evaluated on 2024 resale prices
- Random Forest: Built and validated using both out-of-bag (OOB) estimates and 10-fold cross-validation
- CatBoost: Gradient boosting algorithm trained on cleaned and normalized data
- GNNWR (Geographically Neural Network Weighted Regression): Geospatial model that incorporates latitude and longitude through neural networks
- LSTM: Three-layer neural network with dropout regularization for capturing temporal dependencies
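Evaluating on 2024 resale prices implies a chronological holdout: the model sees only earlier transactions and is scored on the later year. A minimal sketch of that split follows, with RMSE and MAPE as example metrics. The field names, the metrics, and the sample rows are assumptions, and the mean-price predictor is a stand-in baseline rather than any of the models listed above.

```python
from math import sqrt


def split_by_year(rows, test_year):
    """Rows before test_year form the training set; rows in test_year form the test set."""
    train = [r for r in rows if r["year"] < test_year]
    test = [r for r in rows if r["year"] == test_year]
    return train, test


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the prices."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


# Made-up rows for illustration only.
rows = [
    {"year": 2022, "price": 400000.0},
    {"year": 2023, "price": 500000.0},
    {"year": 2023, "price": 600000.0},
    {"year": 2024, "price": 550000.0},
    {"year": 2024, "price": 450000.0},
]

train, test = split_by_year(rows, test_year=2024)

# Stand-in baseline: predict the mean training price for every test row.
baseline = sum(r["price"] for r in train) / len(train)
actual = [r["price"] for r in test]
predicted = [baseline] * len(test)

holdout_rmse = rmse(actual, predicted)
holdout_mape = mape(actual, predicted)
```

Splitting by time rather than at random keeps future prices out of training, which matters for a forecasting task where prices trend over the years.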
Technology Stack
- Python with Pandas, NumPy, and GeoPandas for data manipulation
- Scikit-Learn, PyTorch, and TensorFlow for machine learning models
- XGBoost, CatBoost, and custom neural networks for advanced modeling
- spaCy for natural language processing and word embeddings
- Data visualization libraries for interactive charts and model evaluation