Feature engineering on the Kaggle House Prices data set.
Optimizing and submitting my predictions for the House Prices Kaggle competition.
Earlier this week I started my first machine learning project by tackling the beginner-friendly House Prices Kaggle competition. Kaggle neatly splits the data into a training set and a test set, but I was not prepared for the size of the training set. As a Chemistry major, I'm used to seeing hundreds of observations and testing a handful of features with various chemical instruments. This data set, however, has nearly 1,500 instances and 80 features; I was overwhelmed.
Nonetheless, I have learned quite a bit over the last month of teaching myself data science and machine learning, so I started from the top. The first step is exploratory data analysis: basically, getting a feel for the data. I used a correlation matrix to see how SalePrice relates to the other features, then took the five features with the highest correlation (all numerical) and used them to train a linear regression model. The model turned out to be underfit, and the RMSE was close to 40,000, which is rather large considering the median SalePrice is 160,000.
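The workflow above can be sketched in a few lines. This is a minimal sketch, not my exact notebook: the data here is synthetic (column names like OverallQual and GrLivArea mirror the real data set, but the values are made up), and the fit uses plain least squares via NumPy rather than a particular library's regression class.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle training set (values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, n),
    "GrLivArea": rng.normal(1500, 400, n),
    "GarageCars": rng.integers(0, 4, n),
    "TotalBsmtSF": rng.normal(1000, 300, n),
    "YearBuilt": rng.integers(1900, 2010, n),
})
df["SalePrice"] = (
    20000 * df["OverallQual"] + 50 * df["GrLivArea"]
    + rng.normal(0, 20000, n)
)

# 1. Correlation of each numeric feature with SalePrice.
corr = df.corr(numeric_only=True)["SalePrice"].drop("SalePrice")
top5 = corr.abs().sort_values(ascending=False).head(5).index.tolist()

# 2. Ordinary least squares on the top-correlated features
#    (an intercept column is appended by hand).
X = np.column_stack([df[top5].to_numpy(), np.ones(n)])
y = df["SalePrice"].to_numpy()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# 3. RMSE of the fit on the training data.
rmse = np.sqrt(np.mean((X @ coef - y) ** 2))
print(top5, round(rmse, 1))
```

On the real training set the same three steps produce the ~40,000 RMSE mentioned above; the synthetic numbers here will differ.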
The model's bias is very high, leading to poor predictions, and one quick fix is to increase the…