Predict Housing Prices with Linear Regression using Scikit-Learn
Linear Regression is a Machine Learning approach that makes use of mathematical models to predict continuous numerical outputs.
The goal of this blog is to teach you how to:
train a Linear Regression estimator
evaluate a Linear Regression estimator
decide on which features to work with
train a better Linear Regression estimator
deal with overfitting
In this post, I will make use of a popular data set from Kaggle to show you how you can get started with training your own Linear Regression models using the popular scikit-learn library.
The data I will make use of is provided as part of a Kaggle competition. You can find it here. You will need to create an account before you can download the data. While you are there, be sure you read the data description.
If you have never done any work with scikit-learn and python, you have the following options:
Install anaconda on your personal computer, so you can make use of jupyter notebooks. If you take this approach, you will need to manage package installations on your own.
Make use of Google colaboratory, which is a hosted notebook server. This is an easier way to get started.
If you choose to make use of Google colaboratory, note that it is ephemeral. In order to work with the dataset files, you can upload them into a folder on your Google Drive (you have a Gmail account, right?).
If you host your data in Google Drive, use the following code snippet to mount your drive from colaboratory.
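A commonly used snippet for this is sketched below. The mount point /content/drive is an assumption (it is the conventional choice, not something fixed by this article), and the import is guarded so the cell does nothing harmful if you run it outside Colab.

```python
# google.colab is only importable inside a Colab runtime,
# so guard the import to keep this cell harmless elsewhere.
try:
    from google.colab import drive
except ImportError:
    drive = None

MOUNT_POINT = "/content/drive"  # assumed mount point; change it if you prefer another

if drive is not None:
    # Triggers the authentication prompt, then mounts your Drive
    drive.mount(MOUNT_POINT)
else:
    print("Not running in Colab; skipping the Drive mount.")
```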
When you do that, you will get a prompt that lets you authenticate your access (security is key). Afterward, you can access your drive using
There are various libraries for reading data. One popular library is called pandas. It lets you read data into a structure called a DataFrame, and provides lots of useful methods.
I have my data in a folder called boston on my Google Drive. In order to access my files, I have used the same approach I illustrated above, so I will not repeat the code.
In the following code, I do two things:
import pandas and alias it as pd. This is a popular convention that saves me some typing every time I refer to the library
read the data into a DataFrame which is called train_df. I do this using the read_csv method. I also happen to know that my data contains an index column called ID, so I specify that. You should, at the very least, read up on the pandas read_csv method, to get a clear idea of what it does.
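The two steps above can be sketched as follows. To keep the example runnable without the Kaggle download, it reads from a tiny in-memory CSV whose three rows are made up for illustration; with the real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the Kaggle CSV (values are made up)
csv_text = """ID,rm,medv
1,6.575,24.0
2,6.421,21.6
4,6.998,33.4
"""

# index_col tells pandas to use the ID column as the row index
# instead of treating it as a regular data column
train_df = pd.read_csv(io.StringIO(csv_text), index_col="ID")
print(train_df.head())
```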
In the blog introduction, I mentioned that we would read our data and then split it. However, you can't split what you don't know.
Pandas has a DataFrame method called info that provides a summary of the data. The following image shows how you would make use of the method, and what the output looks like.
The output provides a lot of information, such as the number of rows and columns in your data, as well as how many records have missing values. In this case, there are no missing values (it's not a coincidence, this article isn't about cleaning up data).
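Here is a minimal sketch of that inspection step. It runs on synthetic stand-in data (generated below, not the Kaggle data), so the numbers it prints will differ from the ones in the article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: 200 made-up houses with a room count and a price
rng = np.random.default_rng(0)
rm = rng.uniform(4, 9, size=200)
train_df = pd.DataFrame({
    "rm": rm,
    "medv": 9.0 * rm - 34.0 + rng.normal(0, 5, size=200),
})
train_df.index.name = "ID"

# Prints the row count, the column names, non-null counts, and dtypes
train_df.info()

# A more direct way to count missing values per column
missing_per_column = train_df.isna().sum()
print(missing_per_column)
```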
Now that we have an idea of what we are dealing with, let's proceed to split the data.
If you need a refresher for dealing with data in python, get my free ebook "Working with Data in Python".
Split the Data
We will be splitting the data in a few ways.
We will decide on the columns to work with
We will decide on the rows to work with
There are 14 columns in our data, but one of those is what we want to predict, which is the selling price of the house. It is called medv. This is our target, sometimes called the label.
Of the remaining 13 columns, I will choose one to work with. Why one? Because I want to be lazy at the beginning and not have to think too much!
Which column will I choose? To answer that question, I will ask you to think about how you would describe a house if you wanted to buy or rent one.
Consider the image above. It is from a property website. I cut off the name because they are not paying me to advertise their site. Now, if you look in the middle, they ask the following questions:
where do you want to live?
what type of house are you interested in?
how many rooms do you want?
what is your budget?
I think that every house is described by the number of rooms. The type of house and location are important, and we will get to those at some point.
So, it's decided. I will work with the number of rooms. This is called rm in our data. The column or columns you choose to work with are called features or predictors.
The following two lines of code will extract the two columns I have chosen to work with.
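A sketch of what those two lines might look like, run here on synthetic stand-in data. The names predictors and targets match how the features and labels are referred to later in this article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the Kaggle data)
rng = np.random.default_rng(0)
rm = rng.uniform(4, 9, size=200)
train_df = pd.DataFrame({
    "rm": rm,
    "medv": 9.0 * rm - 34.0 + rng.normal(0, 5, size=200),
})

# Double brackets return a DataFrame (2-D), which is the shape
# scikit-learn expects for features
predictors = train_df[["rm"]]

# Single brackets return a Series (1-D), which is fine for the target
targets = train_df["medv"]
```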
The first split is done. Now, I need to carry out the second split.
In the second split, I need to create a holdout set. In Machine Learning, you do not make use of all of your data for training. Instead, you set aside some data for evaluating your model. This is called the holdout set or the evaluation set.
There are various approaches to doing this, but the key is for this data to be randomly selected. In order to accomplish that, we will make use of a helper function from scikit-learn called train_test_split. The following code snippet performs the split.
You should read the documentation for train_test_split, but here is a summary of what happened:
first, I want to be able to repeat this experiment and get the same result, so I specified a random_state. This could be any number, just make sure it remains the same.
next, I specified the ratio that I want to split my records into. I want 30% of the records reserved for model evaluation, so I set a test_size of 0.3
finally, I specified my features, which I previously called predictors, and my labels, which I previously called targets.
The output is four variables: X_train, X_test, y_train, and y_test. X represents my new features, while y represents my new labels. The suffix tells you what each one is for: train (which I will use for training) or test (which I will use for evaluation).
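The split described above can be sketched like this. The data is synthetic, and the random_state value of 42 is an arbitrary choice for illustration, since any fixed number works.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Kaggle data)
rng = np.random.default_rng(0)
rm = rng.uniform(4, 9, size=200)
train_df = pd.DataFrame({
    "rm": rm,
    "medv": 9.0 * rm - 34.0 + rng.normal(0, 5, size=200),
})
predictors = train_df[["rm"]]
targets = train_df["medv"]

# test_size=0.3 reserves 30% of the rows for evaluation;
# random_state makes the random selection repeatable
X_train, X_test, y_train, y_test = train_test_split(
    predictors, targets, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```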
We are ready to train a Linear Regression estimator.
Train a Linear Regression Estimator
Training a Linear Regression estimator with scikit-learn is so simple that I will place all of the code together and then explain it.
Here is all of the code.
Here is what I did:
I imported a LinearRegression estimator from sklearn.
I created an instance of the estimator
I trained the estimator using the training data that was previously split.
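The three steps above can be sketched as follows, again on synthetic stand-in data. The variable name model is my own choice for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression  # step 1: import the estimator
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Kaggle data)
rng = np.random.default_rng(0)
rm = rng.uniform(4, 9, size=200)
train_df = pd.DataFrame({
    "rm": rm,
    "medv": 9.0 * rm - 34.0 + rng.normal(0, 5, size=200),
})
X_train, X_test, y_train, y_test = train_test_split(
    train_df[["rm"]], train_df["medv"], test_size=0.3, random_state=42
)

model = LinearRegression()   # step 2: create an instance
model.fit(X_train, y_train)  # step 3: train on the training split

# The learned slope and intercept of the fitted line
print(model.coef_, model.intercept_)
```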
Those three lines of code make Machine Learning look deceptively simple, and that is a good thing because we have other things to worry about.
After your model is trained, you will get the following output from the python interpreter:
Now we have a Linear Regression model or estimator, but is it any good?
Let's perform model evaluation to find out.
Evaluate the Estimator
Something that I like to do is to make use of the model for prediction, just to get a sense of what is going on. Estimators have a predict method that you can use.
In the next three lines of code, I will do the following:
use X_test, the evaluation set, to predict the selling prices
create a table (actually a DataFrame) to compare my predictions to the actual selling prices
print out the comparisons, side-by-side
Here is the code below.
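A minimal sketch of those three lines, preceded by the setup they depend on. The data is synthetic, so the observed and predicted values will not match the ones discussed in this article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Setup: synthetic stand-in data and a trained model
rng = np.random.default_rng(0)
rm = rng.uniform(4, 9, size=200)
train_df = pd.DataFrame({
    "rm": rm,
    "medv": 9.0 * rm - 34.0 + rng.normal(0, 5, size=200),
})
X_train, X_test, y_train, y_test = train_test_split(
    train_df[["rm"]], train_df["medv"], test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# The three lines: predict, build a comparison table, print it
predictions = model.predict(X_test)
comparison = pd.DataFrame({"observed": y_test, "predicted": predictions})
print(comparison.head())
```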
And here is the output.
Observed is the actual selling price, while predicted is what the model predicted. Let's analyze this a bit. First, note that prices are in thousands of dollars.
For the building with an ID of 250, the actual selling price was $26,200 while we predicted a selling price of $27,250. This doesn't look too bad, we were only off by $1,050.
However, for the building with ID 19, the actual selling price was $20,200 while we predicted a price of $14,778. We were off by $5,422. That is about 27% of the selling price!
The values we just computed are called loss. To put it in context: how much money would we lose if we were to use this model?
To get a sense of whether or not we should try to make use of this model, we will need to take all of the loss into consideration. This is called model evaluation.
There are various evaluation criteria. Some of these are:
Mean Absolute Error
Mean Logarithmic Error
Mean Squared Error
Root Mean Squared Error
An analysis of the various evaluation criteria is outside the scope of this article, simply because I don't want you to fall asleep before you get to the end of the article. However, note that you should always use the same error function when you compare two models or estimators.
The goal of Machine Learning is to try and find a model with an error that is as close to 0 as possible.
Scikit-learn provides a helper function called mean_squared_error which I will use to evaluate the model.
I will also inspect something called the score. For a scikit-learn regressor, this is the R-squared value (the coefficient of determination). While I want my error to be as close to 0 as possible, I want my score to be as close to 1 as possible.
I will use the following code to evaluate my model.
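Here is a sketch of that evaluation on synthetic stand-in data. The variable names mse, rmse, and score are my own, and the printed numbers will differ from the ones reported in this article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Setup: synthetic stand-in data and a trained model
rng = np.random.default_rng(0)
rm = rng.uniform(4, 9, size=200)
train_df = pd.DataFrame({
    "rm": rm,
    "medv": 9.0 * rm - 34.0 + rng.normal(0, 5, size=200),
})
X_train, X_test, y_train, y_test = train_test_split(
    train_df[["rm"]], train_df["medv"], test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Evaluation on the holdout set
mse = mean_squared_error(y_test, predictions)  # average squared error
rmse = mse ** 0.5                              # back in the units of medv
score = model.score(X_test, y_test)            # R-squared on the holdout set

print(f"score: {score:.4f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}")
```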
The output of this is shown below.
I can deduce the following from the information above:
The model score is 30.7%. It's not a very good model
The Mean Squared Error (MSE) is 53. By taking the square root of that, I arrive at a Root Mean Squared Error (RMSE) of approximately 7.3. Since prices are in thousands of dollars, this means that the model's predictions are typically off by about $7,300.
Okay, we have made some progress. Let's proceed to understand our features and make some decisions about them.
Understanding Linear Regression
In order to understand Linear Regression, let's take a closer look at what I used.
Linear Regression is a mathematical hypothesis. I spared you the details earlier, and I will not introduce the math. However, what I did was I said:
I have a theory that the price of a building is dependent on the number of rooms
Is that theory correct?
In order to find out, we can make use of something called a correlation coefficient. This is a value between -1 and 1. A correlation coefficient between -1 and 0 implies that as the number of rooms increases, the price will decrease. This is called a negative correlation. However, a correlation between 0 and 1 implies that as the number of rooms increases, the price will also increase. This is called a positive correlation.
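To make the two cases concrete, here is a small sketch using the pandas Series.corr method. The five data points in each series are made up purely to show one positive and one negative relationship.

```python
import pandas as pd

# Made-up numbers for illustration only
rooms = pd.Series([4, 5, 6, 7, 8])
price_rising = pd.Series([10, 14, 19, 25, 30])   # goes up with rooms
price_falling = pd.Series([30, 25, 19, 14, 10])  # goes down with rooms

r_positive = rooms.corr(price_rising)   # close to +1
r_negative = rooms.corr(price_falling)  # close to -1

print(r_positive, r_negative)
```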
In order to get a sense of the correlation between the number of rooms and the selling price for this data, I will create a type of visualization called a scatter plot.
These visualizations are frequently done using python libraries, but I will spare you that pain and make use of a free visualization tool called Tableau. You can also accomplish the same result using Microsoft Excel or Google Sheets.
The following image shows what the visualization looks like.
What the visualization shows is that as the number of rooms increases, the median price also increases.
So, how good is the number of rooms at predicting the median price? That is what our Linear Regression model is designed to tell us.
The Linear Regression model is actually what is called a line of best fit. The image below is our visualization with the line of best fit shown in green.
Our Linear Regression model is the equation of that green line. Is that green line reliable?
I could use my eyes to answer that question by saying that the points are kind of close to the line and that the points are positively correlated.
Or I could place my mouse over the line and get some statistical information. Here is the visualization along with the statistical information that is available.
Here is what the new information is telling us:
The way to find the median selling price of a house is to multiply the number of rooms by $8,986 and then subtract $33,537 from the amount.
The R-squared value (the square of the correlation coefficient) is 47.55%, so the model is somewhat useful. Note that the best models would have a value above 70%
The p-value is less than 0.05, so the relationship is considered statistically significant.
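The equation of the line can be written as a tiny function. This sketch uses the slope and intercept quoted above, expressed in thousands of dollars; the function name predict_medv is my own.

```python
def predict_medv(rooms):
    """Predicted median price, in thousands of dollars, from the trend line."""
    return 8.986 * rooms - 33.537


# A house with 6 rooms: 8.986 * 6 - 33.537
price = predict_medv(6)
print(round(price, 3))  # 20.379, i.e. about $20,379
```

Note that the line is only meaningful within the range of room counts seen in the data. For a very small number of rooms the formula would return a negative price.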
Armed with this new information, we can see that working with rooms was a good starting point.
The question now is, if we could do all this with Excel, why do we need Linear Regression?
We are using Linear Regression and training a model because we have more than two columns to work with.
So far, we have worked with only two columns, so we are working with two dimensions of data. That is easy to visualize, and that is what we have done.
However, remember that we have 13 feature columns in our data, plus the target. How do we visualize that many columns? If we chose to work with all of them, we would be working in 14 dimensions. How do we do that?
That is where Machine Learning comes in. It generates an equation for a line that exists in as many dimensions as it needs to.
Okay, enough of the maths.
Should we plot 12 visualizations to determine the correlation between each feature and the median value?
Well, we could. However, I will make use of a pandas method called corr. This will give me the correlation coefficient of every numerical column in my dataset. The following image shows the code and the resulting output.
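A minimal sketch of that call is shown here on synthetic stand-in data. The column feature_b is a hypothetical second feature that I made up so the matrix has more than one row to compare.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: medv rises with rm and falls with feature_b
rng = np.random.default_rng(0)
rm = rng.uniform(4, 9, size=200)
feature_b = rng.uniform(0, 30, size=200)  # hypothetical extra feature
medv = 9.0 * rm - 1.0 * feature_b - 20.0 + rng.normal(0, 5, size=200)
train_df = pd.DataFrame({"rm": rm, "feature_b": feature_b, "medv": medv})

# Pairwise correlation coefficients between all numerical columns
corr_matrix = train_df.corr()
print(corr_matrix)

# The column for the target tells us how each feature relates to it
medv_corr = corr_matrix["medv"].drop("medv")
print(medv_corr)
```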
The result is a matrix that shows you the correlation between every pair of columns in the data. What we are really interested in is the last column, which holds the correlation of each feature with medv. By looking at that column, we can see which values are high and which are low.
If you check the row for rm, you will see that the correlation coefficient is 68.9596%. To save you some time, I have drawn some rectangles around columns with a correlation above 30%. Positive correlations are green, while negative correlations are red.
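For a straight-line fit with a single feature, squaring the correlation coefficient gives the R-squared value. That is how the 68.9596% here relates to the 47.55% reported for the trend line earlier. A quick check of the arithmetic:

```python
r = 0.689596        # correlation between rm and medv, from the matrix
r_squared = r ** 2  # R-squared of a single-feature straight-line fit

print(f"{r_squared:.4f}")  # 0.4755
```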
Before I proceed to write some more code I will ask myself one question:
Is there a linear relationship between the number of rooms and the median selling price?
Why am I asking that particular question?
I know that a relationship exists. The question, however, is whether my line of best fit should be a straight line.
There are different types of lines in mathematics, all represented by different equations.
To answer that question, I will go back to Tableau and plot different trend lines on my scatter plot. I will then consider the correlation coefficient of each line. Without going into any detail, here are the types of lines that Tableau will let me plot:
If you scroll back up you will see that our straight line had a value of 47.55%. I have tried out different lines and found that a polynomial line has a higher value. Here it is.
This is what the line above tells us:
A polynomial of degree 3 has a higher correlation than a straight line
The coefficient is now 57.65%
The hypothesis is statistically significant.
I will now proceed to decide on which features to work with!
Decide on which Features to Work With
Based on some of the exploratory work I have done above, I have decided to work with the following features:
I have also decided to replicate that polynomial line. That means I need to engineer two new features:
I will select the new features using the following code
I will then engineer the new features using the following code
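Here is a sketch of how the two engineered features for a degree 3 polynomial might be built, assuming they are the square and the cube of rm. The column names rm2 and rm3 are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the Kaggle data)
rng = np.random.default_rng(0)
train_df = pd.DataFrame({"rm": rng.uniform(4, 9, size=200)})

# Copy so that adding columns does not modify train_df
predictors = train_df[["rm"]].copy()

# A degree 3 polynomial in rm needs rm squared and rm cubed
predictors["rm2"] = predictors["rm"] ** 2  # hypothetical column name
predictors["rm3"] = predictors["rm"] ** 3  # hypothetical column name

print(predictors.head())
```

The model is still a Linear Regression: it is linear in the new columns, even though the fitted curve is no longer a straight line in rm.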
I can now take a look at what predictors contains. The output is shown below.
Finally, I need to split the data into training and evaluation sets. It's the same code I used previously.
I am now ready to train a new model.
Train a Better Linear Regression Estimator (Hopefully)
Training a model is always the same.
Making predictions is also the same.
Finally, evaluating the model remains the same.
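The three steps can be sketched end to end as follows. This runs on synthetic, deliberately curved stand-in data and uses only rm with its hypothetical polynomial columns rm2 and rm3, so it illustrates the workflow rather than reproducing the results reported in this article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a curved relationship between rm and medv
rng = np.random.default_rng(0)
rm = rng.uniform(4, 9, size=200)
medv = 1.5 * (rm - 4) ** 2 + 10 + rng.normal(0, 3, size=200)

predictors = pd.DataFrame({"rm": rm, "rm2": rm ** 2, "rm3": rm ** 3})
targets = pd.Series(medv, name="medv")

X_train, X_test, y_train, y_test = train_test_split(
    predictors, targets, test_size=0.3, random_state=42
)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

# Evaluate
score = model.score(X_test, y_test)
mse = mean_squared_error(y_test, predictions)
rmse = mse ** 0.5
print(f"score: {score:.4f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}")
```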
And here is what our new outcome looks like.
Is this a better model? I think it is. Note the following:
The score is now 53.46%, up from 30.75%
The MSE is now 35.8, down from 53.26. Our average loss from using this model would be $5,983 down from $7,298
Should we continue improving the model?
It's really up to you. Do you want to keep reading?
I am publishing a free course that gives you access to notebooks, visualizations, and quizzes called Basic Linear Regression. The course is available for pre-enrolment. Please sign up here.
Deal with overfitting
This was meant to be the final section in which we overfit a model and then fix that to arrive at a model with good metrics. However, that would double the length of this article, so I have decided to leave that for another article.
If you are new to working with data in python, or just need a refresher, get the free ebook here.
If you enjoyed reading this and haven't signed up for my newsletter, please do that so you know when the next part is available.