Real estate valuation is at the core of buying and selling properties. Setting the right asking price is paramount to successfully closing a deal; getting that initial number wrong can lead either to drastically underselling one's property or to finding no buyer at all. While subjective parameters such as the perceived value of a neighborhood factor into the final price, a realistic estimate of the market value is essential to ensure a fair negotiation. Finding the true market value of a real estate asset is therefore an essential skill for appraisers, and data-driven models have long been used to support them in this task.
Data-driven real estate valuation in numbers
For a while now, basic regression models have provided a baseline to which expert appraisers add specific knowledge about the local reality surrounding the property. A decade ago, however, the US market was shaken up by the arrival of a new type of prediction offered by Zillow. While basic models achieved only average performance when predicting asset values, Zillow's advanced models, which kept improving by aggregating predictions from multiple models, reached remarkable levels of accuracy, lowering the margin of error to around 5%. Even better, a group of Dutch researchers showed that their automated valuation model, built on "hyper-local" information, outperformed traditional appraisals relying on hedonic models, and did so in much less time: a regular appraisal takes 3-4 weeks and can cost up to $5,000, while a machine learning model produces an estimate in a few milliseconds and essentially for free, provided that the data is available.
Here we show a specific example of how advanced machine learning methods yield more precise estimates compared to basic statistical models.
To train our model, we have access to a dataset of around 25,000 real estate transactions in Russia. We will assess the performance of our model on a separate set of 5,000 transactions that happened after the ones in the first dataset, to mimic a realistic situation.
Each property has up to 290 features, describing both the property itself and its neighborhood. Here are a few examples:
| Feature | Description |
| --- | --- |
| `num_room` | Number of living rooms |
| `oil_chemistry_raion` | Presence of dirty industries in the neighborhood |
| `7_14_male` | Male population aged 7-14 in the neighborhood |
| `sport_count_1000` | Number of sport facilities within 1,000 meters |
Some properties have incomplete information, and we need to take this into account when building our model. To avoid discarding informative features, we impute each missing value with the median of that feature computed over all the other properties.
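As a minimal sketch of this imputation step, here is how it could look with pandas on a small toy frame (the column names echo the features above; the values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the transactions table; two features have gaps.
df = pd.DataFrame({
    "num_room": [1, 2, None, 3],
    "sport_count_1000": [5, None, 2, 7],
})

# Replace each missing value with the median of its column,
# computed from the properties where the value is observed.
df_imputed = df.fillna(df.median(numeric_only=True))

print(df_imputed)
```

The median is a common choice here because, unlike the mean, it is robust to the extreme values that real estate features often contain.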
Finally, the value we want to predict is the transaction price. To measure the accuracy of our predictions, we evaluate the mean squared logarithmic error (MSLE) between the predicted and actual values. We work with the logarithm of the price because prices span several orders of magnitude.
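Concretely, the metric averages the squared differences between log-transformed predictions and targets. A small sketch (using `log1p`, i.e. log(1 + x), a standard convention for this metric):

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared logarithmic error: the log compresses the wide
    range of prices, so cheap and expensive properties weigh similarly."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)

# Being 10% off costs the same whether the property is worth
# 100,000 or 5,000,000 -- that is the point of the log scale.
print(msle([100_000, 5_000_000], [110_000, 4_500_000]))
```

On the log scale, over- and under-predicting by the same *factor* incur comparable penalties, which matches how pricing errors are perceived in practice.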
We start with a simple regression model fitted to all the entries. Linear regression finds the straight line that approximates the data points as well as possible; it is limited in that it cannot capture non-linearities in the data set. An example of linear regression can be seen in the figure.
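Such a baseline takes only a few lines with scikit-learn. The sketch below uses synthetic data (a hypothetical price that grows linearly with floor area), not the actual Russian dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: price grows roughly linearly with floor area.
area = rng.uniform(30, 150, size=(200, 1))
price = 3_000 * area[:, 0] + 50_000 + rng.normal(0, 20_000, size=200)

# Fit the best straight line: price ~ coef * area + intercept.
model = LinearRegression().fit(area, price)

print(model.coef_[0], model.intercept_)
```

With a single feature this recovers the slope and intercept of the line; with the 290 features of our dataset it fits a hyperplane instead, but the principle is the same.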
To increase the precision of our prediction, we will use an ensemble model: we train several machine learning models and combine their predicted outputs. We rely on gradient boosting, a method that trains multiple decision trees and optimizes the overall prediction using a gradient-based method.
Decision trees are a simple way of splitting a dataset based on the observed features, one at a time; an example is given in the figure. You can imagine inputting the property (data point) at the top, following the path determined by its features, and reading off a prediction at the bottom, similarly to what is depicted in the figure on the left. Machine learning algorithms build these trees by looking at a large number of data points. Decision trees alone are valuable because they are easy to interpret, but they only provide rough estimates and are not suitable for precise regression with many features. Gradient boosting creates many of these weak predictors and combines their outputs.
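The gain from boosting over a single straight line can be demonstrated on synthetic data. The sketch below (again a toy example, not our dataset) fits both models to a non-linear relationship and compares their errors on held-out points:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# A non-linear relation that no straight line can capture.
X = rng.uniform(0, 10, size=(500, 1))
y = 3 * np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

linear = LinearRegression().fit(X, y)
# An ensemble of 200 shallow trees, each correcting its predecessors.
boosted = GradientBoostingRegressor(n_estimators=200, max_depth=3).fit(X, y)

# Evaluate on fresh points from the same curve.
X_test = rng.uniform(0, 10, size=(200, 1))
y_test = 3 * np.sin(X_test[:, 0])

lin_mse = np.mean((linear.predict(X_test) - y_test) ** 2)
gbr_mse = np.mean((boosted.predict(X_test) - y_test) ** 2)
print(lin_mse, gbr_mse)  # the boosted trees fit the curve far better
```

Each new tree in the ensemble is fitted to the residual errors of the trees before it, which is why the combined prediction keeps improving where a single weak tree could not.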
If you would like more details on gradient boosting, please have a look here.
Linear regression achieves a mean squared logarithmic error of 0.271. The model also produced some negative price predictions, indicating that it does not work well for all possible properties.
Our ensemble model generated by gradient boosting improved the overall performance of our predictions by 39.6%, reducing the mean squared logarithmic error to 0.164.
In just a couple of hours, we managed to create an ensemble model that aggregates predictions from numerous smaller models and improved our prediction accuracy by 39.6%. More advanced approaches combine many more models to push accuracy even further. One could also include macroeconomic indicators in the model to account for general market trends at a specific time.
Do you have a project idea or simply want to find out how Machine Learning can benefit your business? Book a free 15-minute consultation with Visium here.