Predicting House Prices in Ames, Iowa
Introduction
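Predicting house prices is a useful analysis in many ways, for consumers and real estate professionals alike. How do you know whether your house sold for a “good price”? Which aspects of your house are most directly related to its price? These are the questions we set out to answer, while also uncovering the relationships among the various house features.
We took two different approaches to predicting the sale price:
- Systematically analyzing each column by data type, checking for correlations and multicollinearity, and running multiple penalized linear regression models.
- Using various tree-based models to capture nonlinear relationships, which produced high-accuracy models along with feature importances.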
The code can be found here.
Data Description
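The dataset comes from a Kaggle competition. It contains 79 features of houses sold in Ames, Iowa between 2006 and 2010, with 1,460 houses in the training set and 1,459 houses in the test set.
The continuous variables mainly describe the areas of different parts of the property, along with ages and the number of rooms of different types (bathrooms, bedrooms, etc.).
The ordinal categorical variables describe the quality and condition of parts of the property: the house overall, the garage, the basement, and the veneer of the house.
The nominal categorical variables describe things such as materials used, property type, binary features such as whether or not something is “finished” or whether there is central air, and nearby landmarks such as parks or railroads.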
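Exploratory Data Analysis and Feature Selection
We want to avoid the “curse of dimensionality”: as more features are added, the data become sparser and more spread out across the feature space, so drawing statistically significant conclusions becomes increasingly difficult. Therefore, we need a way to select the variables that matter most for the sale price and to remove or combine any redundant, multicollinear, or unimportant features.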
Missingness
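The above plot shows the missingness of the variables in the dataset. For most features, a value is missing because the property does not have that feature. For example, if a house has no basement or garage, then all of the basement- or garage-related features are missing. A small number of features appear to be missing more randomly. The red vertical line at 40% marks our cutoff for removing variables with too many missing values. We used this criterion to eliminate the pool quality, miscellaneous feature, alley, fence type, and fireplace quality features.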
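The 40% cutoff described above could be applied roughly as follows. This is a minimal sketch using pandas, not the authors' script; the helper name `drop_sparse_columns`, the column names, and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, cutoff: float = 0.40) -> pd.DataFrame:
    """Keep only the columns whose share of missing values is at most `cutoff`."""
    missing_fraction = df.isna().mean()  # per-column fraction of NaN values
    keep = missing_fraction[missing_fraction <= cutoff].index
    return df[keep]


# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],         # 0% missing
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, "Gd"],    # 80% missing
    "GarageArea": [548, 460, np.nan, 642, 836],          # 20% missing
})

reduced = drop_sparse_columns(toy)
print(list(reduced.columns))
```

Only the column with 80% missing values is removed here; the column with 20% missing values survives and is left for imputation.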
Imbalanced classes
As shown below, we found that many categorical features are dominated by a single class; we ran a script to check for these. If most of the dataset falls into one category of a feature, our model may have trouble picking up on the intricacies of the target variable (sale price). If 70% or more of observations were assigned to the same class of a given feature, we chose to omit that feature from the analysis.
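A check of this kind might look like the sketch below. It is an illustration under stated assumptions rather than the script we ran: the function names, the toy columns, and the choice to treat exactly 70% as imbalanced are all hypothetical.

```python
import pandas as pd


def dominant_class_share(column: pd.Series) -> float:
    """Share of observations that fall into the most common class."""
    # value_counts sorts by frequency, so the first entry is the dominant class.
    return float(column.value_counts(normalize=True, dropna=False).iloc[0])


def imbalanced_columns(df: pd.DataFrame, columns, threshold: float = 0.70):
    """List the columns whose dominant class covers at least `threshold` of rows."""
    return [c for c in columns if dominant_class_share(df[c]) >= threshold]


# Toy data: "Street" is dominated by one class, "HouseStyle" is not.
toy = pd.DataFrame({
    "Street": ["Pave"] * 9 + ["Grvl"],
    "HouseStyle": ["1Story"] * 4 + ["2Story"] * 3 + ["SLvl"] * 3,
})

flagged = imbalanced_columns(toy, ["Street", "HouseStyle"])
print(flagged)
```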
Skewness


Above are histograms of the target variable, sale price. The sale price is clearly right-skewed (left). To get a more normal distribution, we applied a log transform to the sale price (right), which allows a better fit with regression techniques. The following plot shows the skewness of the variables in the dataset. The two red vertical lines mark skewness values of -1 and 1, respectively. Note that the variables related to porch area and square footage appear to be highly skewed. We later reduce this skewness through some feature engineering techniques. The target variable (sale price) is also noticeably skewed; the log transform described above, however, greatly reduces that skewness.
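The effect of the log transform on skewness can be illustrated with synthetic data. The sketch below assumes log-normally distributed prices, which is only a stand-in for the real sale prices; the helper `highly_skewed` and its limit of 1 mirror the red lines in the plot but are otherwise hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (log-normal), not the actual dataset.
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=2000))
log_price = np.log(sale_price)

skew_before = sale_price.skew()  # strongly positive for right-skewed data
skew_after = log_price.skew()    # close to zero after the log transform


def highly_skewed(df: pd.DataFrame, limit: float = 1.0):
    """Names of numeric columns whose skewness lies outside [-limit, limit]."""
    skew = df.skew(numeric_only=True)
    return skew[skew.abs() > limit].index.tolist()


frame = pd.DataFrame({"price": sale_price, "log_price": log_price})
print(highly_skewed(frame))
```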


Correlation and Multicollinearity
Some of the continuous and ordinal categorical variables are highly correlated with the sale price. The above heatmap shows these correlations, with darker blue indicating a higher correlation. Sale price is in the first row and first column of the chart, and one can see that it is most highly correlated with the overall quality of the house and the ground living area. These two features are clearly important in determining the sale price. Correlations between features are also shown: the number of cars the garage can hold and the area of the garage are both fairly highly correlated with sale price, with correlation coefficients of 0.64 and 0.62, respectively, but they are also very highly correlated with each other, with a correlation of 0.88. This makes sense, as a bigger garage should be able to fit more cars. Therefore, one of these variables may be redundant and unnecessary. Similarly, the ground living area is highly correlated with the number of rooms above ground level. To help remove multicollinear variables, we also ran Lasso and ridge regressions on the data.
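One way to surface redundant pairs such as garage capacity and garage area is to scan the correlation matrix for large off-diagonal entries. The following is a sketch on synthetic data; the function name, the 0.8 threshold, and the generated columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, |corr|) for each pair with |corr| above `threshold`."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            value = float(corr.iloc[i, j])
            if value > threshold:
                pairs.append((cols[i], cols[j], round(value, 2)))
    return pairs


rng = np.random.default_rng(0)
garage_cars = rng.integers(0, 4, size=200)
toy = pd.DataFrame({
    "garage_cars": garage_cars,
    # Area grows with capacity plus noise, so the two are strongly correlated.
    "garage_area": 250 * garage_cars + rng.normal(0, 60, size=200),
    # Unrelated feature, generated independently.
    "lot_area": rng.normal(10000, 2000, size=200),
})

pairs = correlated_pairs(toy)
print(pairs)
```

For each flagged pair, one of the two variables is a candidate for removal or for merging into an aggregate feature.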
After the above analysis, the remaining variables are: lot area, house style, overall quality, overall condition, year built, masonry veneer area, exterior quality, heating quality, number of bedrooms above ground, kitchen quality, total rooms above ground, number of fireplaces, garage area, month sold, year sold.
Before moving on to feature engineering, we would like to take a look at the variables that are highly correlated with sale price.


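The plot above shows sale price against ground living area. We notice a strong positive relationship between these two variables: in general, the larger the living area, the higher the sale price. Nevertheless, two observations in the lower right corner, with large living areas but extremely low sale prices, are outliers, and we removed them before fitting our models.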


The plot above shows a box plot of sale price for each overall quality level. We can see an obvious increasing trend between overall quality and sale price. On the other hand, there seems to be more variation in the sale price at higher quality levels.




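The plots above indicate that the year and month of sale could make a difference in the sale price. For example, the financial crisis occurred around 2008, and sale prices rose slightly around that time. From the plot of monthly sale price, houses sold in spring tend to have a lower sale price. Because of this difference, we treated year and month as categorical variables and dummified them before fitting the models.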
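Dummifying the year and month of sale can be done with `pandas.get_dummies`. This is a small sketch with made-up rows; the column names `YrSold`, `MoSold`, and `GrLivArea` are assumed here and may differ from the names used in our code.

```python
import pandas as pd

# Made-up rows; only the year and month columns are treated as categorical.
sales = pd.DataFrame({
    "YrSold": [2008, 2009, 2008],
    "MoSold": [3, 7, 12],
    "GrLivArea": [1500, 1710, 1262],
})

# Each distinct year and month becomes its own 0/1 indicator column,
# so the model does not assume a linear trend across years or months.
dummified = pd.get_dummies(sales, columns=["YrSold", "MoSold"])
print(sorted(dummified.columns))
```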
Feature Engineering and Data Imputation
The aggregated variables include:
“Total area” is the sum of the ground living area and the basement area. The plot below shows that sale price is positively correlated with total square footage (“total area”). In fact, the correlation between total square footage and sale price is close to 0.8, which is higher than that of any of the individual area-related variables.


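“Total porch area” is the sum of the areas of any porches or decks. As above, the plot below shows a positive correlation between total porch area and the target variable, sale price. The correlation is around 0.4, which is relatively low; however, it is higher than that of any individual porch area variable.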


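“Half baths” and “full baths” were each combined with the corresponding basement half/full baths, since there was no difference in their correlation with sale price; this reduced the number of columns. The plot below shows the relationship between the total number of bathrooms (the newly created variable) and sale price.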
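The three aggregations described above could be sketched as follows. The column names are assumed to follow the Kaggle data dictionary, and the exact set of porch and deck columns we summed is not spelled out in this post, so treat the list in `porch_cols` as a hypothetical choice.

```python
import pandas as pd


def add_aggregates(df: pd.DataFrame) -> pd.DataFrame:
    """Add total area and total porch area, and fold basement baths into baths."""
    out = df.copy()
    out["TotalArea"] = out["GrLivArea"] + out["TotalBsmtSF"]
    # Assumed list of porch/deck area columns.
    porch_cols = ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]
    out["TotalPorchArea"] = out[porch_cols].sum(axis=1)
    # Combine above-ground and basement baths, then drop the basement columns.
    out["FullBath"] = out["FullBath"] + out["BsmtFullBath"]
    out["HalfBath"] = out["HalfBath"] + out["BsmtHalfBath"]
    return out.drop(columns=["BsmtFullBath", "BsmtHalfBath"])


# One made-up house for illustration.
example = pd.DataFrame({
    "GrLivArea": [1710], "TotalBsmtSF": [856],
    "WoodDeckSF": [0], "OpenPorchSF": [61], "EnclosedPorch": [0],
    "3SsnPorch": [0], "ScreenPorch": [0],
    "FullBath": [2], "BsmtFullBath": [1],
    "HalfBath": [1], "BsmtHalfBath": [0],
})

result = add_aggregates(example)
print(result[["TotalArea", "TotalPorchArea", "FullBath", "HalfBath"]])
```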


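Before fitting models, the last step is to impute the remaining variables with missing values. Houses with missing basement values were assumed to have no basement, and therefore no basement bathrooms and no basement area. Similarly, the missing garage areas and masonry veneer areas were imputed as 0. Missing kitchen quality values were imputed as “typical/average”, since this is both the median and the mode of this feature.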
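These imputation rules might be written as the sketch below. The column names and the `"TA"` code for “typical/average” are assumed to follow the Kaggle data dictionary, and the helper name is hypothetical.

```python
import numpy as np
import pandas as pd


def impute_remaining(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing areas and counts with 0 and missing kitchen quality with 'TA'."""
    out = df.copy()
    # A missing value here is read as "the house does not have this feature".
    zero_cols = ["TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath", "GarageArea", "MasVnrArea"]
    out[zero_cols] = out[zero_cols].fillna(0)
    # "TA" (typical/average) is assumed to be the most common kitchen quality code.
    out["KitchenQual"] = out["KitchenQual"].fillna("TA")
    return out


# Two made-up houses; the second has every value missing.
toy = pd.DataFrame({
    "TotalBsmtSF": [856.0, np.nan],
    "BsmtFullBath": [1.0, np.nan],
    "BsmtHalfBath": [0.0, np.nan],
    "GarageArea": [548.0, np.nan],
    "MasVnrArea": [196.0, np.nan],
    "KitchenQual": ["Gd", np.nan],
})

filled = impute_remaining(toy)
print(filled)
```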
Models
Elastic Net
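After running a 5-fold cross-validation on the elastic net hyperparameters, we found the best fit to be with a penalization parameter (alpha) of 0.01 and an L1 (Lasso) ratio of 0.01. The feature coefficients are shown in the plot below. While the feature coefficients are not directly interpretable as in an unpenalized regression, they do indicate the importance of each feature in determining the sale price. The plot shows the magnitude of each coefficient, with the sign given by color. The dummy variables for heating quality, kitchen quality, exterior quality, and house style are given by the variables beginning with “HQC”, “KQ”, “EQ”, and “HS”, respectively. We can see that the most important features are the total square footage, the year built, the overall quality, the overall condition, and the garage area. This is fairly easy to understand: a large, recently built house with a big garage, in good quality and condition, should command a higher price.
To further reduce the number of features, we looked at how quickly each coefficient dropped to zero as lambda increased across multiple Lasso regressions. We then performed a grid search on the elastic net. This allowed us to search over both the mix of Ridge and Lasso penalties (the rho hyperparameter) and how much penalty to add (the lambda parameter).
This gave us a training error of 0.1019 and a testing error of 0.1395.
Although the test error is somewhat higher than the training error, which may point to overfitting, the process we used is very transparent and can be tuned or changed without losing the overall methodology. We would choose this model to give to a client that doesn’t want a black-box approach.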
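The grid search over the elastic net hyperparameters could be set up along these lines. This is a sketch with scikit-learn on synthetic data, not our actual code: `alpha` plays the role of the penalty strength (lambda), `l1_ratio` plays the role of the Lasso/Ridge mix (rho), and the grid values and RMSE scoring are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the housing features.
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# Illustrative grid: penalty strength and Lasso/Ridge mix.
param_grid = {
    "alpha": [0.001, 0.01, 0.1, 1.0],
    "l1_ratio": [0.01, 0.1, 0.5, 0.9],
}

search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid,
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_
print(best_params, best_rmse)
```

After fitting, `search.best_estimator_.coef_` holds the penalized coefficients, which is what a coefficient plot like the one above would be drawn from.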
Tree-Based Models
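In addition to the regularized multiple linear regression, we also fit several tree-based models: random forest, gradient boost, and extreme gradient (XG) boost. The feature importances plotted above are for the models that use the feature selection and engineering described above. The black dashed line marks a feature importance of 0.05. Both the random forest and gradient-boosted models heavily favor house style and garage area: the random forest has only those two features with importance greater than 0.05, and the gradient-boosted model additionally has only the overall house condition above 0.05, and just barely. The XG boost model spreads importance across more variables, with six variables above 0.05: house style, number of half baths, garage area, total area, overall condition, and masonry veneer area. For all of these models, this raises questions: why do the random forest and gradient-boosted models favor house style and garage area over the size of the house, and why does XG boost favor half baths over full baths? This may be because if two variables have a high correlation or high contingency, tree-based methods may pick either of them without distinction. House style is likely closely related to the size of the house; for example, a 2-story house should be larger than a 1-story house.
While tree-based models are less interpretable than linear models, they do have one significant advantage: categorical features do not have to be converted to dummy variables and can simply be label encoded. That keeps the dimensionality low and the feature space smaller and therefore easier to fit. Therefore, we added back in most of the variables and reperformed the grid searches. Below are the importances of the 30 most important features in the XG boost model. Here, we see that random forest and gradient boost again favor two features, but this time they are different features: lot area and exterior quality. Meanwhile, the most important features for XG boost are overall condition, lot area, overall quality, fence type, exterior quality, and garage quality.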
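Label encoding a categorical feature and reading off tree-based feature importances can be sketched as below. The data are synthetic (the target depends only on area, by construction), a scikit-learn random forest stands in for the boosted models, and the column names and hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300

# Synthetic houses: one categorical and one continuous feature.
houses = pd.DataFrame({
    "HouseStyle": rng.choice(["1Story", "2Story", "SLvl"], size=n),
    "TotalArea": rng.normal(2500, 600, size=n),
})
# By construction, the log price depends on area only.
log_price = 11.0 + 0.0004 * houses["TotalArea"] + rng.normal(0, 0.05, size=n)

# Label encoding: each category becomes one integer code in a single column,
# instead of one dummy column per category.
houses["HouseStyle"] = houses["HouseStyle"].astype("category").cat.codes

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(houses, log_price)

importances = pd.Series(forest.feature_importances_, index=houses.columns)
importances = importances.sort_values(ascending=False)
above_cutoff = importances[importances > 0.05]  # same 0.05 line as in the plots
print(importances)
```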
Conclusions
Below is a table of the training and test (Kaggle) errors for each model. The “v2” indicates fits performed on the dataset with features added back in. The random forest model fit on the more aggressively selected features had the highest training error and one of the highest test errors. However, that fit also appears to have the lowest amount of overfitting, with a difference in errors of 1.85%, while the rest overfit by at least 3%. Adding more features lowered the training error in all of the tree-based models, and lowered the test error for the gradient-boosted and XG boost models. The lowest overall training and test errors were the result of using XG boost. This model partitioned the feature importance more equally over the variables, unlike random forest and gradient boost, which may have contributed to its better overall fit.
Model | MLR | RF | Grad | XG | RF - v2 | Grad - v2 | XG - v2 |
Training Error | 0.1019 | 0.1253 | 0.1053 | 0.1024 | 0.1145 | 0.0918 | 0.0881 |
Test Error | 0.1395 | 0.1438 | 0.1353 | 0.1342 | 0.1452 | 0.1340 | 0.1247 |
Overfitting | 0.0376 | 0.0185 | 0.0300 | 0.0318 | 0.0307 | 0.0422 | 0.0366 |
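The “Overfitting” row is simply the test error minus the training error for each model. As a quick check, the values in the table can be reproduced with a few lines of Python (the variable names are illustrative):

```python
# (training error, test error) pairs copied from the table above.
errors = {
    "MLR": (0.1019, 0.1395),
    "RF": (0.1253, 0.1438),
    "Grad": (0.1053, 0.1353),
    "XG": (0.1024, 0.1342),
    "RF - v2": (0.1145, 0.1452),
    "Grad - v2": (0.0918, 0.1340),
    "XG - v2": (0.0881, 0.1247),
}

# Overfitting gap = test error - training error, rounded to 4 decimals.
gaps = {model: round(test - train, 4) for model, (train, test) in errors.items()}

least_overfit = min(gaps, key=gaps.get)
lowest_test_error = min(errors, key=lambda model: errors[model][1])
print(gaps)
print(least_overfit, lowest_test_error)
```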
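On the reduced dataset, the multiple linear regression performed similarly to the boosted models. This indicates the overall linearity of the features with the log transform of the sale price. While adding more features created an overall better fit, the main advantage of the linear regression is its interpretability: it is easy to see the relationship between these features and the sale price. For anyone looking to buy or sell a house, the most important features are overall size, quality, and condition, along with the house style, year built, lot size, and garage size.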