Predicting House Prices in Ames, Iowa

Introduction

Predicting house prices is a useful analysis in many ways, for both consumers and real estate professionals. How do you know whether your house sold for a "good price"? Which aspects of your house relate most directly to its price? These are the questions we wanted to answer while uncovering the relationships among the various house features.

We took two different approaches to predicting the sale price:

  1. A systematic analysis of each column by data type, checking for correlation and multicollinearity, and running multiple penalized linear regression models.
  2. Using various tree-based models to capture non-linear relationships, producing high-accuracy models with feature importances.

The code can be found here.

Data Description

The dataset comes from a Kaggle competition. It contains 79 features of houses sold in Ames, Iowa between 2006 and 2010, with 1,460 houses in the training set and 1,459 houses in the test set.

The continuous variables mainly describe the areas of different parts of the property, along with ages and the counts of different categories of rooms (bathroom, bedroom, etc.).

The ordinal categorical variables describe the quality and condition of the property: overall, the garage, the basement, and the veneer of the house.

The nominal categorical variables describe things such as materials used, property type, binary features such as whether or not something is "finished" or whether there is central air, and nearby landmarks such as parks or railroads.

Exploratory Data Analysis and Feature Selection

We want to avoid the "curse of dimensionality": as more features are added to the data, the data become sparser and more spread out across the features, which makes drawing statistically significant conclusions increasingly difficult. Therefore, we needed a way to select the variables most important to the sale price and to remove or combine any redundant, multicollinear, or unimportant features.

Missingness

The above plot shows the missingness of the variables in the dataset. For most features, missingness occurs because the property simply does not have that feature. For example, if a house has no basement or garage, then all of the basement- or garage-related features are missing. A small number of features appear to be missing more randomly. The red vertical line at 40% marks the cutoff we used for removing variables with too many missing values. Using this criterion, we eliminated the pool quality, miscellaneous feature, alley, fence type, and fireplace quality features.
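A rough sketch of how such a cutoff could be applied in Python (the DataFrame name `train` and the CSV path are assumptions for illustration, not the original code):

```python
import pandas as pd

# Hypothetical loading step; the actual file name/path is an assumption.
train = pd.read_csv("train.csv")

# Fraction of missing values in each column, largest first.
missing_frac = train.isnull().mean().sort_values(ascending=False)

# Drop any feature that is missing in more than 40% of the observations.
too_missing = missing_frac[missing_frac > 0.40].index
train = train.drop(columns=too_missing)
print(list(too_missing))
```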

Imbalanced classes

As shown below, we found that many categorical features are described almost entirely by a single class, so we ran a script to check for these. If most of the dataset belongs to one class, our model may have trouble picking up on the intricacies of the target variable (sale price). If 70% of observations were assigned to the same class of a given feature, we chose to omit that feature from the analysis.
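A minimal sketch of such a check, assuming `train` is the training DataFrame from above (the 70% threshold follows the text; the exact script differed):

```python
# Flag categorical features where a single class covers at least 70% of the rows.
cat_cols = train.select_dtypes(include="object").columns
dominated = [
    col for col in cat_cols
    if train[col].value_counts(normalize=True, dropna=False).iloc[0] >= 0.70
]

# Omit these features from the analysis.
train = train.drop(columns=dominated)
```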

Skewness

Above is a histogram of the target variable, sale price. The sale price is clearly right-skewed (left). To get a more normal distribution, we applied a log transform to the sale price (right). This allows a better fit with regression techniques. The following plot shows the skewness of the variables in the dataset. Two red vertical lines indicate skewness of -1 and 1, respectively. We can see that variables related to porch area and square footage appear to be highly skewed. We later reduce this skewness through some feature engineering techniques. The target variable (sale price) also appears to have a skewness problem, but the log transform described above greatly reduces it.
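A sketch of the skewness check and the log transform (again assuming the `train` DataFrame; `SalePrice` is the Kaggle column name for the target):

```python
import numpy as np

# Skewness of each numeric column; values outside [-1, 1] are flagged as highly skewed.
numeric_cols = train.select_dtypes(include=np.number).columns
skews = train[numeric_cols].skew().sort_values(ascending=False)
print(skews[skews.abs() > 1])

# Log-transform the target so its distribution is closer to normal.
train["SalePrice"] = np.log1p(train["SalePrice"])
```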

Correlation and Multicollinearity

Some of the continuous variables and ordinal categorical variables have a high correlation with the sale price. The above heatmap shows these correlations, with darker blue indicating a higher correlation. The sale price is in the first row and first column of the chart, and one can see that it is most highly correlated with the overall quality of the house and the ground living area. These two features are obviously important in determining the sale price. Correlations between features are also shown: the number of cars the garage can hold and the garage area are each fairly highly correlated with the sale price, with correlation coefficients of 0.64 and 0.62, respectively, but they are also very highly correlated with each other, with a correlation of 0.88. This makes sense, as a bigger garage should be able to fit more cars. Therefore, one of these variables may be redundant and unnecessary for the data. Similarly, the ground living area is highly correlated with the number of rooms above ground level. In order to remove multicollinear variables, we also ran Lasso and ridge regressions on the data.
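The heatmap itself can be reproduced with a few lines like the following (a sketch using seaborn; the original plot's exact styling is unknown):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations of the numeric features, including the (log) sale price.
corr = train.select_dtypes(include=np.number).corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="Blues", square=True)
plt.show()

# Features most correlated with the sale price.
print(corr["SalePrice"].sort_values(ascending=False).head(10))
```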

After the above analysis, the remaining variables were: lot area, house style, overall quality, overall condition, year built, masonry veneer area, exterior quality, heating quality, number of bedrooms above ground, kitchen quality, total rooms above ground, number of fireplaces, garage area, month sold, and year sold.

Before moving on to feature engineering, we would like to take a look at the variables that have a high correlation with the sale price.

The plot above shows the sale price against the ground living area. We notice a strong positive relationship between these two variables: in general, the larger the living area, the higher the sale price. Nevertheless, the two observations in the bottom right corner, with large living areas but extremely low sale prices, are outliers, and we removed them before fitting our models.

The plot above shows boxplots of the sale price at each overall quality level. We can see an obvious increasing trend between overall quality and sale price. On the other hand, there seems to be more variation in the sale price at higher quality levels.

The plots above indicate that the year and month of sale could make a difference in the sale price. For example, there is a small bump in sale prices around the 2008 financial crisis. From the plot of monthly sale prices, houses sold in spring tend to have a lower sale price. Because of these differences, we treated year and month as categorical variables and dummified them before fitting the models.
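A small sketch of that dummification step (`MoSold` and `YrSold` are the Kaggle column names for month and year sold; whether `drop_first` was used originally is an assumption):

```python
import pandas as pd

# Treat sale month and year as categories rather than numbers, then dummify them.
train["MoSold"] = train["MoSold"].astype(str)
train["YrSold"] = train["YrSold"].astype(str)
train = pd.get_dummies(train, columns=["MoSold", "YrSold"], drop_first=True)
```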

Feature Engineering and Data Imputation

The aggregated variables include:

"Total area" is the sum of the ground living area and the basement area. The plot below shows a positive relationship between the sale price and the total square footage ("total area"). As a matter of fact, the correlation between total square footage and sale price is close to 0.8, which is higher than that of any of the individual area-related variables.

"Total porch area" is the sum of the areas of any porches or decks. As above, the plot below shows a positive relationship between the total porch area and the target variable, sale price. Although the correlation, around 0.4, is relatively low, it is still higher than that of any individual porch-area variable.

"Half bath" and "full bath" combine the above-ground and basement half/full bathrooms, since there was no difference in their correlations with the sale price; this reduces the number of columns. The plot below shows the relationship between the total number of bathrooms (the newly created variable) and the sale price.
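A sketch of how these aggregated columns could be built (column names follow the Kaggle data dictionary; the new variable names are illustrative):

```python
# "Total area": ground living area plus basement area.
train["TotalArea"] = train["GrLivArea"] + train["TotalBsmtSF"]

# "Total porch area": sum of all porch and deck areas.
train["TotalPorchArea"] = (
    train["OpenPorchSF"] + train["EnclosedPorch"] + train["3SsnPorch"]
    + train["ScreenPorch"] + train["WoodDeckSF"]
)

# Combine above-ground and basement half/full baths into single columns.
train["HalfBath"] = train["HalfBath"] + train["BsmtHalfBath"]
train["FullBath"] = train["FullBath"] + train["BsmtFullBath"]
```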

Before fitting models, the last step is to impute the remaining variables with missing values. Houses with missing basement information were assumed to have no basement, and therefore no basement bathrooms and zero basement area. Similarly, the missing garage areas and masonry veneer areas were imputed as 0. Missing kitchen quality values were imputed as "Typical/Average," since that is both the median and the mode of this feature.
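A minimal sketch of this imputation, assuming the Kaggle column names (`MasVnrArea`, `GarageArea`, `KitchenQual`, and the basement columns):

```python
# Missing basement, garage, and veneer areas mean the feature is absent, so fill with 0.
zero_fill = ["TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath", "GarageArea", "MasVnrArea"]
train[zero_fill] = train[zero_fill].fillna(0)

# Kitchen quality: fill with "TA" (Typical/Average), the median and mode of the column.
train["KitchenQual"] = train["KitchenQual"].fillna("TA")
```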

Models

Elastic Net

After running a 5-fold cross-validation on the elastic net hyperparameters, we found the best fit to be with a penalization parameter (alpha) of 0.01 and a lasso/Manhattan ratio of 0.01. The feature coefficients are shown in the plot below. While the feature coefficients are not directly interpretable as in an unpenalized regression, they still indicate the importance of each feature in determining the sale price. The plot shows the magnitude of each coefficient, with the sign given by color. The dummy variables for heating quality, kitchen quality, exterior quality, and house style are given by the variables beginning with "HQC," "KQ," "EQ," and "HS," respectively. We can see that the most important features are the total square footage, the year built, the overall quality, the overall condition, and the garage area. This is fairly easy to understand, as a large house with a large garage that was built recently and is in good quality and condition should command a higher price.

To further reduce the number of features, we used the rate at which coefficients dropped to zero as lambda increased across multiple Lasso regressions. We then ran a grid search on the elastic net. This allowed us to search over whether to use the Ridge penalty or the Lasso penalty (the rho hyperparameter), and how much of each to add (the lambda parameter).
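A sketch of that grid search with scikit-learn (the grid values are illustrative; `X_train`/`y_train` stand for the dummified feature matrix and the log sale price, and scikit-learn calls the mixing ratio `l1_ratio` rather than rho):

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Search over the penalty strength (alpha) and the lasso/ridge mixing ratio (l1_ratio),
# scoring each combination with 5-fold cross-validated RMSE.
param_grid = {
    "alpha": [0.001, 0.01, 0.1, 1.0],
    "l1_ratio": [0.01, 0.1, 0.5, 0.9, 1.0],
}
grid = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)
```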

This gave us a training error of 0.1019 and a testing error of 0.1395.

While the test error is slightly higher than the training error, possibly pointing to overfitting, the process we used is very transparent and can be adjusted or altered without losing the overall methodology. We would choose this model to give to a client that doesn't want a black-box approach.

Tree-Based Models

In addition to the regularized multiple linear regression, we also fit several tree-based models: random forest, gradient boosting, and extreme gradient (XG) boosting. The feature importances plotted above are for the models fit with the feature selection and engineering described above. The black dashed line marks a feature importance of 0.05. Both the random forest and gradient-boosted models heavily favor house style and garage area, with the random forest having only those two features with importance greater than 0.05, and the gradient-boosted model additionally having only the overall house condition over 0.05, and just barely. The XG boost model shares importance across more variables, with six variables having importance greater than 0.05: house style, number of half baths, garage area, total area, overall condition, and masonry veneer area. For all of these models, this raises a question: why do the random forest and gradient-boosted models favor house style and garage area over the size of the house, and why does XG boost favor half baths over full baths? This may be because, if two variables have a high correlation or high contingency, a tree-based method can pick either of them almost interchangeably. House style is likely closely related to the size of the house; for example, a 2-story house should be larger than a 1-story house.
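For reference, a sketch of fitting the three tree-based models and pulling out their feature importances (hyperparameters are placeholders; `X_train`/`y_train` are the prepared features and log sale price):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

models = {
    "random forest": RandomForestRegressor(n_estimators=500, random_state=0),
    "gradient boost": GradientBoostingRegressor(random_state=0),
    "xg boost": XGBRegressor(n_estimators=500, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    # Features clearing the 0.05 importance threshold discussed above.
    importances = pd.Series(model.feature_importances_, index=X_train.columns)
    print(name)
    print(importances[importances > 0.05].sort_values(ascending=False))
```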

While tree-based models are less interpretable than linear models, they do have one notable advantage: categorical features do not have to be converted to dummy variables and can instead simply be label-encoded. That keeps the dimensionality low and the feature space smaller, and therefore easier to fit. Therefore, we added back in most of the variables and re-ran the grid searches. Below are the importances of the 30 most important features in the XG Boost model. Here, we see that the random forest and gradient boost again favor two features, but this time they are different features: lot area and exterior quality. Meanwhile, the most important features for XG boost are the overall condition, lot area, overall quality, the fence type, exterior quality, and garage quality.
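A short sketch of the label-encoding step (whether scikit-learn's LabelEncoder or another encoding was used originally is an assumption):

```python
from sklearn.preprocessing import LabelEncoder

# Label-encode the categorical columns instead of dummifying them,
# keeping a single integer column per feature.
X_tree = train.copy()
for col in X_tree.select_dtypes(include="object").columns:
    X_tree[col] = LabelEncoder().fit_transform(X_tree[col].astype(str))
```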

Conclusions

Below is a table of the training and test (Kaggle) errors for each model. The "v2" indicates fits performed on the dataset with features added back in. The random forest fit on the more aggressively feature-selected dataset has the highest training error as well as the highest test error. However, that fit also appears to have the lowest amount of overfitting, with a difference in errors of 1.85%, while the rest overfit by at least 3%. Adding more features lowered the errors in all of the tree-based models. The lowest overall training and test errors were the result of using XG boost. This model partitioned the feature importance more equally over the variables, unlike the random forest and gradient boosting, which may have contributed to its better overall fit.

Model            MLR      RF       Grad     XG       RF-v2    Grad-v2  XG-v2
Training Error   0.1019   0.1253   0.1053   0.1024   0.1145   0.0918   0.0881
Test Error       0.1395   0.1438   0.1353   0.1342   0.1452   0.1340   0.1247
Overfitting      0.0376   0.0185   0.0300   0.0318   0.0307   0.0422   0.0366

On the reduced dataset, the multiple linear regression performed similarly to the boosted models. This indicates the overall linearity of the features with the log transform of the sale price. While adding more features created a better overall fit, the main advantage of the linear regression is its interpretability: it is easy to see how the features relate to the sale price. For anyone looking to buy or sell a house, the most important features are the overall size, quality, and condition, along with the house style, year built, lot size, and garage size.

About Authors

Lilliana Nishihira

A NY native exploring intersections of data, arts, business, and humanitarianism. I earned my B.A. in Mathematics at Clark University. With a background in digital media, I am especially interested in the ways data describes behavior. I often...

Jennifer Ruddock

A trained physical chemist, my Ph.D. involved analyzing 100+ TB datasets using Python. I fell in love with the world of data and chose to pursue data science by earning a certification from NYC Data Science Academy,...

Hanbo Shao

Data Scientist with a strong quantitative background in mathematics and operations research. Detail-oriented, curious, and motivated to apply data analysis and machine learning skills to solve real-life problems. A team player who enjoys learning new things...

Daniel Laufer

Salutations! I am a data science fellow at NYC Data Science Academy, a theoretical physicist, and a historian.
