This article is a translation of the second most upvoted kernel on the Kaggle House Prices competition, which covers the complete pipeline from data preprocessing to final modeling.
## Importing the Data

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

from scipy import stats
from scipy.stats import norm, skew

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
```
```python
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
```
## Data Preprocessing
```python
# Save the Id column, then drop it since it carries no predictive information
train_ID = train['Id']
train.drop("Id", axis=1, inplace=True)
test_ID = test['Id']
test.drop("Id", axis=1, inplace=True)
```
### Outliers

```python
fig, ax = plt.subplots()
ax.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
```
The plot clearly shows two houses with very large living areas but abnormally low prices, so they can safely be deleted. **Note**: there are other outliers in the training data that could be removed, but there is no guarantee that the test set is free of similar points; deleting a large number of outliers from the training set could hurt accuracy on the test set. Only these two are removed, to keep the model robust.
```python
# Drop the two low-price, large-area outliers and re-plot
train = train.drop(train[(train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000)].index)

fig, ax = plt.subplots()
ax.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
```
### Target Variable

SalePrice is the target variable we need to predict. First, check whether it is normally distributed.
```python
sns.distplot(train['SalePrice'], fit=norm)

plt.legend(['Normal dist'], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# Fit a normal distribution and report its parameters
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Q-Q plot against the normal distribution
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
```
The target variable is right-skewed and needs to be transformed toward normality before linear regression models are applied. Here we use `np.log1p`, which returns $\log(1+x)$.

```python
train['SalePrice'] = np.log1p(train['SalePrice'])

sns.distplot(train['SalePrice'], fit=norm)

plt.legend(['Normal dist'], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
```

**Output**: `mu = 12.02 and sigma = 0.40` (the fitted parameters of the log-transformed prices)
### Correlation

```python
corrmat = train.corr()
plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=0.9, square=True)
```
### Missing Values

```python
# Combine the training and test sets so imputation is applied consistently
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
```

**Output**: `all_data size is : (2917, 79)`
```python
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
missing_data.head(20)

f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
```
Following the explanation of each variable given on the data download page (`data_description.txt`), we fill in the missing values one group at a time.
**PoolQC, MiscFeature, Alley, Fence, FireplaceQu**: a missing value in these variables means the house has no pool, fence, and so on, so fill with `None`.
```python
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
all_data["Alley"] = all_data["Alley"].fillna("None")
all_data["Fence"] = all_data["Fence"].fillna("None")
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
```
**LotFrontage**: the linear feet of street connected to the property; missing values can be filled with the median of this feature among the other houses in the same neighborhood.
```python
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
```
**GarageType, GarageFinish, GarageQual, GarageCond**: garage variables, group 1 (categorical); fill with `None`.
```python
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
```
**GarageYrBlt, GarageArea, GarageCars**: garage variables, group 2 (numeric); fill with 0.
```python
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
```
**BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath, BsmtHalfBath**: basement variables, group 1 (numeric); fill with 0.
```python
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
            'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
```
**BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2**: basement variables, group 2 (categorical); fill with `None`.
```python
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')
```
**MasVnrArea, MasVnrType**: a missing value means the house has no masonry veneer, so fill with 0 and `None` respectively.
```python
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
```
**MSZoning, Electrical, KitchenQual, Exterior1st, Exterior2nd, SaleType**: each of these variables has only a handful of missing values, so fill with the mode.
```python
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
```
**Utilities**: apart from two missing values, this variable takes the same value everywhere except for a single record in the training set, so it carries no predictive information and can be dropped. A quick sanity check is sketched below.
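The check might look like this (a sketch; the counts shown are inferred from the description above, not re-run):

```python
# Distribution of Utilities over the 2917 combined rows
# (value_counts omits the 2 NaN by default)
print(all_data['Utilities'].value_counts())
# AllPub    2914
# NoSeWa       1
```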
```python
all_data = all_data.drop(['Utilities'], axis=1)
```
**Functional**: the data description states that a missing value means `Typ` (typical).
```python
all_data["Functional"] = all_data["Functional"].fillna("Typ")
```
**MSSubClass**: a missing value most likely means the house has no building class; fill with `None`.
```python
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")
```
Then check once more whether any missing values remain.
```python
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
missing_data.head()
```
## Feature Engineering

### Converting Numeric Variables to Categorical

Variables such as MSSubClass (the type of dwelling) and OverallCond (the overall condition of the house) have numeric values, but they are really categorical, so convert them to strings.
```python
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)
all_data['OverallCond'] = all_data['OverallCond'].astype(str)
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)
```
### Converting Ordinal Variables to Numeric

Variables such as FireplaceQu (fireplace quality), BsmtQual (basement height), and GarageQual (garage quality) are string-valued categorical variables, but they carry implicit ordering information (e.g., levels of quality or functionality). We apply label encoding to turn them into integers.
```python
from sklearn.preprocessing import LabelEncoder

cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual',
        'BsmtFinType1', 'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure',
        'GarageFinish', 'LandSlope', 'LotShape', 'PavedDrive', 'Street',
        'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 'YrSold', 'MoSold')
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))
```
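One caveat worth knowing (my own illustration, not from the kernel): `LabelEncoder` assigns codes in sorted (alphabetical) order of the class labels, so the resulting integers follow alphabetical order rather than the semantic quality scale:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are sorted before coding: Ex=0, Fa=1, Gd=2, None=3, TA=4
print(le.fit_transform(['TA', 'Gd', 'None', 'Ex', 'Fa']))  # -> [4 2 3 0 1]
```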
### Creating a New Feature

Since floor area plays a dominant role among the variables that determine house prices, the original author adds a new variable, TotalSF: the total area of the basement, first floor, and second floor.
```python
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
```
### Normalization: the Box-Cox Transform

```python
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Compute the skewness of each numeric feature
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)
```
Apply the Box-Cox transform to normalize the clearly skewed variables. For details on the Box-Cox transform, see the official scipy documentation.
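For reference, `scipy.special.boxcox1p(x, lmbda)`, used below, applies the one-parameter Box-Cox transform to $1+x$:

$$\mathrm{boxcox1p}(x, \lambda) =
\begin{cases}
\dfrac{(1+x)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[4pt]
\log(1+x), & \lambda = 0.
\end{cases}$$

With $\lambda = 0$ it reduces to the `log1p` transform used earlier on SalePrice.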
```python
skewness = skewness[abs(skewness) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    all_data[feat] = boxcox1p(all_data[feat], lam)
```

**Output**: `There are 59 skewed numerical features to Box Cox transform`
### Dummy Variables

Convert all remaining categorical variables to one-hot form. This step can greatly increase the number of columns in the dataset.
```python
all_data = pd.get_dummies(all_data)
print(all_data.shape)
```

**Output**: `(2917, 220)`
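As a toy illustration (my own example, not from the kernel) of what `get_dummies` does to a single categorical column:

```python
import pandas as pd

toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
# One indicator column per category (0/1 here; newer pandas versions use booleans)
print(pd.get_dummies(toy))
#    Street_Grvl  Street_Pave
# 0            0            1
# 1            1            0
# 2            0            1
```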
This completes the data cleaning. Now split the combined dataset back into the training and test sets.
```python
train = all_data[:ntrain]
test = all_data[ntrain:]
```
## Model Training and Testing

```python
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
```
### Defining the Evaluation Method

Calling `cross_val_score` with `cv=5` directly does not shuffle the data, so the original author builds a `KFold` object with `shuffle=True` added manually, then calls `get_n_splits` to get the number of folds. (Note that `get_n_splits` returns only the integer 5, so `cross_val_score` still receives `cv=5` and the shuffling is in fact lost; to actually shuffle, the `KFold` object itself would have to be passed as `cv`.)

`random_state` is a random seed that keeps the results identical to the original author's; this parameter can be omitted.

The returned `rmse` is the model's root mean squared error on the training set. Since Y has already been log-transformed, this value is really the root mean squared error of the log prices, i.e., the RMSLE.
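Spelled out, with $y_i$ and $\hat{y}_i$ denoting sale prices on the original scale:

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(1+\hat{y}_i) - \log(1+y_i)\bigr)^2}$$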
```python
n_folds = 5

def rmsle_cv(model):
    # get_n_splits() returns only the integer number of folds, so cross_val_score
    # effectively receives cv=5; pass the KFold object itself as cv to truly shuffle
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train,
                                    scoring="neg_mean_squared_error", cv=kf))
    return rmse
```
### Base Models

The original author uses three common regularized linear regression models, along with several ensemble learners that are very popular in Kaggle competitions: gradient boosting, XGBoost, and LightGBM.
**Lasso regression**: this algorithm is sensitive to outliers, so it needs extra robustness. Here the pipeline uses `RobustScaler`, a scaler based on statistics that are robust to outliers.
```python
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
```
**ElasticNet regression**:

```python
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
```
**Kernel Ridge regression**:

```python
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
```
**Gradient Boosting regression**: uses the Huber loss, which is more robust to outliers.
```python
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10,
                                   loss='huber', random_state=5)
```
**XGBoost**:

```python
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
                             learning_rate=0.05, max_depth=3,
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state=7, nthread=-1)
```
**LightGBM**:

```python
model_lgb = lgb.LGBMRegressor(objective='regression', num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin=55, bagging_fraction=0.8,
                              bagging_freq=5, feature_fraction=0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf=6, min_sum_hessian_in_leaf=11)
```
Evaluate each base model's cross-validation score using RMSLE.
```python
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
```

**Output**: `Lasso score: 0.1115 (0.0074)`
```python
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
```

**Output**: `ElasticNet score: 0.1116 (0.0074)`
```python
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
```

**Output**: `Kernel Ridge score: 0.1153 (0.0075)`
```python
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
```

**Output**: `Gradient Boosting score: 0.1177 (0.0080)`
```python
score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
```

**Output**: `Xgboost score: 0.1151 (0.0069)`
```python
score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
```

**Output**: `LGBM score: 0.1162 (0.0071)`
### Ensemble Models

#### The simplest ensemble: averaging the base models

We define a new class, `AveragingModels`, to wrap the custom model. `sklearn.base` provides the parent classes to inherit from:

- `BaseEstimator`: the base class for all estimators; provides the methods for reading and writing parameters (`get_params()` and `set_params()`).
- `RegressorMixin`: the mixin class for regression models; provides `score()`, which returns the model's $R^2$.
- `TransformerMixin`: a mixin class that provides the `fit_transform()` method.
- `clone`: constructs a new estimator with the same parameters, without copying any fitted data.
```python
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # Fit clones of the base models, leaving the originals untouched
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        for model in self.models_:
            model.fit(X, y)
        return self

    # Average the predictions of the fitted clones
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_])
        return np.mean(predictions, axis=1)
```
Build an averaging ensemble from ENet, GBoost, KRR, and lasso:
```python
averaged_models = AveragingModels(models=(ENet, GBoost, KRR, lasso))

score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
```

**Output**: `Averaged base models score: 0.1091 (0.0075)`
Even this simplest averaging ensemble scores better than any single model. Building on it, we can try a more sophisticated ensemble.
#### Going further: a stacking ensemble

The key idea of stacking is to add a meta-model on top of the averaged base models and let the meta-model make the final prediction. If you are interested, see the introductory kernel on stacking ensembles.
The whole stacking process has two levels:

1. The first level uses several different base models (the first-level learners) together with K-fold splitting (assume K=5 here): the original training set is split into 5 folds, and over 5 iterations each first-level learner is trained on 4 of the folds and makes predictions on the remaining one. After the 5 iterations, we have out-of-fold predictions for the entire original training set; combined, these form the second-level training set. In each iteration the learner also predicts on the entire original test set, and the 5 resulting predictions are averaged to form the second-level test set.
2. The second-level training sets produced by all the first-level learners are stacked column-wise as input, with the original Y as output, to train the meta-model, which is then applied to the second-level test set to obtain the final prediction.
```python
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # Fit clones of the base models on K-fold splits, then fit the cloned
    # meta-model on their out-of-fold predictions
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)

        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred

        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    # Average the fold-clones' predictions per base model to form the
    # meta-features, then let the meta-model make the final prediction
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)
```
Use ENet, GBoost, and KRR as the first-level learners and lasso as the meta-learner.
```python
stacked_averaged_models = StackingAveragedModels(base_models=(ENet, GBoost, KRR),
                                                 meta_model=lasso)

score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
```

**Output**: `Stacking Averaged models score: 0.1085 (0.0074)`
Judging by the RMSLE score, the stacking ensemble yields a further improvement in cross-validation accuracy.
#### Combining Stacking, XGBoost, and LightGBM

First define the RMSLE scoring function:
```python
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))
```
```python
stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
stacked_pred = np.expm1(stacked_averaged_models.predict(test.values))
print(rmsle(y_train, stacked_train_pred))
```

**Output**: `0.07815719379164626`
```python
model_xgb.fit(train, y_train)
xgb_train_pred = model_xgb.predict(train)
xgb_pred = np.expm1(model_xgb.predict(test))
print(rmsle(y_train, xgb_train_pred))
```

**Output**: `0.07879894799249872`
```python
model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)
lgb_pred = np.expm1(model_lgb.predict(test.values))
print(rmsle(y_train, lgb_train_pred))
```

**Output**: `0.07307464036005418`
The final prediction is a weighted average of the three ensemble models. The original author does not explain how the weights were chosen; presumably they were assigned according to each model's RMSLE.
```python
print('RMSLE score on train data:')
print(rmsle(y_train, stacked_train_pred*0.70 + xgb_train_pred*0.15 + lgb_train_pred*0.15))
```

**Output**: `RMSLE score on train data: 0.07543604303996428`
```python
ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15
```
Applying this final weighted average on the test set gives predictions that can be written to a CSV file and submitted. The original author's final RMSLE score was 0.11420, which ranked in the top 4% as of July 2, 2017.
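The post omits the submission code itself; a minimal sketch in the usual Kaggle format (the file name is arbitrary) might look like this:

```python
# Assemble the submission from the saved test IDs and the ensembled predictions
# (ensemble is already back-transformed to the price scale via expm1 above)
sub = pd.DataFrame()
sub['Id'] = test_ID
sub['SalePrice'] = ensemble
sub.to_csv('submission.csv', index=False)
```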
## Summary

The following aspects of this kernel are worth learning from:

- Careful handling of missing values.
- New variables created to match the realities of the data, sensible conversions between numeric and categorical types, and dummy variables.
- The Box-Cox transform rather than a plain log transform for skewed variables.
- Stacking, one of the great weapons for achieving a good ranking on Kaggle.