kaggle house price

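Getting started with Kaggle competitions
Importing common data analysis and modeling libraries
Data processing
Data fields
Exploratory Data Analysis
Target value transformation
- Handling skewed features in the data
Preparing data for model training
- feature importance
Training different models
- Parameter tuning
stacking method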

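Getting started with Kaggle competitions

For people who are new to machine learning, Kaggle competitions are a major platform for learning and for testing their skills against participants from all over the world. The platform offers a number of entry-level competitions where beginners can try their hand.
This article works through the house price prediction competition, covering EDA, feature engineering, and model tuning, along with a few small tricks that come up in competitions. I hope it is of some help. My own foundation is only average, so if anything here is laughably wrong, feel free to criticize.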

from IPython.display import HTML

from IPython.display import Image

HTML('''<script>

code_show=true;

function code_toggle() {

 if (code_show){

 $('div.input').hide();

 } else {

 $('div.input').show();

 }

 code_show = !code_show

}

$( document ).ready(code_toggle);

</script>

<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

Importing common data analysis and modeling libraries

import pandas as pd

import numpy as np

!ls

data_description.txt

data_description.zip

kaggle house price.ipynb

sample_submission.csv

stacking-house-prices-walkthrough-to-top-5.ipynb

test.csv

train.csv

train = pd.read_csv('train.csv')

test = pd.read_csv('test.csv')

train.head()

0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	12	2008	WD	Normal	250000

5 rows × 81 columns

train.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 1460 entries, 0 to 1459

Data columns (total 81 columns):

Id               1460 non-null int64

MSSubClass       1460 non-null int64

MSZoning         1460 non-null object

LotFrontage      1201 non-null float64

LotArea          1460 non-null int64

Street           1460 non-null object

Alley            91 non-null object

LotShape         1460 non-null object

LandContour      1460 non-null object

Utilities        1460 non-null object

LotConfig        1460 non-null object

LandSlope        1460 non-null object

Neighborhood     1460 non-null object

Condition1       1460 non-null object

Condition2       1460 non-null object

BldgType         1460 non-null object

HouseStyle       1460 non-null object

OverallQual      1460 non-null int64

OverallCond      1460 non-null int64

YearBuilt        1460 non-null int64

YearRemodAdd     1460 non-null int64

RoofStyle        1460 non-null object

RoofMatl         1460 non-null object

Exterior1st      1460 non-null object

Exterior2nd      1460 non-null object

MasVnrType       1452 non-null object

MasVnrArea       1452 non-null float64

ExterQual        1460 non-null object

ExterCond        1460 non-null object

Foundation       1460 non-null object

BsmtQual         1423 non-null object

BsmtCond         1423 non-null object

BsmtExposure     1422 non-null object

BsmtFinType1     1423 non-null object

BsmtFinSF1       1460 non-null int64

BsmtFinType2     1422 non-null object

BsmtFinSF2       1460 non-null int64

BsmtUnfSF        1460 non-null int64

TotalBsmtSF      1460 non-null int64

Heating          1460 non-null object

HeatingQC        1460 non-null object

CentralAir       1460 non-null object

Electrical       1459 non-null object

1stFlrSF         1460 non-null int64

2ndFlrSF         1460 non-null int64

LowQualFinSF     1460 non-null int64

GrLivArea        1460 non-null int64

BsmtFullBath     1460 non-null int64

BsmtHalfBath     1460 non-null int64

FullBath         1460 non-null int64

HalfBath         1460 non-null int64

BedroomAbvGr     1460 non-null int64

KitchenAbvGr     1460 non-null int64

KitchenQual      1460 non-null object

TotRmsAbvGrd     1460 non-null int64

Functional       1460 non-null object

Fireplaces       1460 non-null int64

FireplaceQu      770 non-null object

GarageType       1379 non-null object

GarageYrBlt      1379 non-null float64

GarageFinish     1379 non-null object

GarageCars       1460 non-null int64

GarageArea       1460 non-null int64

GarageQual       1379 non-null object

GarageCond       1379 non-null object

PavedDrive       1460 non-null object

WoodDeckSF       1460 non-null int64

OpenPorchSF      1460 non-null int64

EnclosedPorch    1460 non-null int64

3SsnPorch        1460 non-null int64

ScreenPorch      1460 non-null int64

PoolArea         1460 non-null int64

PoolQC           7 non-null object

Fence            281 non-null object

MiscFeature      54 non-null object

MiscVal          1460 non-null int64

MoSold           1460 non-null int64

YrSold           1460 non-null int64

SaleType         1460 non-null object

SaleCondition    1460 non-null object

SalePrice        1460 non-null int64

dtypes: float64(3), int64(35), object(43)

memory usage: 924.0+ KB

print(train.shape)

print(test.shape)

(1460, 81)

(1459, 80)

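The data is structured much like the Boston housing price data. This dataset has 79 features describing each house; the meaning of each field can be looked up in the data description.
This article also covers methods for handling missing values.
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
These three features are missing in most rows; how to handle them is described later.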

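Data processing

Outliers are usually values that fall outside the expected range. How to define an outlier and how to handle it depends on the person and on the specific problem.
Outliers usually lie away from the bulk of the data, so they tend to distort both computed results and the apparent distribution of the data.
Take the figure below as an example.

[Figure: example of outliers]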

with open ('data_description.txt','r') as f:

    for i in f.readlines():

        print(i)

        break

MSSubClass: Identifies the type of dwelling involved in the sale.

Data fields

Here's a brief version of what you'll find in the data description file.

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
Let's look first at the feature GrLivArea: Above grade (ground) living area square feet, i.e. the living area in square feet.

Removing outliers

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

sns.set(style='white', context='notebook', palette='deep')

plt.subplots(figsize=(15,8))

plt.subplot(1,2,1)

g= sns.regplot(x=train['GrLivArea'],y= train['SalePrice'],fit_reg=False).set_title('Before')

plt.subplot(1,2,2)

train= train.drop(train[train['GrLivArea']>4000].index)

g=sns.regplot(x=train['GrLivArea'],y=train['SalePrice'],fit_reg=False).set_title('After')

[Figure: scatter plots of GrLivArea against SalePrice, titled "Before" and "After"]

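The plots above show 4 samples with a living area greater than 4000 square feet, and these four deviate severely from the rest of the distribution.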
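As a minimal sketch of the same threshold-based filtering, the snippet below uses a small made-up DataFrame (the name `toy` and all of its values are assumptions for illustration, not rows from the competition data). It keeps rows with a boolean mask instead of dropping by index, which gives the same result for this kind of rule.

```python
import pandas as pd

# Toy data standing in for the training set; the values are invented.
toy = pd.DataFrame({
    "GrLivArea": [1500, 2100, 4700, 1800, 5600],
    "SalePrice": [180000, 250000, 160000, 210000, 185000],
})

# Keep only the rows at or below the threshold used in the text.
threshold = 4000
filtered = toy[toy["GrLivArea"] <= threshold].reset_index(drop=True)

print(len(toy) - len(filtered), "rows removed")
```

Resetting the index here is optional; the notebook code keeps the original index labels after dropping.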

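Handling missing values

Missing values can be caused by manual data-entry mistakes, machine errors, and similar problems.
In some cases missing values can be filled with 0, provided we know what the feature means and that a missing value really does stand for 0.
In practice, filling with 0 is not always the best choice, and different algorithms differ in how well they cope with missing values. This article fits the house prices with several algorithms, so the missing values need to be handled properly. There are generally two approaches:
- Drop the columns that contain missing values
- Fill in the missing values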
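The two approaches can be sketched on a tiny made-up DataFrame. The frame `toy` and its values are assumptions for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly-missing categorical column and one numeric gap.
toy = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan, np.nan],
    "GarageArea": [548.0, np.nan, 608.0, 642.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Approach 1: drop every column that contains at least one missing value.
dropped = toy.dropna(axis=1)

# Approach 2: fill the gaps, choosing the fill value per column.
filled = toy.copy()
filled["PoolQC"] = filled["PoolQC"].fillna("None")     # missing means "no pool"
filled["GarageArea"] = filled["GarageArea"].fillna(0)  # missing means "no garage"

print(dropped.columns.tolist())
print(filled.isnull().sum().sum())
```

Dropping is simple but discards the non-missing entries too, which is why the rest of this article fills values instead.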

# First record the lengths of the training and test sets for later use

ntrain = train.shape[0]

ntest = test.shape[0]

# Save the training target values, i.e. SalePrice

y_train = train.SalePrice.values

all_data = pd.concat((train,test)).reset_index(drop=True)

all_data.drop(['SalePrice'],axis=1,inplace=True)

all_data.drop(['Id'],axis=1,inplace=True)

print('all data shape:{}'.format(all_data.shape))

all data shape:(2915, 79)


all_data_na = all_data.isnull().sum()

all_data_na.sort_values(ascending=False)

PoolQC           2907

MiscFeature      2810

Alley            2717

Fence            2345

FireplaceQu      1420

LotFrontage       486

GarageFinish      159

GarageQual        159

GarageYrBlt       159

GarageCond        159

GarageType        157

BsmtCond           82

BsmtExposure       82

BsmtQual           81

BsmtFinType2       80

BsmtFinType1       79

MasVnrType         24

MasVnrArea         23

MSZoning            4

BsmtHalfBath        2

Utilities           2

Functional          2

BsmtFullBath        2

Electrical          1

Exterior2nd         1

KitchenQual         1

GarageCars          1

Exterior1st         1

GarageArea          1

TotalBsmtSF         1

                 ...

GrLivArea           0

YearRemodAdd        0

YearBuilt           0

WoodDeckSF          0

TotRmsAbvGrd        0

Street              0

ScreenPorch         0

SaleCondition       0

RoofStyle           0

RoofMatl            0

PoolArea            0

PavedDrive          0

OverallQual         0

OverallCond         0

OpenPorchSF         0

Neighborhood        0

MoSold              0

MiscVal             0

MSSubClass          0

LowQualFinSF        0

LotShape            0

LotConfig           0

LotArea             0

LandSlope           0

LandContour         0

KitchenAbvGr        0

HouseStyle          0

HeatingQC           0

Heating             0

1stFlrSF            0

Length: 79, dtype: int64

all_data_na = all_data_na.drop(all_data_na[all_data_na==0].index).sort_values(ascending=False)

plt.subplots(figsize=(12,6))

all_data_na.plot(kind='bar')

[Figure: bar chart of the number of missing values per feature]


!pip install xgboost

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple

Requirement already satisfied: xgboost in /Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages (0.90)

Requirement already satisfied: numpy in /Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages (from xgboost) (1.16.2)

Requirement already satisfied: scipy in /Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages (from xgboost) (1.2.1)

train[all_data_na.index[:25]].info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 1456 entries, 0 to 1459

Data columns (total 25 columns):

PoolQC          5 non-null object

MiscFeature     54 non-null object

Alley           91 non-null object

Fence           280 non-null object

FireplaceQu     766 non-null object

LotFrontage     1197 non-null float64

GarageQual      1375 non-null object

GarageCond      1375 non-null object

GarageFinish    1375 non-null object

GarageYrBlt     1375 non-null float64

GarageType      1375 non-null object

BsmtExposure    1418 non-null object

BsmtCond        1419 non-null object

BsmtQual        1419 non-null object

BsmtFinType2    1418 non-null object

BsmtFinType1    1419 non-null object

MasVnrType      1448 non-null object

MasVnrArea      1448 non-null float64

MSZoning        1456 non-null object

BsmtFullBath    1456 non-null int64

BsmtHalfBath    1456 non-null int64

Utilities       1456 non-null object

Functional      1456 non-null object

Electrical      1455 non-null object

BsmtUnfSF       1456 non-null int64

dtypes: float64(3), int64(3), object(19)

memory usage: 295.8+ KB

For categorical features where a missing value means the house does not have that feature, we fill the missing values with "None".
For numeric features where a missing value means the feature is absent (for example, no garage or no basement), we fill the missing values with 0; LotFrontage, which has many missing values, is filled with a median instead.
For features with only a few missing values, we fill the missing values with the mode.

for col in ("PoolQC", 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageQual', 'GarageCond',

            'GarageFinish', 'GarageType','BsmtExposure','BsmtCond','BsmtQual','BsmtFinType2','BsmtFinType1',

           'MasVnrType'):

    all_data[col] = all_data[col].fillna('None')

print('Missing values in object-type features filled with "None", based on the feature descriptions: done')

for col in ("GarageYrBlt", "GarageArea", "GarageCars", "BsmtFinSF1",

           "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "MasVnrArea",

           "BsmtFullBath", "BsmtHalfBath"):

    all_data[col] = all_data[col].fillna(0)

print('Missing values in numeric features filled with 0, based on the feature descriptions: done')

all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])

all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])

all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])

all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])

all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])

all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])

all_data["Functional"] = all_data["Functional"].fillna(all_data['Functional'].mode()[0])

print('Features with only a few missing values filled with the mode of each feature: done')

all_data_na = all_data.isnull().sum()

print("Features with missing values: ", all_data_na.drop(all_data_na[all_data_na == 0].index))

Missing values in object-type features filled with "None", based on the feature descriptions: done

Missing values in numeric features filled with 0, based on the feature descriptions: done

Features with only a few missing values filled with the mode of each feature: done

Features with missing values:  LotFrontage    486

Utilities        2

dtype: int64

all_data.groupby(["Neighborhood"])['LotFrontage'].sum()

Neighborhood

Blmngtn      938.0

Blueste      273.0

BrDale       645.0

BrkSide     5300.0

ClearCr     1763.0

CollgCr    15694.0

Crawfor     5806.0

Edwards    11467.0

Gilbert     8237.0

IDOTRR      5415.0

MeadowV      845.0

Mitchel     6763.0

NAmes      28204.0

NPkVill      591.0

NWAmes      6929.0

NoRidge     4684.0

NridgHt    13722.0

OldTown    14147.0

SWISU       2599.0

Sawyer      7306.0

SawyerW     7491.0

Somerst    10457.0

StoneBr     2860.0

Timber      4626.0

Veenker     1152.0

Name: LotFrontage, dtype: float64

all_data['LotFrontage']=all_data.groupby("Neighborhood")["LotFrontage"].transform(

    lambda x: x.fillna(x.median()))
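To make the group-wise fill easier to follow, here is a minimal sketch on invented data (the frame `toy` and the neighborhood labels "A" and "B" are assumptions). Each missing LotFrontage is replaced by the median of the other houses in the same neighborhood, not by the global median.

```python
import numpy as np
import pandas as pd

# Two made-up neighborhoods, each with one missing LotFrontage value.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 30.0, 40.0, np.nan],
})

# transform() returns a result aligned with the original rows, so it can
# be assigned straight back to the column.
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)

print(toy["LotFrontage"].tolist())
```

The gap in neighborhood A is filled with the median of 60 and 80, and the gap in B with the median of 30 and 40.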

Analyzing Utilities

plt.subplots(figsize=(12,5))

plt.subplot(1,2,1)

g=sns.countplot(x='Utilities',data=train).set_title('Utilities_train')

plt.subplot(1,2,2)

g=sns.countplot(x='Utilities',data=test).set_title('Utilities_test')

[Figure: count plots of Utilities in the training set and in the test set]

train['Utilities'].value_counts()

AllPub    1455

NoSeWa       1

Name: Utilities, dtype: int64

test['Utilities'].value_counts()

AllPub    1457

Name: Utilities, dtype: int64

all_data = all_data.drop(['Utilities'], axis=1)

all_data_na = all_data.isnull().sum()

print("Features with missing values: ", len(all_data_na.drop(all_data_na[all_data_na == 0].index)))

Features with missing values:  0

Exploratory Data Analysis

Correlation matrix

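Outliers and missing values have now been handled. The next step is to look at the relationships among the features and between the features and the target; the correlation matrix provides a reference for how each feature relates to the target.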

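# Use only the numeric columns; recent pandas versions raise an error on string columns
corr = train.select_dtypes(include=[np.number]).corr()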

plt.subplots(figsize=(30,30))

cmap = sns.diverging_palette(150, 250, as_cmap=True)

sns.heatmap(corr, cmap="RdYlBu", vmax=1, vmin=-0.6, center=0.2, square=True, linewidths=0, cbar_kws={"shrink": .5}, annot = True)

[Figure: annotated heatmap of the correlation matrix of the training set]

For the features that most strongly influence SalePrice, we can do some feature engineering.
From the correlation matrix, we pick out the features that correlate most strongly with the final sale price for further analysis.
The main influencing factors are:

OverallQual: Overall material and finish quality
GrLivArea: Above grade (ground) living area square feet
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
TotalBsmtSF: Total square feet of basement area
1stFlrSF: First Floor square feet
FullBath: Full bathrooms above grade
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Fireplaces: Number of fireplaces
MasVnrArea: Masonry veneer area in square feet
BsmtFinSF1: Type 1 finished square feet
LotFrontage: Linear feet of street connected to property
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
2ndFlrSF: Second floor square feet
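One way to arrive at such a shortlist is to rank every numeric column by its correlation with the target. The sketch below does this on a small invented frame (`toy` and its numbers are assumptions); it is not necessarily how the list above was produced.

```python
import pandas as pd

# Made-up numeric data; only the column names come from the dataset.
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9],
    "GrLivArea": [1000, 1500, 1400, 2000, 2600],
    "MoSold": [6, 1, 12, 3, 7],
    "SalePrice": [120000, 160000, 175000, 230000, 310000],
})

# Take the SalePrice column of the correlation matrix, remove the
# self-correlation, and sort from strongest to weakest.
corr_with_target = (
    toy.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)

print(corr_with_target)
```

On the real data, a cutoff on the absolute correlation could then select the features to engineer further.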

# Quadratic

all_data["OverallQual-2"] = all_data["OverallQual"] ** 2

all_data["GrLivArea-2"] = all_data["GrLivArea"] ** 2

all_data["GarageCars-2"] = all_data["GarageCars"] ** 2

all_data["GarageArea-2"] = all_data["GarageArea"] ** 2

all_data["TotalBsmtSF-2"] = all_data["TotalBsmtSF"] ** 2

all_data["1stFlrSF-2"] = all_data["1stFlrSF"] ** 2

all_data["FullBath-2"] = all_data["FullBath"] ** 2

all_data["TotRmsAbvGrd-2"] = all_data["TotRmsAbvGrd"] ** 2

all_data["Fireplaces-2"] = all_data["Fireplaces"] ** 2

all_data["MasVnrArea-2"] = all_data["MasVnrArea"] ** 2

all_data["BsmtFinSF1-2"] = all_data["BsmtFinSF1"] ** 2

all_data["LotFrontage-2"] = all_data["LotFrontage"] ** 2

all_data["WoodDeckSF-2"] = all_data["WoodDeckSF"] ** 2

all_data["OpenPorchSF-2"] = all_data["OpenPorchSF"] ** 2

all_data["2ndFlrSF-2"] = all_data["2ndFlrSF"] ** 2

print("Quadratics done!...")

# Cubic

all_data["OverallQual-3"] = all_data["OverallQual"] ** 3

all_data["GrLivArea-3"] = all_data["GrLivArea"] ** 3

all_data["GarageCars-3"] = all_data["GarageCars"] **3

all_data["GarageArea-3"] = all_data["GarageArea"] ** 3

all_data["TotalBsmtSF-3"] = all_data["TotalBsmtSF"] ** 3

all_data["1stFlrSF-3"] = all_data["1stFlrSF"] ** 3

all_data["FullBath-3"] = all_data["FullBath"] ** 3

all_data["TotRmsAbvGrd-3"] = all_data["TotRmsAbvGrd"] ** 3

all_data["Fireplaces-3"] = all_data["Fireplaces"] ** 3

all_data["MasVnrArea-3"] = all_data["MasVnrArea"] ** 3

all_data["BsmtFinSF1-3"] = all_data["BsmtFinSF1"] ** 3

all_data["LotFrontage-3"] = all_data["LotFrontage"] ** 3

all_data["WoodDeckSF-3"] = all_data["WoodDeckSF"] ** 3

all_data["OpenPorchSF-3"]=all_data["OpenPorchSF"] ** 3

all_data["2ndFlrSF-3"]= all_data["2ndFlrSF"] ** 3

print("Cubics done!...")

# Square Root

all_data["OverallQual-Sq"] = np.sqrt(all_data["OverallQual"])

all_data["GrLivArea-Sq"] = np.sqrt(all_data["GrLivArea"])

all_data["GarageCars-Sq"] = np.sqrt(all_data["GarageCars"])

all_data["GarageArea-Sq"] = np.sqrt(all_data["GarageArea"])

all_data["TotalBsmtSF-Sq"] = np.sqrt(all_data["TotalBsmtSF"])

all_data["1stFlrSF-Sq"] = np.sqrt(all_data["1stFlrSF"])

all_data["FullBath-Sq"] = np.sqrt(all_data["FullBath"])

all_data["TotRmsAbvGrd-Sq"] = np.sqrt(all_data["TotRmsAbvGrd"])

all_data["Fireplaces-Sq"] = np.sqrt(all_data["Fireplaces"])

all_data["MasVnrArea-Sq"] = np.sqrt(all_data["MasVnrArea"])

all_data["BsmtFinSF1-Sq"] = np.sqrt(all_data["BsmtFinSF1"])

all_data["LotFrontage-Sq"] = np.sqrt(all_data["LotFrontage"])

all_data["WoodDeckSF-Sq"] = np.sqrt(all_data["WoodDeckSF"])

all_data["OpenPorchSF-Sq"] = np.sqrt(all_data["OpenPorchSF"])

all_data["2ndFlrSF-Sq"] = np.sqrt(all_data["2ndFlrSF"])

print("Roots done!...")

Quadratics done!...

Cubics done!...

Roots done!...
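The same squared, cubed, and square-root features can also be generated in a loop, which avoids copy-and-paste slips in the column names. This is a sketch on a made-up two-column frame (`toy` and the `features` list are assumptions); the suffixes follow the naming used above.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented.
toy = pd.DataFrame({
    "OverallQual": [5, 7, 9],
    "GrLivArea": [1000, 1600, 2500],
})

# Columns to expand; on the real data this would be the shortlist above.
features = ["OverallQual", "GrLivArea"]

for col in features:
    toy[col + "-2"] = toy[col] ** 2        # quadratic term
    toy[col + "-3"] = toy[col] ** 3        # cubic term
    toy[col + "-Sq"] = np.sqrt(toy[col])   # square root

print(toy.columns.tolist())
```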

BsmtQual

train['BsmtQual'].value_counts()

TA    649

Gd    618

Ex    117

Fa     35

Name: BsmtQual, dtype: int64

train.groupby(['BsmtQual'])['SalePrice'].mean()

"""

BsmtQual: Evaluates the height of the basement

       Ex	Excellent (100+ inches)

       Gd	Good (90-99 inches)

       TA	Typical (80-89 inches)

       Fa	Fair (70-79 inches)

       Po	Poor (<70 inches

       NA	No Basement

"""


plt.subplots(figsize=(20,6))

plt.subplot(1,3,1)  # box plot

sns.boxplot(x='BsmtQual',y='SalePrice',data=train,order= ['Fa', 'TA', 'Gd', 'Ex'])

plt.subplot(1,3,2)  # strip plot: individual points grouped by the categories on the x axis

sns.stripplot(x='BsmtQual',y='SalePrice',data=train,size=5,jitter=True,order= ['Fa', 'TA', 'Gd', 'Ex'])

plt.subplot(1,3,3)  # bar plot

sns.barplot(x='BsmtQual',y='SalePrice',data=train,order= ['Fa', 'TA', 'Gd', 'Ex'],estimator=np.mean)

[Figure: box plot, strip plot, and bar plot of SalePrice by BsmtQual]

all_data['BsmtQual'] = all_data['BsmtQual'].map({"None":0, "Fa":1, "TA":2, "Gd":3, "Ex":4})

all_data['BsmtQual'].unique()

array([3, 2, 4, 0, 1])

all_data['BsmtQual'].value_counts()

2    1283

3    1209

4     254

1      88

0      81

Name: BsmtQual, dtype: int64

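Clearly, this feature has a marked effect on the sale price: the taller the basement, the higher the price.
The Typical and Good categories contain most of the samples.
The values of this feature have a natural order from worse to better, i.e. it is an ordinal categorical feature, so it can be converted to numbers (personally I think this makes little difference).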
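A minimal sketch of this ordinal encoding on a made-up Series is shown below (the Series `toy` is an assumption; the dictionary mirrors the mapping used for BsmtQual). One thing to watch: `map` with a dictionary turns any label that is not a key into NaN, so it is worth checking for nulls afterwards.

```python
import pandas as pd

# Order of the quality labels from worst to best; "None" marks no basement.
quality_order = {"None": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}

# Toy values covering every label in the dictionary.
toy = pd.Series(["Gd", "TA", "Ex", "None", "Fa"], name="BsmtQual")

encoded = toy.map(quality_order)

# Labels missing from the dictionary would show up here as NaN.
print(encoded.isnull().sum())
print(encoded.tolist())
```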

BsmtCond

"""

BsmtCond: Evaluates the general condition of the basement

       Ex	Excellent

       Gd	Good

       TA	Typical - slight dampness allowed

       Fa	Fair - dampness or some cracking or settling

       Po	Poor - Severe cracking, settling, or wetness

       NA	No Basement

"""


plt.subplots(figsize=(20,5))

plt.subplot(1,3,1)

sns.boxplot(x='BsmtCond',y='SalePrice',data=train,order=['Po','Fa','TA','Gd'])

plt.subplot(1,3,2)

sns.stripplot(x='BsmtCond',y='SalePrice',data=train,size=5,jitter=True,order= ['Po','Fa','TA','Gd'])

plt.subplot(1,3,3)

sns.barplot(x='BsmtCond',y='SalePrice',data=train,order=['Po','Fa','TA','Gd'])

[Figure: box plot, strip plot, and bar plot of SalePrice by BsmtCond]

train['BsmtCond'].value_counts()

TA    1307

Gd      65

Fa      45

Po       2

Name: BsmtCond, dtype: int64

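In the second plot, Typical (TA) accounts for most of the samples, and the bar plot shows that this feature clearly affects the sale price.
In the first plot, prices within the TA category are widely spread.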

all_data['BsmtCond'] = all_data['BsmtCond'].map({"None":0, "Po":1, "Fa":2, "TA":3,"Gd":4, "Ex":5})

all_data['BsmtCond'].unique()

array([3, 4, 0, 2, 1])

BsmtExposure

"""

BsmtExposure: Refers to walkout or garden level walls

       Gd	Good Exposure

       Av	Average Exposure (split levels or foyers typically score average or above)

       Mn	Mimimum Exposure

       No	No Exposure

       NA	No Basement

"""


plt.subplots(figsize=(20,5))

plt.subplot(1,3,1)

sns.boxplot(x='BsmtExposure',y='SalePrice',data=train,order=['No','Mn','Av','Gd'])

plt.subplot(1,3,2)

sns.stripplot(x='BsmtExposure',y='SalePrice',data=train,size=5,jitter=True,order= ['No','Mn','Av','Gd'])

plt.subplot(1,3,3)

sns.barplot(x='BsmtExposure',y='SalePrice',data=train,order=['No','Mn','Av','Gd'])

[Figure: box plot, strip plot, and bar plot of SalePrice by BsmtExposure]

all_data['BsmtExposure'] = all_data['BsmtExposure'].map({"None":0, "No":1, "Mn":2, "Av":3,"Gd":4})

all_data['BsmtExposure'].unique()

array([1, 4, 2, 3, 0])

BsmtFinType1

"""

BsmtFinType1: Rating of basement finished area

       GLQ	Good Living Quarters

       ALQ	Average Living Quarters

       BLQ	Below Average Living Quarters

       Rec	Average Rec Room

       LwQ	Low Quality

       Unf	Unfinshed

       NA	No Basement

"""


plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)

sns.boxplot(x="BsmtFinType1", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);

plt.subplot(1, 3, 2)

sns.stripplot(x="BsmtFinType1", y="SalePrice", data=train, size = 5, jitter = True, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);

plt.subplot(1, 3, 3)

sns.barplot(x="BsmtFinType1", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);

[Figure: box plot, strip plot, and bar plot of SalePrice by BsmtFinType1]

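The first plot shows that many houses with unfinished basements sell at high prices.
The third plot shows that the sale price does not rise in step with the order of the categories; there is no necessary relationship between the price and the category order.
So this feature is one-hot encoded, which can be done with the get_dummies function in pandas.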
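To show what the one-hot conversion produces, here is a small sketch on invented rows (the frame `toy` is an assumption). Each category becomes its own indicator column, and the original column is removed.

```python
import pandas as pd

# Toy frame with one categorical column and one numeric column.
toy = pd.DataFrame({
    "BsmtFinType1": ["GLQ", "Unf", "ALQ", "Unf"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# columns= limits the encoding to the listed columns; prefix= names the output.
encoded = pd.get_dummies(toy, columns=["BsmtFinType1"], prefix="BsmtFinType1")

print(encoded.columns.tolist())
```

The indicator columns are appended after the untouched columns, one per category in sorted order.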

all_data = pd.get_dummies(all_data, columns = ["BsmtFinType1"], prefix="BsmtFinType1")

all_data.head(3)

0	856	854	None	3	1Fam	3	1	706.0	...	0.000000	7.810250	29.223278	0	1
1	1262	0	None	3	1Fam	3	4	978.0	...	17.262677	0.000000	0.000000	1	0
2	920	866	None	3	1Fam	3	2	486.0	...	0.000000	6.480741	29.427878	0	1

3 rows × 129 columns

BsmtFinSF1

BsmtFinSF1: Type 1 finished square feet

from scipy.stats import pearsonr

grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25)

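# GridSpec specifies the geometry of the grid in which the subplots are placed. The number of rows and columns must be set; subplot layout parameters (e.g. left, right) can optionally be adjusted.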

plt.subplots(figsize=(30,15))

plt.subplot(grid[0,0])

g = sns.regplot(x=train['BsmtFinSF1'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['BsmtFinSF1'], train['SalePrice'])[0]))

# g= sns.regplot(x=train['BsmtFinSF1'],y=train["SalePrice"],fit_reg==False,label= "Corr:%2f" %(pearsonr(train['BsmtFinType1'],train['SalePrice'])[0]))

g.legend(loc='best')

plt.subplot(grid[0,1:])

sns.boxplot(x='Neighborhood',y='BsmtFinSF1',data=train)

plt.subplot(grid[1,0])

sns.barplot(x='BldgType',y= 'BsmtFinSF1',data=train)

plt.subplot(grid[1,1])

sns.barplot(x='HouseStyle',y ='BsmtFinSF1',data=train)

plt.subplot(grid[1,2])

sns.barplot(x='LotShape',y='BsmtFinSF1',data=train)

[Figure: scatter plot of BsmtFinSF1 against SalePrice, and plots of BsmtFinSF1 by Neighborhood, BldgType, HouseStyle, and LotShape]

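The finished basement area has a strong influence on the sale price, but it varies across Neighborhood, BldgType, HouseStyle, and LotShape with no discernible pattern.
Since this is a continuous numeric feature, we cut it into bins.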
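The binning step can be sketched with `pd.cut` on a few made-up values (the Series `values` is an assumption; the bin edges match the ones used for BsmtFinSF1). By default each bin includes its right edge, so a value of exactly 1000 falls into the first band.

```python
import pandas as pd

# Invented finished-area values, including one sitting on a bin edge.
values = pd.Series([0, 706, 1000, 1500, 2500, 5644], name="BsmtFinSF1")

# Edges define the intervals (-5, 1000], (1000, 2000], (2000, 3000], (3000, inf].
bins = [-5, 1000, 2000, 3000, float("inf")]
bands = pd.cut(values, bins, labels=["1", "2", "3", "4"])

print(bands.tolist())
```

Starting the first edge below zero makes sure that a value of 0 is assigned to a band instead of becoming NaN.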

bins = [-5,1000,2000,3000,float('inf')]

all_data['BsmtFinSF1_Band'] = pd.cut(all_data['BsmtFinSF1'], bins,labels=['1','2','3','4'])

all_data['BsmtFinSF1_Band'].unique()

all_data.drop('BsmtFinSF1',axis=1,inplace=True)

all_data = pd.get_dummies(all_data, columns = ["BsmtFinSF1_Band"], prefix="BsmtFinSF1")

all_data.head()


0	856	854	None	3	1Fam	3	1	Unf	...	1	1
1	1262	0	None	3	1Fam	3	4	Unf	...	0	1
2	920	866	None	3	1Fam	3	2	Unf	...	1	1
3	961	756	None	3	1Fam	4	1	Unf	...	0	1
4	1145	1053	None	4	1Fam	3	3	Unf	...	1	1

5 rows × 132 columns

BsmtFinType2

"""

BsmtFinType2: Rating of basement finished area (if multiple types)

       GLQ	Good Living Quarters

       ALQ	Average Living Quarters

       BLQ	Below Average Living Quarters

       Rec	Average Rec Room

       LwQ	Low Quality

       Unf	Unfinshed

       NA	No Basement

"""


plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)

sns.boxplot(x="BsmtFinType2", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);

plt.subplot(1, 3, 2)

sns.stripplot(x="BsmtFinType2", y="SalePrice", data=train, size = 5, jitter = True, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);

plt.subplot(1, 3, 3)

sns.barplot(x="BsmtFinType2", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]);

[Figure: box plot, strip plot, and bar plot of SalePrice by BsmtFinType2]

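For many houses the second basement area is unfinished, and their prices vary widely.
The quality of the second finished basement area does not show the ordered relationship with price seen earlier (third plot).
So this feature also needs to be converted into one-hot dummy variables.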

all_data = pd.get_dummies(all_data, columns = ["BsmtFinType2"], prefix="BsmtFinType2")  # the columns argument takes a list

all_data.head(3)

"""

columns : list-like, default None

Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.

"""


BsmtFinSF2

"""

BsmtFinSF2: Type 2 finished square feet

"""

grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25)

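# GridSpec specifies the geometry of the grid in which the subplots are placed. The number of rows and columns must be set; subplot layout parameters (e.g. left, right) can optionally be adjusted.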

plt.subplots(figsize=(30,15))

plt.subplot(grid[0,0])

g = sns.regplot(x=train['BsmtFinSF2'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['BsmtFinSF2'], train['SalePrice'])[0]))


g.legend(loc='best')

plt.subplot(grid[0,1:])

sns.boxplot(x='Neighborhood',y='BsmtFinSF2',data=train)

plt.subplot(grid[1,0])

sns.barplot(x='BldgType',y= 'BsmtFinSF2',data=train)

plt.subplot(grid[1,1])

sns.barplot(x='HouseStyle',y ='BsmtFinSF2',data=train)

plt.subplot(grid[1,2])

sns.barplot(x='LotShape',y='BsmtFinSF2',data=train)

[Figure: scatter plot of BsmtFinSF2 against SalePrice, and plots of BsmtFinSF2 by Neighborhood, BldgType, HouseStyle, and LotShape]

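The area of the second finished basement type shows no clear relationship with the sale price.
Most of the values correspond to an unfinished second area, and the feature is highly correlated with the previous one.
The feature can be converted into a flag for whether a second finished area exists (similar to replacing a missing-value fill with a missing / not-missing indicator).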
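As a sketch of this flag conversion, the snippet below builds the indicator in two equivalent ways on invented values (the Series `toy` is an assumption): an element-wise `map`, and a vectorized comparison that gives the same result.

```python
import pandas as pd

# Made-up areas; zero means there is no second finished area.
toy = pd.Series([0.0, 32.0, 0.0, 668.0], name="BsmtFinSF2")

# Element-wise version: 0 stays 0, anything else becomes 1.
flag_map = toy.map(lambda x: 0 if x == 0 else 1)

# Vectorized version of the same rule.
flag_vec = (toy != 0).astype(int)

print(flag_vec.tolist())
```

Both versions treat a NaN as "not zero", so missing values should be filled before building the flag.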

all_data['BsmtFinType2_None'].value_counts()

0    2835

1      80

Name: BsmtFinType2_None, dtype: int64

all_data['BsmtFinSf2_Flag'] = all_data['BsmtFinSF2'].map(lambda x:0 if x==0 else 1)

all_data.drop('BsmtFinSF2', axis=1, inplace=True)

all_data['BsmtFinSf2_Flag'].value_counts()

0    2568

1     347

Name: BsmtFinSf2_Flag, dtype: int64

BsmtUnfSF

"""

Unfinished square feet of basement area

"""

grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25)

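# GridSpec specifies the geometry of the grid in which the subplots are placed. The number of rows and columns must be set; subplot layout parameters (e.g. left, right) can optionally be adjusted.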

plt.subplots(figsize=(30,15))

plt.subplot(grid[0,0])

g = sns.regplot(x=train['BsmtUnfSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['BsmtUnfSF'], train['SalePrice'])[0]))


g.legend(loc='best')

plt.subplot(grid[0,1:])

sns.boxplot(x='Neighborhood',y='BsmtUnfSF',data=train)

plt.subplot(grid[1,0])

sns.barplot(x='BldgType',y= 'BsmtUnfSF',data=train)

plt.subplot(grid[1,1])

sns.barplot(x='HouseStyle',y ='BsmtUnfSF',data=train)

plt.subplot(grid[1,2])

sns.barplot(x='LotShape',y='BsmtUnfSF',data=train)

<matplotlib.axes._subplots.AxesSubplot at 0x118d8b940>


"""

This feature has a significant positive correlation with SalePrice, with a small proportion of data points having a value of 0.

This tells me that most houses will have some amount of square feet unfinished within the basement, and this actually positively contributes towards SalePrice.

The amount of unfinished square feet also varies widely based on location and style.

Whereas the average unfinished square feet within the basement is fairly consistent across the different lot shapes.

Since this is a continuous numeric feature with a significant correlation, I will bin this and create dummy variables.

Positively correlated with the sale price.

Unfinished basement area has little to do with lot shape.

This is a continuous variable, so bin it and then one-hot encode the bins.

"""

all_data['BsmtUnfSF_Band'] = pd.cut(all_data['BsmtUnfSF'], 3,labels=['1','2','3'])

all_data.drop('BsmtUnfSF',axis=1,inplace=True)

all_data['BsmtUnfSF_Band'].unique()

all_data['BsmtUnfSF_Band'] = all_data['BsmtUnfSF_Band'].astype(int)

all_data = pd.get_dummies(all_data, columns = ["BsmtUnfSF_Band"], prefix="BsmtUnfSF")

all_data.head()

0	856	854	None	3	1Fam	3	1	1.0	0.0	...	1	1
1	1262	0	None	3	1Fam	3	4	0.0	1.0	...	1	1
2	920	866	None	3	1Fam	3	2	1.0	0.0	...	1	1
3	961	756	None	3	1Fam	4	1	1.0	0.0	...	1	1
4	1145	1053	None	4	1Fam	3	3	1.0	0.0	...	1	1

5 rows × 140 columns
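The bin-then-encode pattern used here recurs for most continuous features below. As a small sketch of what the two steps do, on made-up values (the `toy` frame is an assumption for illustration only):

```python
import pandas as pd

# Illustrative values only, not taken from the competition data.
toy = pd.DataFrame({"BsmtUnfSF": [0, 150, 800, 1200, 2300]})

# pd.cut with an integer splits the observed range into equal-width intervals.
toy["BsmtUnfSF_Band"] = pd.cut(toy["BsmtUnfSF"], 3, labels=[1, 2, 3]).astype(int)

# get_dummies replaces the band column with one indicator column per band.
encoded = pd.get_dummies(toy, columns=["BsmtUnfSF_Band"], prefix="BsmtUnfSF")

print(toy["BsmtUnfSF_Band"].tolist())
print(list(encoded.columns))
```

Because the bin edges depend only on the minimum and maximum, a single extreme value can leave most rows in the first band; that is worth checking with `value_counts()` before encoding.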

TotalBsmtSF

"""

Total square feet of basement area.

"""

grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25)

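# Create the grid that defines where each subplot sits on the figure. The number of rows and columns is required; layout parameters (e.g. left, right, wspace, hspace) can optionally be adjusted.

plt.subplots(figsize=(30,15))

plt.subplot(grid[0,0])

g = sns.regplot(x=train['TotalBsmtSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %.2f"%(pearsonr(train['TotalBsmtSF'], train['SalePrice'])[0]))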

# g= sns.regplot(x=train['BsmtFinSF1'],y=train["SalePrice"],fit_reg==False,label= "Corr:%2f" %(pearsonr(train['BsmtFinType1'],train['SalePrice'])[0]))

g.legend(loc='best')

plt.subplot(grid[0,1:])

sns.boxplot(x='Neighborhood',y='TotalBsmtSF',data=train)

plt.subplot(grid[1,0])

sns.barplot(x='BldgType',y= 'TotalBsmtSF',data=train)

plt.subplot(grid[1,1])

sns.barplot(x='HouseStyle',y ='TotalBsmtSF',data=train)

plt.subplot(grid[1,2])

sns.barplot(x='LotShape',y='TotalBsmtSF',data=train)

<matplotlib.axes._subplots.AxesSubplot at 0x12d9b3d30>


def get_feature_corr(feature_name):

    grid = plt.GridSpec(2,3,wspace=0.15,hspace=0.25)

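    # Create the grid that defines where each subplot sits on the figure. The number of rows and columns is required; layout parameters (e.g. left, right, wspace, hspace) can optionally be adjusted.

    plt.subplots(figsize=(30,15))

    plt.subplot(grid[0,0])

    g = sns.regplot(x=train[feature_name], y=train['SalePrice'], fit_reg=False, label = "corr: %.2f"%(pearsonr(train[feature_name], train['SalePrice'])[0]))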

    # g= sns.regplot(x=train['BsmtFinSF1'],y=train["SalePrice"],fit_reg==False,label= "Corr:%2f" %(pearsonr(train['BsmtFinType1'],train['SalePrice'])[0]))

    g.legend(loc='best')

    plt.subplot(grid[0,1:])

    sns.boxplot(x='Neighborhood',y=feature_name,data=train)

    plt.subplot(grid[1,0])

    sns.barplot(x='BldgType',y= feature_name,data=train)

    plt.subplot(grid[1,1])

    sns.barplot(x='HouseStyle',y =feature_name,data=train)

    plt.subplot(grid[1,2])

    sns.barplot(x='LotShape',y=feature_name,data=train)

    plt.show()

1stFlrSF

get_feature_corr('1stFlrSF')

"""

First floor square feet.

"""

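First-floor area is strongly correlated with the sale price.
The distribution of first-floor area varies widely across neighborhoods.
Across house types, first-floor area varies little.
The feature is continuous, so bin it and then one-hot encode the bins.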

all_data['1stFlrSF_Band'] = pd.cut(all_data['1stFlrSF'], 6,labels=['1','2','3','4','5','6'])

all_data['1stFlrSF_Band'].unique()

all_data['1stFlrSF_Band'] = all_data['1stFlrSF_Band'].astype(int)

all_data.drop('1stFlrSF', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["1stFlrSF_Band"], prefix="1stFlrSF")

all_data.head(3)

0	854	None	3	1Fam	3	1	1.0	0.0	3	...	1	1	0
1	0	None	3	1Fam	3	4	0.0	1.0	3	...	1	0	1
2	866	None	3	1Fam	3	2	1.0	0.0	3	...	1	1	0

3 rows × 145 columns

2ndFlrSF

get_feature_corr('2ndFlrSF')

"""

Second floor square feet.

"""

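Many houses have no second floor, so their second-floor area is 0.
Second-floor area varies widely across neighborhoods.
Second-floor area also varies widely across house types.
This is a continuous variable: bin it, then one-hot encode the bins.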

all_data['2ndFlrSF_Band'] = pd.cut(all_data['2ndFlrSF'], 6,labels=list('123456'))

all_data['2ndFlrSF_Band'].unique()

all_data=pd.get_dummies(all_data,columns=['2ndFlrSF_Band'],prefix="2ndFlrSF")

all_data.drop('2ndFlrSF', axis=1, inplace=True)

all_data.head()

0	None	3	1Fam	3	1	1.0	0.0	3	Y	...	0	1	0
1	None	3	1Fam	3	4	0.0	1.0	3	Y	...	1	0	0
2	None	3	1Fam	3	2	1.0	0.0	3	Y	...	0	1	0
3	None	3	1Fam	4	1	1.0	0.0	2	Y	...	0	1	0
4	None	4	1Fam	3	3	1.0	0.0	3	Y	...	0	0	1

5 rows × 150 columns

LowQualFinSF

get_feature_corr('LowQualFinSF')

'''

Low quality finished square feet (all floors)

'''

This feature can be reduced to a 0/1 flag.

all_data['LowQualFinSF_Flag'] = all_data['LowQualFinSF'].map(lambda x:0 if x==0 else 1)

all_data.drop('LowQualFinSF', axis=1, inplace=True)

BsmtHalfBath BsmtFullBath HalfBath FullBath

all_data['TotalBathrooms'] = all_data['BsmtHalfBath'] + all_data['BsmtFullBath'] + all_data['HalfBath'] + all_data['FullBath']

columns = ['BsmtHalfBath', 'BsmtFullBath', 'HalfBath', 'FullBath']

all_data.drop(columns, axis=1, inplace=True)

def get_feature_corr1(feature_name,order=None):

    plt.subplots(figsize =(20, 5))

    plt.subplot(1, 3, 1)

    sns.boxplot(x=feature_name, y="SalePrice", data=train,order=order)

    plt.subplot(1, 3, 2)

    sns.stripplot(x=feature_name, y="SalePrice", data=train, size = 5, jitter = True ,order=order);

    plt.subplot(1, 3, 3)

    sns.barplot(x=feature_name, y="SalePrice", data=train,order=order)

    plt.show()

get_feature_corr1('BedroomAbvGr',order=None)

"""

Bedrooms above grade (does not include basement bedrooms)

"""


get_feature_corr1('KitchenAbvGr',order=None)


get_feature_corr1('KitchenQual',order=['Fa','TA','Gd','Ex'])

print("""

This feature should be encoded as an ordered category

""")

This feature should be encoded as an ordered category

all_data['KitchenQual'] = all_data['KitchenQual'].map({"Fa":1, "TA":2, "Gd":3, "Ex":4})

all_data['KitchenQual'].unique()

array([3, 2, 4, 1])
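An alternative to a hand-written dictionary is an ordered categorical, which keeps the ranking in one list. This is only a sketch on made-up values; the `quality_order` list is my assumption about the ranking (worst to best) and mirrors the four levels mapped above:

```python
import pandas as pd

quality_order = ["Fa", "TA", "Gd", "Ex"]  # assumed ranking, worst to best
toy = pd.Series(["Gd", "TA", "Ex", "Fa"], name="KitchenQual")

# An ordered categorical stores each value as an integer code starting at 0.
ordered = pd.Categorical(toy, categories=quality_order, ordered=True)
codes = pd.Series(ordered.codes, name="KitchenQual") + 1  # shift to 1..4

print(codes.tolist())
```

A value that is not in `quality_order` gets code -1 (0 after the shift), whereas `.map` with a dictionary would produce NaN, so either way unexpected levels should be checked for explicitly.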

TotRmsAbvGrd

get_feature_corr1('TotRmsAbvGrd')


Fireplaces

get_feature_corr1('Fireplaces')


FireplaceQu

get_feature_corr1('FireplaceQu',order=['Po','Fa','TA','Gd','Ex'])


all_data['FireplaceQu'] = all_data['FireplaceQu'].map({"None":0, "Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5})

all_data['FireplaceQu'].unique()

array([0, 3, 4, 2, 5, 1])

GrLivArea

get_feature_corr('GrLivArea')

The feature is continuous and very strongly correlated with the sale price.
Bin it and then convert the bins to one-hot features.

all_data['GrLivArea_Band'] = pd.cut(all_data['GrLivArea'], 6,labels=list('123456'))

all_data['GrLivArea_Band'].unique()

all_data['GrLivArea_Band'] = all_data['GrLivArea_Band'].astype(int)

all_data.drop('GrLivArea',axis=1,inplace=True)

all_data = pd.get_dummies(all_data, columns = ["GrLivArea_Band"], prefix="GrLivArea")

all_data.head(3)

0	None	3	1Fam	3	1	3	Y	Norm	Norm	...	4.0	1
1	None	3	1Fam	3	4	3	Y	Feedr	Norm	...	3.0	1
2	None	3	1Fam	3	2	3	Y	Norm	Norm	...	4.0	1

3 rows × 152 columns

MSSubClass

get_feature_corr1('MSSubClass')


all_data['MSSubClass'] = all_data['MSSubClass'].astype(str)

all_data = pd.get_dummies(all_data, columns = ["MSSubClass"], prefix="MSSubClass")

all_data.head(3)


0	None	3	1Fam	3	1	3	Y	Norm	Norm	...	1
1	None	3	1Fam	3	4	3	Y	Feedr	Norm	...	0
2	None	3	1Fam	3	2	3	Y	Norm	Norm	...	1

3 rows × 167 columns

BldgType

get_feature_corr1('BldgType')


all_data['BldgType'] = all_data['BldgType'].astype(str)

all_data = pd.get_dummies(all_data, columns = ["BldgType"], prefix="BldgType")

all_data.head(3)


0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1

3 rows × 171 columns

HouseStyle

get_feature_corr1('HouseStyle')


all_data['HouseStyle'] = all_data['HouseStyle'].map({"2Story":"2Story", "1Story":"1Story", "1.5Fin":"1.5Story", "1.5Unf":"1.5Story",

                                                     "SFoyer":"SFoyer", "SLvl":"SLvl", "2.5Unf":"2.5Story", "2.5Fin":"2.5Story"})

all_data = pd.get_dummies(all_data, columns = ["HouseStyle"], prefix="HouseStyle")

all_data.head(3)


0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	0	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1	0
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	0	1

3 rows × 176 columns

OverallQual

get_feature_corr1('OverallQual')


OverallCond

get_feature_corr1('OverallCond')


YearRemodAdd

get_feature_corr1('YearRemodAdd')


train['Remod_Diff'] = train['YearRemodAdd'] - train['YearBuilt']

plt.subplots(figsize =(40, 10))

sns.barplot(x="Remod_Diff", y="SalePrice", data=train);


all_data['Remod_Diff'] = all_data['YearRemodAdd'] - all_data['YearBuilt']

all_data.drop('YearRemodAdd', axis=1, inplace=True)

YearBuilt

get_feature_corr1('YearBuilt')


all_data['YearBuilt_Band'] = pd.cut(all_data['YearBuilt'], 7,labels=list('1234567'))

all_data['YearBuilt_Band'].unique()

all_data['YearBuilt_Band'] = all_data['YearBuilt_Band'].astype(int)

all_data.drop('YearBuilt',axis=1,inplace=True)

all_data = pd.get_dummies(all_data, columns = ["YearBuilt_Band"], prefix="YearBuilt")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	0	0	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	0	1	0
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1	0	1

3 rows × 182 columns

Foundation

get_feature_corr1('Foundation')


all_data = pd.get_dummies(all_data, columns = ["Foundation"], prefix="Foundation")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	0	1	0	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1	0	1	0
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	0	1	0	1

3 rows × 187 columns

Functional

get_feature_corr1('Functional')


all_data['Functional'] = all_data['Functional'].map({"Sev":1, "Maj2":2, "Maj1":3, "Mod":4, "Min2":5, "Min1":6, "Typ":7})

all_data['Functional'].unique()

array([7, 6, 3, 5, 4, 2, 1])

RoofStyle

get_feature_corr1('RoofStyle')


all_data = pd.get_dummies(all_data, columns = ["RoofStyle"], prefix="RoofStyle")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	0	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1	1

3 rows × 192 columns

RoofMatl

"""

Roof material.

"""

get_feature_corr1('RoofMatl')


all_data = pd.get_dummies(all_data, columns = ["RoofMatl"], prefix="RoofMatl")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1

3 rows × 198 columns

Exterior1st & Exterior2nd

get_feature_corr1('Exterior1st')


get_feature_corr1('Exterior2nd')


def Exter2(col):

    if col['Exterior2nd'] == col['Exterior1st']:

        return 1

    else:

        return 0

all_data['ExteriorMatch_Flag'] = all_data.apply(Exter2, axis=1)

all_data.drop('Exterior2nd', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["Exterior1st"], prefix="Exterior1st")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	0	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1	0
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	0	1

3 rows × 212 columns
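The row-wise `apply` used for the match flag can be written as a single column comparison. A minimal sketch on a toy frame (the three rows are made up for illustration):

```python
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame({
    "Exterior1st": ["VinylSd", "MetalSd", "Wd Sdng"],
    "Exterior2nd": ["VinylSd", "MetalSd", "Wd Shng"],
})

# Element-wise comparison gives a boolean Series; cast it to 0/1.
toy["ExteriorMatch_Flag"] = (toy["Exterior1st"] == toy["Exterior2nd"]).astype(int)

print(toy["ExteriorMatch_Flag"].tolist())
```

The same pattern would apply to the other row-wise flag helpers in this notebook, such as the Condition1/Condition2 comparison.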

MasVnrType

get_feature_corr1('MasVnrType')


all_data = pd.get_dummies(all_data, columns = ["MasVnrType"], prefix="MasVnrType")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1	1	0
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	0	0	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1	1	0

3 rows × 215 columns

MasVnrArea

get_feature_corr('MasVnrArea')

This feature carries little signal: none of the views shows a strong relationship with it, and the values vary widely with no clear pattern.

all_data.drop('MasVnrArea', axis=1, inplace=True)

ExterQual

get_feature_corr1('ExterQual',order=['Fa','TA','Gd', 'Ex'])


all_data['ExterQual'] = all_data['ExterQual'].map({"Fa":1, "TA":2, "Gd":3, "Ex":4})

all_data['ExterQual'].unique()

array([3, 2, 4, 1])

ExterCond

"""

Evaluates the present condition of the material on the exterior.

"""


get_feature_corr1('ExterCond',order=['Po','Fa',"TA",'Gd','Ex'])


all_data = pd.get_dummies(all_data, columns = ["ExterCond"], prefix="ExterCond")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1	0	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	0	1	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1	0	1

3 rows × 218 columns

GarageType

"""

location of the Garage

"""

get_feature_corr1('GarageType')

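Looking at this feature, the garage types appear to have a natural ranking, but the sale price does not follow that ranking, so a plain one-hot encoding is sufficient.
Houses with a built-in garage have the highest average sale price.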

all_data = pd.get_dummies(all_data, columns = ["GarageType"], prefix="GarageType")

all_data.head(3)


0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1	1

3 rows × 224 columns

GarageYrBlt

"""

Year Garage was built

"""

get_feature_corr1('GarageYrBlt')

The more recently the garage was built, the higher the sale price tends to be.

plt.subplots(figsize =(50, 10))

sns.boxplot(x="GarageYrBlt", y="SalePrice", data=train);


plt.subplots(figsize =(50, 10))

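sns.violinplot(x = 'GarageYrBlt', y = 'SalePrice', data = train,

               linewidth = 2, # line width

               width = 0.8,   # width of each violin relative to its slot

               palette = 'hls', # color palette

#                order = {'Thur', 'Fri', 'Sat','Sun'}, # restrict the categories shown

#                scale = 'count',  # how violin width is scaled: area - equal area, count - by sample count, width - equal width

               gridsize = 50, # number of points in the density grid; higher is smoother

               inner = 'box', # what to draw inside each violin --> 'box','quartile','point','stick',None

               #bw = 0.8      # controls how closely the density estimate follows the data; usually left unset

               )

### A seaborn plot type I recently learned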

<matplotlib.axes._subplots.AxesSubplot at 0x12e2cec50>


train['GarageYrBlt'].value_counts()

sns.distplot(train['GarageYrBlt'].dropna(), kde=True, bins=5, rug=True)

<matplotlib.axes._subplots.AxesSubplot at 0x12945c940>


all_data['GarageYrBlt_Band']  = pd.qcut(all_data['GarageYrBlt'],3,labels=list('123'))

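# qcut picks bin edges from the quantiles of the values, so each bin holds roughly the same number of samples

# cut picks bin edges from the value range itself, so each bin has the same width

all_data['GarageYrBlt_Band'] = all_data['GarageYrBlt_Band'].astype(int)

all_data.drop(['GarageYrBlt'],axis=1,inplace=True)

all_data = pd.get_dummies(all_data, columns = ["GarageYrBlt_Band"], prefix="GarageYrBlt")  # get_dummies removes the encoded band column by default, so it does not need to be dropped separately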

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1	0	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1	1	0
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1	0	1

3 rows × 226 columns
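The difference between `cut` and `qcut` is easiest to see on data with one extreme value. A small sketch with made-up numbers:

```python
import pandas as pd

# Eight small values and one outlier (illustrative only).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Equal-width bins: the outlier stretches the range, so almost everything lands in bin 1.
equal_width = pd.cut(values, 3, labels=[1, 2, 3]).astype(int)

# Equal-frequency bins: edges come from quantiles, so each bin gets three values.
equal_freq = pd.qcut(values, 3, labels=[1, 2, 3]).astype(int)

print(equal_width.tolist())
print(equal_freq.tolist())
```

For a heavily skewed feature, equal-width bands mostly encode "outlier or not", while quantile bands spread the rows evenly at the cost of very unequal interval widths.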

GarageFinish

get_feature_corr1('GarageFinish')


all_data = pd.get_dummies(all_data, columns = ["GarageFinish"], prefix="GarageFinish")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	0	1	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1	0	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	0	1	1

3 rows × 229 columns

GarageCars

"""

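Size of the garage in car capacity

Already numeric, so no further processing is needed. Houses with 3-car garages sell for the most; houses with 4-car garages are rarely sold (5 samples)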

"""

get_feature_corr1('GarageCars')


GarageArea

get_feature_corr('GarageArea')


all_data['GarageArea_Band']  = pd.cut(all_data['GarageArea'],3,labels=list('123'))

all_data['GarageArea_Band'] =all_data['GarageArea_Band'].astype('int')

all_data.drop(['GarageArea'],axis=1,inplace=True)

all_data = pd.get_dummies(all_data, columns = ["GarageArea_Band"], prefix="GarageArea")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	0	1	1	0	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1	0	1	1	0
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	0	1	1	0	1

3 rows × 231 columns

GarageQual

"""

Garage  quality

"""

get_feature_corr1('GarageQual',order=['Po','Fa','TA','Gd','Ex'])

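Houses rated "TA" include the higher sale prices and account for most of the samples, while the ratings at either end are sparse and scattered, so the levels on each side can be merged.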

all_data['GarageQual'] = all_data['GarageQual'].map({"None":"None", "Po":"Low", "Fa":"Low", "TA":"TA", "Gd":"High", "Ex":"High"})

all_data['GarageQual'].unique()

array(['TA', 'Low', 'High', 'None'], dtype=object)

all_data = pd.get_dummies(all_data, columns = ["GarageQual"], prefix="GarageQual")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1	0	1	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1	1	0	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1	0	1	1

3 rows × 234 columns

GarageCond

"""

Garage condition.

"""

get_feature_corr1('GarageCond',order=['Po','Fa','TA','Gd','Ex'])

This feature is handled the same way as garage quality.

all_data['GarageCond']= all_data['GarageCond'].map({"None":'None',"Po":'Low','Fa':'Low','TA':'TA','Gd':'High','Ex':'High'})

all_data['GarageCond'].unique()

array(['TA', 'Low', 'None', 'High'], dtype=object)

all_data = pd.get_dummies(all_data, columns = ["GarageCond"], prefix="GarageCond")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1	1	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	0	1	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1	1	1

3 rows × 237 columns

WoodDeckSF

"""

Wood deck area in SF.

"""

get_feature_corr('WoodDeckSF')

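High correlation with SalePrice.
There are many zero values, so create a separate flag that records whether the house has a wood deck at all.
For the non-zero values, bin the area and then convert the bins to one-hot features.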

def WoodDeckFlag(col):

    if col['WoodDeckSF'] == 0:

        return 1

    else:

        return 0

all_data['NoWoodDeck_Flag'] = all_data.apply(WoodDeckFlag, axis=1)  # new feature

all_data['WoodDeckSF_Band'] = pd.cut(all_data['WoodDeckSF'], 4,labels=list('1234'))  ## bin

all_data['WoodDeckSF_Band'] = all_data['WoodDeckSF_Band'].astype(int)

all_data.drop('WoodDeckSF', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["WoodDeckSF_Band"], prefix="WoodDeckSF")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1	1	1	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1	1	0	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1	1	1	1

3 rows × 241 columns
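The code above bins every row, zeros included, so the first band mixes "no deck" with "small deck". If the intent is to bin only the non-zero areas, one possible variant is sketched below on made-up values; reserving band 0 for "no deck" is my assumption, not something the original notebook does:

```python
import pandas as pd

# Illustrative values only.
toy = pd.DataFrame({"WoodDeckSF": [0, 0, 100, 250, 400, 0, 800]})

toy["NoWoodDeck_Flag"] = (toy["WoodDeckSF"] == 0).astype(int)

# Bin only the non-zero areas; rows without a deck keep band 0 (assumed convention).
nonzero = toy["WoodDeckSF"] > 0
toy["WoodDeckSF_Band"] = 0
toy.loc[nonzero, "WoodDeckSF_Band"] = pd.cut(
    toy.loc[nonzero, "WoodDeckSF"], 4, labels=[1, 2, 3, 4]
).astype(int)

print(toy["WoodDeckSF_Band"].tolist())
```

With this variant the band-0 indicator duplicates the flag after one-hot encoding, so one of the two would normally be dropped.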

TotalPorchSF

"""

OpenPorchSF, EnclosedPorch, 3SsnPorch & ScreenPorch

I will sum these features together to create a total porch in square feet feature.

"""

all_data['TotalPorchSF'] = all_data['OpenPorchSF'] + all_data['EnclosedPorch'] + all_data['3SsnPorch'] + all_data['ScreenPorch']

train['TotalPorchSF'] = train['OpenPorchSF'] + train['EnclosedPorch'] + train['3SsnPorch'] + train['ScreenPorch']

get_feature_corr('TotalPorchSF')


def PorchFlag(col):

    if col['TotalPorchSF'] == 0:

        return 1

    else:

        return 0

all_data['NoPorch_Flag'] = all_data.apply(PorchFlag, axis=1)

all_data['TotalPorchSF_Band'] = pd.cut(all_data['TotalPorchSF'], 4,labels=list('1234'))

all_data['TotalPorchSF_Band'].unique()

all_data['TotalPorchSF_Band'] = all_data['TotalPorchSF_Band'].astype(int)

all_data.drop('TotalPorchSF', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["TotalPorchSF_Band"], prefix="TotalPorchSF")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1	1	0	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	0	1	1	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1	1	0	1

3 rows × 246 columns

PoolArea

"""

PoolArea Pool area in square feet.

"""

get_feature_corr('PoolArea')


def PoolFlag(col):

    if col['PoolArea'] == 0:

        return 0

    else:

        return 1

all_data['HasPool_Flag'] = all_data.apply(PoolFlag, axis=1)

all_data.drop('PoolArea', axis=1, inplace=True)

PoolQC

"""

Pool quality.

"""

get_feature_corr1('PoolQC',order=['Fa','Gd','Ex'])

all_data['PoolQC'].value_counts()  # only 8 rows have a pool at all, so the quality rating carries little information

None    2907

Gd         3

Ex         3

Fa         2

Name: PoolQC, dtype: int64

all_data.drop('PoolQC', axis=1, inplace=True)
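The "almost every row has the same value" check can be made explicit with normalized counts. A rough sketch on made-up data; the 0.95 threshold and the `near_constant` name are hypothetical choices for illustration, not values from the notebook:

```python
import pandas as pd

# 97 houses without a pool and 3 with one (illustrative only).
pool_qc = pd.Series(["None"] * 97 + ["Gd", "Ex", "Fa"])

# Share of rows per level, largest first.
share = pool_qc.value_counts(normalize=True)
dominant_level = share.index[0]
dominant_share = share.iloc[0]

# Hypothetical rule of thumb: treat the feature as near-constant above 95%.
near_constant = dominant_share > 0.95

print(dominant_level, dominant_share, near_constant)
```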

Fence

'''

Fence: Fence quality

       GdPrv	Good Privacy

       MnPrv	Minimum Privacy

       GdWo	Good Wood

       MnWw	Minimum Wood/Wire

       NA	No Fence

'''

get_feature_corr1('Fence',order=['MnWw','GdWo','MnPrv','GdPrv'])


all_data = pd.get_dummies(all_data, columns = ["Fence"], prefix="Fence")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1	1

3 rows × 249 columns

MSZoning

"""

MSZoning: Identifies the general zoning classification of the sale.

       A	Agriculture

       C	Commercial

       FV	Floating Village Residential

       I	Industrial

       RH	Residential High Density

       RL	Residential Low Density

       RP	Residential Low Density Park

       RM	Residential Medium Density

"""

get_feature_corr1('MSZoning')

all_data['MSZoning'].value_counts()


RL         2265

RM          460

FV          139

RH           26

C (all)      25

Name: MSZoning, dtype: int64

all_data = pd.get_dummies(all_data, columns = ["MSZoning"], prefix="MSZoning")

all_data.head(3)

0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	1	1
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	1	1

3 rows × 253 columns

Neighborhood

"""

This feature has many levels, and SalePrice varies widely across them,

so we simply one-hot encode it

"""

get_feature_corr1('Neighborhood')

all_data = pd.get_dummies(all_data, columns = ["Neighborhood"], prefix="Neighborhood")

all_data.head(3)


0	None	3	3	1	3	Y	Norm	Norm	SBrkr	...	0
1	None	3	3	4	3	Y	Feedr	Norm	SBrkr	...	1
2	None	3	3	2	3	Y	Norm	Norm	SBrkr	...	0

3 rows × 277 columns

Condition1 & Condition2

print('condition1')

get_feature_corr1('Condition1')

print('condition2')

get_feature_corr1('Condition2')

condition1


condition2


'''

Condition1: Proximity to various conditions

       Artery	Adjacent to arterial street

       Feedr	Adjacent to feeder street

       Norm	Normal

       RRNn	Within 200' of North-South Railroad

       RRAn	Adjacent to North-South Railroad

       PosN	Near positive off-site feature--park, greenbelt, etc.

       PosA	Adjacent to positive off-site feature

       RRNe	Within 200' of East-West Railroad

       RRAe	Adjacent to East-West Railroad

'''

all_data['Condition1'] = all_data['Condition1'].map({"Norm":"Norm", "Feedr":"Street", "PosN":"Pos", "Artery":"Street", "RRAe":"Train",

                                                    "RRNn":"Train", "RRAn":"Train", "PosA":"Pos", "RRNe":"Train"})

all_data['Condition2'] = all_data['Condition2'].map({"Norm":"Norm", "Feedr":"Street", "PosN":"Pos", "Artery":"Street", "RRAe":"Train",

                                                    "RRNn":"Train", "RRAn":"Train", "PosA":"Pos", "RRNe":"Train"})

def ConditionMatch(col):

    if col['Condition1'] == col['Condition2']:

        return 0

    else:

        return 1

all_data['Diff2ndCondition_Flag'] = all_data.apply(ConditionMatch, axis=1)

all_data.drop('Condition2', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["Condition1"], prefix="Condition1")

all_data.head(3)

0	None	3	3	1	3	Y	SBrkr	3	...	0	0	1	0
1	None	3	3	4	3	Y	SBrkr	2	...	1	1	0	1
2	None	3	3	2	3	Y	SBrkr	3	...	0	0	1	0

3 rows × 280 columns

LotFrontage

"""

Linear feet of street connected to property.

"""

get_feature_corr('LotFrontage')

This feature has no clear correlation with SalePrice, so dropping it is worth considering.

LotArea

'''

Lot size in square feet.

'''

get_feature_corr('LotArea')

This feature is clearly correlated with SalePrice, and its distribution is positively skewed (the peak sits to the left with a long right tail).

all_data['LotArea_Band'] = pd.qcut(all_data['LotArea'], 8,labels=list('12345678'))  # use qcut to bin features whose distribution is uneven

all_data['LotArea_Band'].unique()

all_data['LotArea_Band'] = all_data['LotArea_Band'].astype(int)

all_data.drop('LotArea', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["LotArea_Band"], prefix="LotArea")

all_data.head(3)

0	None	3	3	1	3	Y	SBrkr	3	...	0	1	0	0
1	None	3	3	4	3	Y	SBrkr	2	...	1	0	1	0
2	None	3	3	2	3	Y	SBrkr	3	...	0	0	0	1

3 rows × 287 columns

LotShape

"""

LotShape: General shape of property

       Reg	Regular

       IR1	Slightly irregular

       IR2	Moderately Irregular

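       IR3	Irregular

This feature clearly affects the sale price: overseas, a large lot alone is not enough, the lot also needs a sensible shape, otherwise it is hard to sell at a high price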

"""

get_feature_corr1('LotShape')


all_data = pd.get_dummies(all_data, columns = ["LotShape"], prefix="LotShape")

all_data.head(3)

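print("Lot shapes are concentrated in the Reg and IR1 levels, and SalePrice varies widely within each level")

Lot shapes are concentrated in the Reg and IR1 levels, and SalePrice varies widely within each level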

LandContour

"""

LandContour: Flatness of the property

       Lvl	Near Flat/Level

       Bnk	Banked - Quick and significant rise from street grade to building

       HLS	Hillside - Significant slope from side to side

       Low	Depression

"""

get_feature_corr1('LandContour')

all_data = pd.get_dummies(all_data, columns = ["LandContour"], prefix="LandContour")

all_data.head(3)


0	None	3	3	1	3	Y	SBrkr	3	...	0	1	1
1	None	3	3	4	3	Y	SBrkr	2	...	0	1	1
2	None	3	3	2	3	Y	SBrkr	3	...	1	0	1

3 rows × 293 columns

LotConfig

"""

LotConfig: Lot configuration

       Inside	Inside lot

       Corner	Corner lot

       CulDSac	Cul-de-sac (dead-end street)

       FR2	Frontage on 2 sides of property

       FR3	Frontage on 3 sides of property

Describes how the lot sits relative to its surroundings

"""

get_feature_corr1('LotConfig')

all_data['LotConfig'] = all_data['LotConfig'].map({"Inside":"Inside", "FR2":"FR", "Corner":"Corner", "CulDSac":"CulDSac", "FR3":"FR"})

all_data = pd.get_dummies(all_data, columns = ["LotConfig"], prefix="LotConfig")

all_data.head(3)


0	None	3	3	1	3	Y	SBrkr	3	...	1	1	0	1
1	None	3	3	4	3	Y	SBrkr	2	...	1	1	1	0
2	None	3	3	2	3	Y	SBrkr	3	...	0	1	0	1

3 rows × 296 columns

LandSlope

"""

LandSlope: Slope of property

       Gtl	Gentle slope

       Mod	Moderate Slope

       Sev	Severe Slope

"""

get_feature_corr1('LandSlope')


all_data['LandSlope'] = all_data['LandSlope'].map({"Gtl":1, "Mod":0, "Sev":0})

'''

SalePrice for Mod and Sev falls in the same range, so the two levels can be merged

'''

all_data['LandSlope'].value_counts()

1    2774

0     141

Name: LandSlope, dtype: int64

Street

get_feature_corr1('Street')

Prices vary widely within Pave and there are very few Grvl samples, so this feature carries little information and is dropped.

all_data.drop('Street', axis=1, inplace=True)

Alley

get_feature_corr1('Alley')


all_data['Alley'].value_counts()

None    2717

Grvl     120

Pave      78

Name: Alley, dtype: int64

all_data = pd.get_dummies(all_data, columns = ["Alley"], prefix="Alley")

all_data.head(3)

0	3	3	1	3	Y	SBrkr	3	0	...	1	0	1	1
1	3	3	4	3	Y	SBrkr	2	3	...	1	1	0	1
2	3	3	2	3	Y	SBrkr	3	3	...	1	0	1	1

3 rows × 297 columns

PavedDrive

"""

PavedDrive: Paved driveway

       Y	Paved

       P	Partial Pavement

       N	Dirt/Gravel

Prices differ considerably between levels and there is no clear ordering, so one-hot encode this feature

"""

get_feature_corr1('PavedDrive')


all_data=pd.get_dummies(all_data,columns=['PavedDrive'],prefix='PavedDrive')

all_data.head()

0	3	3	1	3	Y	SBrkr	0	3	0	...	0	0	1	1	1
1	3	3	4	3	Y	SBrkr	0	2	3	...	0	1	0	1	1
2	3	3	2	3	Y	SBrkr	0	3	3	...	0	0	1	1	1
3	3	4	1	2	Y	SBrkr	272	2	4	...	1	0	0	1	1
4	4	3	3	3	Y	SBrkr	0	3	3	...	0	1	0	1	1

5 rows × 299 columns

Heating

get_feature_corr1('Heating')


"""

Almost all samples use GasA and the other levels are tiny, so this could be reduced to GasA versus all other heating types

"""

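# Note: as written, the 0/1 mapping overwrites 'Heating' and the next line drops that column, so the flag never reaches the model. Writing the mapping to a separate column would keep it.

all_data['Heating']  = all_data['Heating'].map({'GasA':1,'GasW':0,'Grav':0,'Wall':0,'OthW':0,'Floor':0})

all_data.drop('Heating', axis=1, inplace=True)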

all_data.head(3)

0	3	3	1	3	Y	SBrkr	3	0	...	0	1	1	1
1	3	3	4	3	Y	SBrkr	2	3	...	1	0	1	1
2	3	3	2	3	Y	SBrkr	3	3	...	0	1	1	1

3 rows × 298 columns

HeatingQC

"""

Heating quality and condition.

"""

get_feature_corr1('HeatingQC',order=['Po','Fa','TA','Gd','Ex'])


all_data['HeatingQC'] = all_data['HeatingQC'].map({"Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5})

all_data['HeatingQC'].unique()

array([5, 4, 3, 2, 1])

CentralAir

"""

Central air conditioning.

"""

get_feature_corr1('CentralAir')


all_data['CentralAir'] = all_data['CentralAir'].map({"Y":1,"N":0})

Electrical

"""

Electrical system.

"""

get_feature_corr1('Electrical')


all_data['Electrical'] = all_data['Electrical'].map({'SBrkr':'SBrkr','FuseF':'Fuse','FuseA':'Fuse','FuseP':'Fuse','Mix':'Mix'})

all_data = pd.get_dummies(all_data, columns = ["Electrical"], prefix="Electrical")

all_data.head(3)

0	3	3	1	3	1	3	0	0	...	1	1	1	1
1	3	3	4	3	1	2	3	1	...	0	1	1	1
2	3	3	2	3	1	3	3	1	...	1	1	1	1

3 rows × 300 columns

all_data['MiscFeature'].value_counts()  #

None    2810

Shed      95

Gar2       5

Othr       4

TenC       1

Name: MiscFeature, dtype: int64

get_feature_corr1('MiscFeature')

'''

Too few informative samples, so drop this feature

'''

get_feature_corr1('MiscVal')


all_data['MiscVal'].value_counts()

"""

Too few informative samples, so drop this feature

"""

all_data.drop(['MiscVal','MiscFeature'],axis=1,inplace=True)

MoSold and YrSold

"""

month sold,Year Sold

"""

get_feature_corr1('MoSold')


get_feature_corr1('YrSold')


all_data = pd.get_dummies(all_data, columns = ["MoSold"], prefix="MoSold")

all_data = pd.get_dummies(all_data,columns=['YrSold'],prefix='YrSold')

all_data.head(3)

0	3	3	1	3	1	3	0	0	...	0	0	1
1	3	3	4	3	1	2	3	1	...	0	1	0
2	3	3	2	3	1	3	3	1	...	1	0	1

3 rows × 313 columns

SaleType

"""

SaleType: Type of sale

       WD 	Warranty Deed - Conventional

       CWD	Warranty Deed - Cash

       VWD	Warranty Deed - VA Loan

       New	Home just constructed and sold

       COD	Court Officer Deed/Estate

       Con	Contract 15% Down payment regular terms

       ConLw	Contract Low Down payment and low interest

       ConLI	Contract Low Interest

       ConLD	Contract Low Down

       Oth	Other

"""

get_feature_corr1('SaleType')


all_data['SaleType'] = all_data['SaleType'].map({'WD':"WD",'New':"New",'COD':"COD",'CWD':'Oth','ConLD':'Oth','ConLI':'Oth',

                                                "ConLw":'Oth','Con':'Oth','Oth':'Oth'})

all_data=  pd.get_dummies(all_data,columns=['SaleType'],prefix='SaleType')

all_data.head()

0	3	3	1	3	1	0	3	0	0	...	0	0	0	1	1
1	3	3	4	3	1	0	2	3	1	...	0	0	1	0	1
2	3	3	2	3	1	0	3	3	1	...	0	0	0	1	1
3	3	4	1	2	1	272	2	4	1	...	0	1	0	0	1
4	4	3	3	3	1	0	3	3	1	...	1	0	0	1	1

5 rows × 316 columns
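One thing to watch with dictionary-based grouping like this: `Series.map` returns NaN for any value that has no key in the dictionary, and `get_dummies` then silently leaves those rows with all-zero indicators. A small sketch of a guard, on made-up values (the `unmapped` check is my addition, not part of the original notebook):

```python
import pandas as pd

# Illustrative values only.
sale_type = pd.Series(["WD", "New", "ConLw", "COD"])

# Keys are case-sensitive: "ConLW" would not match the value "ConLw".
mapping = {"WD": "WD", "New": "New", "COD": "COD", "ConLw": "Oth"}
grouped = sale_type.map(mapping)

# Guard: list the original values that ended up as NaN after mapping.
unmapped = sorted(set(sale_type[grouped.isna()]))

# The same data with a mis-cased key loses one row to NaN.
bad = sale_type.map({"WD": "WD", "New": "New", "COD": "COD", "ConLW": "Oth"})

print(grouped.tolist(), unmapped, int(bad.isna().sum()))
```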

SaleCondition

"""

Condition of sale.

"""

get_feature_corr1('SaleCondition')


all_data = pd.get_dummies(all_data, columns = ["SaleCondition"], prefix="SaleCondition")

all_data.head(3)

0	3	3	1	3	1	3	0	0	...	1	1
1	3	3	4	3	1	2	3	1	...	1	1
2	3	3	2	3	1	3	3	1	...	1	1

3 rows × 321 columns

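Target transformation

Unlike classification, regression fits a continuous target.
It is usually worth analyzing the distribution of the target first. Many machine learning algorithms tend to fit approximately normal data better, so if the target is skewed it should be transformed toward a normal distribution.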

from scipy.stats import skew, norm

plt.subplots(figsize=(15,12))

g = sns.distplot(train['SalePrice'],fit=norm,label="Skewness:%.2f" % (train['SalePrice'].skew()))

g.legend(loc='best')

The target is positively skewed; numpy's log1p, i.e. log(1 + x), can be used to transform it

train["SalePrice"] = np.log1p(train["SalePrice"])

y_train = train["SalePrice"]
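A quick sketch of why `log1p` helps and how to undo it at prediction time. The prices below are made up for illustration, and `skewness` is a small helper written for this example rather than a library function:

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment (population form)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Made-up right-skewed prices: most are moderate, a few are very high.
prices = np.array([90_000, 120_000, 135_000, 150_000, 160_000,
                   180_000, 210_000, 260_000, 400_000, 755_000], dtype=float)

log_prices = np.log1p(prices)     # log(1 + x), the transform applied to SalePrice
restored = np.expm1(log_prices)   # exp(x) - 1, the inverse to apply to predictions

print("skew before: %.2f" % skewness(prices))
print("skew after:  %.2f" % skewness(log_prices))
```

Because the models are trained on the transformed target, their predictions live on the log scale and have to go through `np.expm1` before they are written to a submission file.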

#Check the new distribution

plt.subplots(figsize=(15,10))

g = sns.distplot(train['SalePrice'], fit=norm, label = "Skewness : %.2f"%(train['SalePrice'].skew()));

g = g.legend(loc="best")


Handling skewed features

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check how skewed they are

skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)

plt.subplots(figsize =(65, 20))

skewed_feats.plot(kind='bar');




from scipy.special import boxcox1p

skewness = skewed_feats[abs(skewed_feats) > 0.5]

skewed_features = skewness.index

lam = 0.15

for feat in skewed_features:

    all_data[feat] = boxcox1p(all_data[feat], lam)

print(skewness.shape[0],  "skewed numerical features have been Box-Cox transformed")

294 skewed numerical features have been Box-Cox transformed
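For reference, `boxcox1p(x, lam)` computes the Box-Cox transform of 1 + x. The sketch below re-implements that formula in plain numpy (the function name `boxcox1p_sketch` is made up here) to show what the loop above does to each column with lam = 0.15:

```python
import numpy as np

def boxcox1p_sketch(x, lam):
    """Box-Cox transform of 1 + x: ((1 + x)**lam - 1) / lam, or log1p(x) when lam == 0."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log1p(x)
    return ((1.0 + x) ** lam - 1.0) / lam

lam = 0.15
values = np.array([0.0, 1.0, 10.0, 1000.0])
transformed = boxcox1p_sketch(values, lam)

# Zeros stay at zero, the order is preserved, and large values are compressed.
for v, t in zip(values, transformed):
    print("%8.1f -> %.4f" % (v, t))
```

One consequence worth knowing: `numeric_feats` as defined above also contains the one-hot columns, so a 1 in such a column becomes roughly 0.7305, which is the value that shows up repeatedly in the `X_train` array printed later. Restricting the transform to genuinely continuous columns would be a possible variation.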

Preparing the data for model training

train = all_data[:ntrain]

test = all_data[ntrain:]

print(train.shape)

print(test.shape)

(1456, 321)

(1459, 321)

y_train.shape

(1456,)

feature importance

import xgboost as xgb

model = xgb.XGBRegressor()

model.fit(train, y_train)

# Sort the feature importances of the XGBoost model fitted above, largest first

indices = np.argsort(model.feature_importances_)[::-1]

indices = indices[:75]

# Visualise these with a barplot

plt.subplots(figsize=(20, 15))

g = sns.barplot(y=train.columns[indices], x = model.feature_importances_[indices], orient='h')

g.set_xlabel("Relative importance",fontsize=12)

g.set_ylabel("Features",fontsize=12)

g.tick_params(labelsize=9)

g.set_title("XGB feature importance");

/Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages/xgboost/core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version

  if getattr(data, 'base', None) is not None and \

/Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages/xgboost/core.py:588: FutureWarning: Series.base is deprecated and will be removed in a future version

  data.base is not None and isinstance(data, np.ndarray) \

[11:04:46] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.


xgb_train = train.copy()

xgb_test = test.copy()

from sklearn.feature_selection import SelectFromModel

xgb_feat_red = SelectFromModel(model,prefit=True)

# reduce estimation validation and test datasets

xgb_train = xgb_feat_red.transform(xgb_train)

xgb_test = xgb_feat_red.transform(xgb_test)

print('X_train: ', xgb_train.shape, '\nX_test: ', xgb_test.shape)

X_train:  (1456, 47)

X_test:  (1459, 47)
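As far as I understand scikit-learn's default, `SelectFromModel` with no explicit `threshold` keeps the features whose importance is at least the mean importance (L1-penalised linear models use a tiny fixed threshold instead). The numpy sketch below mimics that rule on made-up importances; the feature names and numbers are purely illustrative:

```python
import numpy as np

# Made-up importances for six hypothetical features.
feature_names = np.array(["f0", "f1", "f2", "f3", "f4", "f5"])
importances = np.array([0.40, 0.05, 0.25, 0.02, 0.20, 0.08])

threshold = importances.mean()         # assumed default rule: the mean importance
keep_mask = importances >= threshold   # boolean mask over the columns

X = np.arange(18, dtype=float).reshape(3, 6)   # 3 rows, 6 columns
X_reduced = X[:, keep_mask]                    # keeps only the selected columns

print("kept:", feature_names[keep_mask].tolist())
print("shape:", X_reduced.shape)
```

Passing something like `threshold='median'` or a number to `SelectFromModel` changes how aggressive the reduction is, which is another knob to try if 47 features turns out to be too few or too many.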



from sklearn import model_selection

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(xgb_train, y_train, test_size=0.3, random_state=42)

# X_train = predictor features for estimation dataset

# X_test = predictor variables for validation dataset

# Y_train = target variable for the estimation dataset

# Y_test = target variable for the validation dataset

print('X_train: ', X_train.shape, '\nX_test: ', X_test.shape, '\nY_train: ', Y_train.shape, '\nY_test: ', Y_test.shape)

X_train:  (1019, 47)

X_test:  (437, 47)

Y_train:  (1019,)

Y_test:  (437,)

X_train

array([[0.73046315, 3.        , 0.73046315, ..., 0.        , 0.        ,

        0.        ],

       [0.73046315, 3.        , 0.73046315, ..., 0.        , 0.        ,

        0.        ],

       [1.19431764, 2.        , 0.73046315, ..., 0.        , 0.        ,

        0.        ],

       ...,

       [1.8203341 , 3.        , 0.73046315, ..., 0.73046315, 0.        ,

        0.        ],

       [0.73046315, 3.        , 0.73046315, ..., 0.        , 0.        ,

        0.        ],

       [1.54096276, 3.        , 0.73046315, ..., 0.        , 0.        ,

        0.        ]])

Training different models

# Import the regression models from sklearn

from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC

from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor, ExtraTreesRegressor

from sklearn.kernel_ridge import KernelRidge

import xgboost as xgb

print('Algorithm packages imported!')

Algorithm packages imported!

# Model selection packages used for sampling dataset and optimising parameters

from sklearn import model_selection

from sklearn.model_selection import KFold

from sklearn.model_selection import cross_val_score, train_test_split

from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import ShuffleSplit

print('Model selection packages imported!')

Model selection packages imported!

models = [KernelRidge(),ElasticNet(),Lasso(),GradientBoostingRegressor(),BayesianRidge(),LassoLarsIC(),RandomForestRegressor(),xgb.XGBRegressor()]

# Random sampling; a regular split with shuffle=True would also work

# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

shuff =ShuffleSplit(n_splits=5,test_size=0.2,random_state=42)
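Roughly speaking, `ShuffleSplit` draws a fresh random train/test partition for every split, so the test parts of different splits may overlap (unlike the folds of `KFold`). The sketch below imitates that behaviour with numpy; `shuffle_split_indices` is a made-up helper and does not reproduce scikit-learn's exact random stream:

```python
import numpy as np

def shuffle_split_indices(n_samples, n_splits=5, test_size=0.2, seed=42):
    """Yield (train_idx, test_idx) pairs, each taken from an independent random permutation."""
    rng = np.random.default_rng(seed)
    n_test = int(np.ceil(test_size * n_samples))
    for _ in range(n_splits):
        perm = rng.permutation(n_samples)
        yield perm[n_test:], perm[:n_test]

splits = list(shuffle_split_indices(n_samples=20))
for i, (train_idx, test_idx) in enumerate(splits):
    print(i, "test rows:", sorted(test_idx.tolist()))
```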

# Create a DataFrame to store each model's metrics (the values stored below are RMSE x 100, despite the column names)

columns = ['Name','Parameters','Train mean_squared_error','Test mean_squared_error']

before_model_compare = pd.DataFrame(columns=columns)

# Add each model's parameters and results to the DataFrame

row_index=0

for alg in models:

    model_name = alg.__class__.__name__

    before_model_compare.loc[row_index,'Name'] = model_name

    before_model_compare.loc[row_index,'Parameters'] = str(alg.get_params())

    alg.fit(X_train,Y_train)

    # cross_val_score returns negative MSE values; negate them, average over the splits, then take the square root to get an RMSE

    training_results = np.sqrt((-cross_val_score(alg,X_train,Y_train,cv=shuff,scoring='neg_mean_squared_error')).mean())

    test_results = np.sqrt(((Y_test-alg.predict(X_test))**2).mean())

    before_model_compare.loc[row_index,"Train mean_squared_error"] = training_results*100

    before_model_compare.loc[row_index,'Test mean_squared_error'] = test_results*100

    row_index+=1

    print(row_index,model_name,"trained>>>>")

decimals = 3

before_model_compare['Train mean_squared_error'] = before_model_compare['Train mean_squared_error'].apply(lambda x:round(x,decimals))

before_model_compare['Test mean_squared_error'] = before_model_compare['Test mean_squared_error'].apply(lambda x:round(x,decimals))

before_model_compare

1 KernelRidge trained>>>>

2 ElasticNet trained>>>>

3 Lasso trained>>>>

4 GradientBoostingRegressor trained>>>>

5 BayesianRidge trained>>>>

6 LassoLarsIC trained>>>>

7 RandomForestRegressor trained>>>>

8 XGBRegressor trained>>>>

In the recorded output below, both score columns show the cross-validated RMSE × 100, because that run filled the Test column from the rounded Train column; the hold-out scores were not recorded.

	Name	Parameters	Train mean_squared_error	Test mean_squared_error
0	KernelRidge	{'alpha': 1, 'coef0': 1, 'degree': 3, 'gamma':...	31.424	31.424
1	ElasticNet	{'alpha': 1.0, 'copy_X': True, 'fit_intercept'...	23.245	23.245
2	Lasso	{'alpha': 1.0, 'copy_X': True, 'fit_intercept'...	28.008	28.008
3	GradientBoostingRegressor	{'alpha': 0.9, 'criterion': 'friedman_mse', 'i...	12.381	12.381
4	BayesianRidge	{'alpha_1': 1e-06, 'alpha_2': 1e-06, 'compute_...	11.118	11.118
5	LassoLarsIC	{'copy_X': True, 'criterion': 'aic', 'eps': 2....	11.818	11.818
6	RandomForestRegressor	{'bootstrap': True, 'criterion': 'mse', 'max_d...	14.299	14.299
7	XGBRegressor	{'base_score': 0.5, 'booster': 'gbtree', 'cols...	12.466	12.466
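Since the target was transformed with `log1p`, an RMSE computed on it measures relative rather than absolute price errors, which is in the spirit of the competition metric. The sketch below uses made-up numbers to show the two common ways of turning per-split negative-MSE scores into one RMSE, and what a uniform 10% overestimate looks like on the log scale:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up negative-MSE scores, shaped like the output of cross_val_score.
neg_mse_scores = np.array([-0.010, -0.014, -0.012, -0.020, -0.016])

rmse_of_mean = float(np.sqrt((-neg_mse_scores).mean()))   # the aggregation used in the loop above
mean_of_rmse = float(np.sqrt(-neg_mse_scores).mean())     # a common alternative

# Made-up prices: every prediction is 10% too high.
y_true = np.log1p([100_000.0, 200_000.0, 400_000.0])
y_pred = np.log1p([110_000.0, 220_000.0, 440_000.0])
log_rmse = rmse(y_true, y_pred)

print(rmse_of_mean, mean_of_rmse, log_rmse)
```

The two aggregations differ slightly (the square root of the mean is never smaller than the mean of the square roots), so scores are only comparable when they were aggregated the same way.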

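Parameter tuning

Above, we fitted several models with their default settings and took a first look at their scores.
In practice, each of these models needs further hyperparameter tuning.
The next step uses GridSearchCV to adjust the parameters.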
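Conceptually, a grid search enumerates every combination of the listed values, scores each one, and keeps the best. Below is a library-free sketch of that idea; `expand_grid`, `grid_search` and `toy_score` are made-up names, and `toy_score` merely stands in for a cross-validated score:

```python
from itertools import product

def expand_grid(param_grid):
    """Return every parameter combination of a {name: [values]} grid as a list of dicts."""
    names = sorted(param_grid)
    return [dict(zip(names, combo))
            for combo in product(*(param_grid[n] for n in names))]

def grid_search(score_fn, param_grid):
    """Score every combination and return (best_params, best_score); higher is better."""
    scored = [(score_fn(**params), params) for params in expand_grid(param_grid)]
    best_score, best_params = max(scored, key=lambda pair: pair[0])
    return best_params, best_score

# Hypothetical stand-in for a cross-validated score (negative error, so higher is better).
def toy_score(alpha, l1_ratio):
    return -((alpha - 0.001) ** 2 + (l1_ratio - 0.6) ** 2)

grid = {"alpha": [0.0005, 0.001, 0.01], "l1_ratio": [0.3, 0.6, 0.9]}
best_params, best_score = grid_search(toy_score, grid)
print(len(expand_grid(grid)), "candidates, best:", best_params)
```

Note that the grids defined in this post list a single value per parameter, so `GridSearchCV` evaluates exactly one candidate per model there; widening the lists is what turns it into an actual search.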

models = [KernelRidge(),ElasticNet(),Lasso(),GradientBoostingRegressor(),BayesianRidge(),LassoLarsIC(),RandomForestRegressor(),

         xgb.XGBRegressor()]

KR_param_grid = {'alpha': [0.1], 'coef0': [100], 'degree': [1], 'gamma': [None], 'kernel': ['polynomial']}

EN_param_grid = {'alpha': [0.001], 'copy_X': [True], 'l1_ratio': [0.6], 'fit_intercept': [True], 'normalize': [False],

                         'precompute': [False], 'max_iter': [300], 'tol': [0.001], 'selection': ['random'], 'random_state': [None]}

LASS_param_grid = {'alpha': [0.0005], 'copy_X': [True], 'fit_intercept': [True], 'normalize': [False], 'precompute': [False],

                    'max_iter': [300], 'tol': [0.01], 'selection': ['random'], 'random_state': [None]}

GB_param_grid = {'loss': ['huber'], 'learning_rate': [0.1], 'n_estimators': [300], 'max_depth': [3],

                                        'min_samples_split': [0.0025], 'min_samples_leaf': [5]}

BR_param_grid = {'n_iter': [200], 'tol': [0.00001], 'alpha_1': [0.00000001], 'alpha_2': [0.000005], 'lambda_1': [0.000005],

                 'lambda_2': [0.00000001], 'copy_X': [True]}

LL_param_grid = {'criterion': ['aic'], 'normalize': [True], 'max_iter': [100], 'copy_X': [True], 'precompute': ['auto'], 'eps': [0.000001]}

RFR_param_grid = {'n_estimators': [50], 'max_features': ['auto'], 'max_depth': [None], 'min_samples_split': [5], 'min_samples_leaf': [2]}

XGB_param_grid = {'max_depth': [3], 'learning_rate': [0.1], 'n_estimators': [300], 'booster': ['gbtree'], 'gamma': [0], 'reg_alpha': [0.1],

                  'reg_lambda': [0.7], 'max_delta_step': [0], 'min_child_weight': [1], 'colsample_bytree': [0.5], 'colsample_bylevel': [0.2],

                  'scale_pos_weight': [1]}

params_grid = [KR_param_grid, EN_param_grid, LASS_param_grid, GB_param_grid, BR_param_grid, LL_param_grid, RFR_param_grid, XGB_param_grid]

after_model_compare = pd.DataFrame(columns=columns)

row_index= 0

for alg in models:

    gs_alg = GridSearchCV(alg,param_grid=params_grid[0],cv=shuff,scoring='neg_mean_squared_error',n_jobs=-1)

    params_grid.pop(0)

    model_name = alg.__class__.__name__

    after_model_compare.loc[row_index,'Name'] = model_name

    gs_alg.fit(X_train,Y_train)

    gs_best=gs_alg.best_estimator_

    after_model_compare.loc[row_index,"Parameters"] = str(gs_alg.best_params_)

    after_training_results = np.sqrt(-gs_alg.best_score_)

    after_test_results = np.sqrt(((Y_test-gs_alg.predict(X_test))**2).mean())

    after_model_compare.loc[row_index,"Train mean_squared_error"] = after_training_results*100

    after_model_compare.loc[row_index,'Test mean_squared_error']= after_test_results*100

    row_index+=1

    print(row_index,model_name,"trained>>>>>")

decimals = 3

after_model_compare['Train mean_squared_error'] = after_model_compare['Train mean_squared_error'].apply(lambda x:round(x,decimals))

after_model_compare['Test mean_squared_error'] = after_model_compare['Test mean_squared_error'].apply(lambda x:round(x,decimals))

after_model_compare

1 KernelRidge trained>>>>>

2 ElasticNet trained>>>>>

3 Lasso trained>>>>>

4 GradientBoostingRegressor trained>>>>>

5 BayesianRidge trained>>>>>

6 LassoLarsIC trained>>>>>

7 RandomForestRegressor trained>>>>>

8 XGBRegressor trained>>>>>

In this recorded output both score columns again show the cross-validated RMSE × 100 (the Test column was filled from the rounded Train column in that run).

	Name	Parameters	Train mean_squared_error	Test mean_squared_error
0	KernelRidge	{'alpha': 0.1, 'coef0': 100, 'degree': 1, 'gam...	11.140	11.140
1	ElasticNet	{'alpha': 0.001, 'copy_X': True, 'fit_intercep...	11.234	11.234
2	Lasso	{'alpha': 0.0005, 'copy_X': True, 'fit_interce...	11.203	11.203
3	GradientBoostingRegressor	{'learning_rate': 0.1, 'loss': 'huber', 'max_d...	11.966	11.966
4	BayesianRidge	{'alpha_1': 1e-08, 'alpha_2': 5e-06, 'copy_X':...	11.118	11.118
5	LassoLarsIC	{'copy_X': True, 'criterion': 'aic', 'eps': 1e...	11.818	11.818
6	RandomForestRegressor	{'max_depth': None, 'max_features': 'auto', 'm...	13.735	13.735
7	XGBRegressor	{'booster': 'gbtree', 'colsample_bylevel': 0.2...	11.964	11.964

stacking method

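Prepare a set of base models.
Split the training data into a fitting part and a validation part (X_train, Y_train, X_test, Y_test).
Fit each model on X_train and predict X_test (the validation set); these predictions, together with the true Y_test, form the new training set (new_train dataset).
Use the same fitted models to predict the test dataset; each model's predictions become one column of the new test set (new_test dataset).
Fit a relatively simple, or simply different, model (the meta-model), for example Lasso, on the new training set.
Use the fitted meta-model to predict the new test set (new_test dataset), which gives the final predictions.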
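The steps above can be sketched end to end on synthetic data. To keep the example self-contained, the base models here are closed-form ridge regressions with different penalties and the meta-model is another ridge regression; these are stand-ins chosen for the sketch, not the models used in this post, and the helper names are made up:

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalised intercept; returns the weights."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0                      # do not penalise the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def predict_ridge(w, X):
    """Predict with weights returned by fit_ridge."""
    return np.hstack([np.ones((X.shape[0], 1)), X]) @ w

# Synthetic data standing in for the real features (an assumption of this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)

X_fit, y_fit = X[:150], y[:150]          # part used to fit the base models
X_val, y_val = X[150:225], y[150:225]    # validation (hold-out) part
X_new, y_new = X[225:], y[225:]          # plays the role of the competition test set

# Fit the base models, then stack their validation predictions column-wise.
base_alphas = [0.01, 1.0, 100.0]
base_weights = [fit_ridge(X_fit, y_fit, a) for a in base_alphas]
new_train = np.column_stack([predict_ridge(w, X_val) for w in base_weights])

# The same fitted base models predict the test set.
new_test = np.column_stack([predict_ridge(w, X_new) for w in base_weights])

# Fit the meta-model on (new_train, y_val), then predict new_test.
meta_w = fit_ridge(new_train, y_val, alpha=1e-3)
stacked_pred = predict_ridge(meta_w, new_test)

# y_new is known here only because the data is synthetic.
rmse = float(np.sqrt(np.mean((stacked_pred - y_new) ** 2)))
print("stacked RMSE on the held-back rows: %.3f" % rmse)
```

The key design point is that the meta-model only ever sees predictions for rows the base models were not fitted on; training it on in-sample predictions would reward whichever base model overfits the most.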

models  = [KernelRidge(),ElasticNet(),Lasso(),GradientBoostingRegressor(),BayesianRidge(),LassoLarsIC(),RandomForestRegressor(),xgb.XGBRegressor()]

names = ['KernelRidge','ElasticNet','Lasso','GradientBoostingRegressor','BayesianRidge','LassoLarsIC','RandomForest','XGBoost']

params_grid = [KR_param_grid, EN_param_grid, LASS_param_grid, GB_param_grid, BR_param_grid, LL_param_grid, RFR_param_grid, XGB_param_grid]

stacked_validation_train = pd.DataFrame()

stacked_test_train = pd.DataFrame()

row_index= 0

for alg in models:

    gs_alg = GridSearchCV(alg,param_grid=params_grid[0],cv=shuff,scoring='neg_mean_squared_error',n_jobs=-1)

    params_grid.pop(0)

    gs_alg.fit(X_train,Y_train)

    gs_best = gs_alg.best_estimator_

    stacked_validation_train.insert(loc= row_index,column=names[0],value=gs_best.predict(X_test))

    # DataFrame.insert: loc is the column position, column is the column name, value is the data to insert

    print(row_index+1,alg.__class__.__name__,"validation-set predictions stacked into the new training set")

    stacked_test_train.insert(loc=row_index,column=names[0],value=gs_best.predict(xgb_test))

    print(row_index+1,alg.__class__.__name__,"test-set predictions stacked into the new test set")

    print("---"*50)

    names.pop(0)

    row_index+=1

print("First layer finished: the new training and test sets are ready")

1 KernelRidge validation-set predictions stacked into the new training set

1 KernelRidge test-set predictions stacked into the new test set

------------------------------------------------------------------------------------------------------------------------------------------------------

2 ElasticNet validation-set predictions stacked into the new training set

2 ElasticNet test-set predictions stacked into the new test set

------------------------------------------------------------------------------------------------------------------------------------------------------

3 Lasso validation-set predictions stacked into the new training set

3 Lasso test-set predictions stacked into the new test set

------------------------------------------------------------------------------------------------------------------------------------------------------

4 GradientBoostingRegressor validation-set predictions stacked into the new training set

4 GradientBoostingRegressor test-set predictions stacked into the new test set

------------------------------------------------------------------------------------------------------------------------------------------------------

5 BayesianRidge validation-set predictions stacked into the new training set

5 BayesianRidge test-set predictions stacked into the new test set

------------------------------------------------------------------------------------------------------------------------------------------------------

6 LassoLarsIC validation-set predictions stacked into the new training set

6 LassoLarsIC test-set predictions stacked into the new test set

------------------------------------------------------------------------------------------------------------------------------------------------------

7 RandomForestRegressor validation-set predictions stacked into the new training set

7 RandomForestRegressor test-set predictions stacked into the new test set

------------------------------------------------------------------------------------------------------------------------------------------------------

8 XGBRegressor validation-set predictions stacked into the new training set

8 XGBRegressor test-set predictions stacked into the new test set

------------------------------------------------------------------------------------------------------------------------------------------------------

First layer finished: the new training and test sets are ready

print(stacked_validation_train.shape)

stacked_validation_train.head()

# Predictions for the Y_test (validation) rows, one column per model

(437, 8)

	KernelRidge	ElasticNet	Lasso	GradientBoostingRegressor	BayesianRidge	LassoLarsIC	RandomForest	XGBoost
0	12.096814	12.095574	12.095347	12.103610	12.095675	12.104932	12.170897	12.084927
1	11.952395	11.966939	11.964576	12.027570	11.957859	11.999328	12.066678	12.071651
2	11.798390	11.800390	11.807569	11.842686	11.807968	11.787126	11.880778	11.789903
3	11.834224	11.814334	11.820662	11.806835	11.840026	11.837654	11.755137	11.753889
4	11.287412	11.267859	11.271162	11.150576	11.289689	11.290524	11.328786	11.278980

print(stacked_test_train.shape)

stacked_test_train.head()

(1459, 8)

	KernelRidge	ElasticNet	Lasso	GradientBoostingRegressor	BayesianRidge	LassoLarsIC	RandomForest	XGBoost
0	11.655653	11.666206	11.661235	11.717153	11.664298	11.639410	11.735618	11.754628
1	12.033653	12.042914	12.039875	11.950150	12.032724	12.007921	11.956780	11.985191
2	12.121196	12.121925	12.124266	12.138572	12.125334	12.072644	12.097413	12.115376
3	12.194246	12.200128	12.201113	12.166538	12.196015	12.143436	12.095009	12.139894
4	12.171520	12.180859	12.179168	12.145913	12.167523	12.168576	12.178091	12.176064

stacked_validation_train.drop('Lasso',axis=1,inplace=True)

stacked_test_train.drop('Lasso',axis=1,inplace=True)

from sklearn.pipeline import make_pipeline

from sklearn.preprocessing import RobustScaler

meta_model = make_pipeline(RobustScaler(),Lasso(alpha=0.00001,copy_X=True,fit_intercept=True,normalize=False,precompute=False,

                                               max_iter=10000,tol=0.0001,selection='random',random_state=42))

meta_model.fit(stacked_validation_train,Y_test)

meta_model_pred= np.expm1(meta_model.predict(stacked_test_train))

print("meta_model fitted; test-set predictions computed")

meta_model fitted; test-set predictions computed

/Users/aihuishou/anaconda3/envs/work/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.7538551527086552, tolerance: 0.006483051719467419

  positive)

models = [KernelRidge(), ElasticNet(), Lasso(), GradientBoostingRegressor(), BayesianRidge(), LassoLarsIC(), RandomForestRegressor(), xgb.XGBRegressor()]

names = ['KernelRidge', 'ElasticNet', 'Lasso', 'Gradient Boosting', 'Bayesian Ridge', 'Lasso Lars IC', 'Random Forest', 'XGBoost']

params_grid = [KR_param_grid, EN_param_grid, LASS_param_grid, GB_param_grid, BR_param_grid, LL_param_grid, RFR_param_grid, XGB_param_grid]

final_predictions = pd.DataFrame()

row_index=0

for alg in models:

    gs_alg = GridSearchCV(alg, param_grid = params_grid[0], cv = shuff, scoring = 'neg_mean_squared_error', n_jobs=-1)

    params_grid.pop(0)

    gs_alg.fit(stacked_validation_train, Y_test)

    gs_best = gs_alg.best_estimator_

    final_predictions.insert(loc = row_index, column = names[0], value = np.expm1(gs_best.predict(stacked_test_train)))

    print(row_index+1, alg.__class__.__name__, 'final results predicted added to table...')

    names.pop(0)

    row_index+=1

print("-"*50)

print("Done")

final_predictions.head()

1 KernelRidge final results predicted added to table...

2 ElasticNet final results predicted added to table...

3 Lasso final results predicted added to table...

4 GradientBoostingRegressor final results predicted added to table...

5 BayesianRidge final results predicted added to table...

6 LassoLarsIC final results predicted added to table...

7 RandomForestRegressor final results predicted added to table...

8 XGBRegressor final results predicted added to table...

--------------------------------------------------

Done

	KernelRidge	ElasticNet	Lasso	Gradient Boosting	Bayesian Ridge	Lasso Lars IC	Random Forest	XGBoost
0	120698.786728	121126.968875	120569.541877	119545.552352	121817.672344	121618.593011	120774.731602	117987.320312
1	162778.261755	162293.616103	163198.661456	154034.245333	162888.953970	162663.194168	154944.085742	154422.265625
2	184187.690046	183822.395933	184145.902661	181996.954345	185167.984485	184643.383928	181824.224304	174336.687500
3	193128.541814	192388.040730	193035.580999	195110.109361	193760.580424	193069.794744	188563.541259	181933.593750
4	192957.823204	192839.290437	193289.070140	192292.299199	192910.466862	192890.725826	190770.891456	192144.093750

ensemble = meta_model_pred*(1/10) + final_predictions['XGBoost']*(1.5/10) + final_predictions['Gradient Boosting']*(2/10) + final_predictions['Bayesian Ridge']*(1/10) + final_predictions['Lasso']*(1/10) + final_predictions['KernelRidge']*(1/10) + final_predictions['Lasso Lars IC']*(1/10) + final_predictions['Random Forest']*(1.5/10)
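The ensemble line above is a weighted average whose weights (in tenths) add up to 10, so the result stays on the price scale. The sketch below writes the same idea as a small helper; `blend` is a made-up name, the weights are copied from that line, and the prediction values are invented:

```python
import numpy as np

# Weights in tenths, copied from the ensemble line above.
weights = {
    "meta_model": 1.0, "XGBoost": 1.5, "Gradient Boosting": 2.0, "Bayesian Ridge": 1.0,
    "Lasso": 1.0, "KernelRidge": 1.0, "Lasso Lars IC": 1.0, "Random Forest": 1.5,
}
total = sum(weights.values())   # dividing by this total makes the weights sum to 1

def blend(predictions, weights):
    """Weighted average of per-model prediction arrays, normalised by the weight total."""
    norm = sum(weights.values())
    return sum(weights[name] * np.asarray(predictions[name], dtype=float)
               for name in weights) / norm

# Hypothetical predictions for two houses; each model is offset by 1,000 for illustration.
predictions = {name: np.array([120_000.0, 160_000.0]) + 1_000.0 * i
               for i, name in enumerate(weights)}
blended = blend(predictions, weights)
print(total, blended)
```

Normalising inside the helper means the weights can be changed freely without having to keep their sum at exactly 10 by hand.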

submission = pd.DataFrame()

test1 = pd.read_csv('test.csv',index_col=False)

test_ID = test1['Id']

submission['Id'] = test_ID

submission['SalePrice'] = ensemble

submission.to_csv('final_submission.csv',index=False)

print("Submission file, created!")

Submission file, created!