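Finding high-value apartments along the Chuo Line with regression analysis and machine learning (building the rent prediction model)

In the previous post, I visualized the data and selected the variables.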

pompom168.hatenablog.com

This time I build the rent prediction model in earnest. Of the scraped listings, 80% are used for training and the remaining 20% as a test set for evaluation.
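The code in this post splits the rows in file order (first 80%, last 20%). If the CSV happens to be ordered, for example by station, a shuffled split may give a more representative test set. The sketch below compares the two on a tiny made-up table; `df_demo` and its columns are hypothetical stand-ins, not the scraped data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the scraped listings (10 rows, made-up values).
df_demo = pd.DataFrame({
    "walk_minutes": np.arange(10),
    "rent": np.arange(10) * 1000 + 60000,
})

# Sequential 80/20 split, the approach used in this post.
n_train = int(len(df_demo) * 0.8)
seq_train, seq_test = df_demo[:n_train], df_demo[n_train:]

# Shuffled 80/20 split; random_state fixes the shuffle so it is reproducible.
shuf_train, shuf_test = train_test_split(df_demo, test_size=0.2, random_state=0)

print(len(seq_train), len(seq_test))
print(len(shuf_train), len(shuf_test))
```

Both splits have the same sizes; only the choice of which rows end up in the test set differs.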

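Variables used

Explanatory variables

Number of rooms, K in layout (yes/no), L in layout (yes/no), S in layout (yes/no), building age, building height, floor of the unit, walking time, station (Nakano), station (Asagaya), station (Koenji), station (Ogikubo), station (Nishi-Ogikubo), station (Kichijoji), station (Mitaka), station (Musashi-Sakai), station (Higashi-Koganei), station (Musashi-Koganei), station (Kokubunji), station (Nishi-Kokubunji), station (Kunitachi)

Response variable

Rent + management fee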

Regression model by multiple regression analysis

First, multiple regression.

I implemented it with scikit-learn, a Python library.

The source code is below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

df = pd.read_csv('input_bukken_chuoline.csv', sep = '\t',encoding='utf-16')

# Drop the unneeded index column
df.drop(['Unnamed: 0'], axis=1, inplace=True)

# Containers for the fitted model's outputs
coef = np.zeros((21,1))
intercept = np.zeros(1)
score = np.zeros(1)

# Root mean square error (RMSE)
rmse = np.zeros(1)

# Use the first 80% of rows for training and the last 20% for testing
dfTrain = df[0:int(len(df.index) * 0.8)]
dfTest = df[int(len(df.index) * 0.8):len(df.index)]
    
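# Define the model
clf = linear_model.LinearRegression()

# Explanatory variables: every column except "賃料+管理費", "間取りK", "専有面積"
explanatoryVar = dfTrain.drop(["賃料+管理費","間取りK", "専有面積"], axis=1)
# Standardize each column to zero mean and unit variance
explanatoryVar = explanatoryVar.apply(lambda x: (x - np.mean(x)) / np.std(x))
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
X = explanatoryVar.to_numpy()

# Response variable: "賃料+管理費" (rent + management fee)
Y = dfTrain['賃料+管理費'].to_numpy()

# Fit the prediction model
clf.fit(X, Y)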

# Regression coefficients
coef[:,0] = clf.coef_

# Intercept
intercept[0] = clf.intercept_

# Coefficient of determination (R^2) on the training data
score[0] = clf.score(X, Y)

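# Explanatory variables for the test set
explanatoryVarTest = dfTest.drop(["賃料+管理費","間取りK", "専有面積"], axis=1)
# Note: the test set is standardized with its own mean and standard deviation here
explanatoryVarTest = explanatoryVarTest.apply(lambda x: (x - np.mean(x)) / np.std(x))
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
testX = explanatoryVarTest.to_numpy()

# Actual values for the test set
ansY = dfTest['賃料+管理費'].to_numpy()

# Predictions for the test set
testY = clf.predict(testX)

# Compute RMSE
rmse[0] = np.sqrt(np.mean(np.square(np.array(testY - ansY))))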

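# Visualize the results: sort listings by actual rent, then overlay the predictions
idx = np.argsort(np.array(ansY))
plt.plot(np.arange(0,len(ansY)),np.array(ansY)[idx], color='blue')
# Predictions drawn in red to match the description of the figure
plt.plot(np.arange(0,len(testY)),np.array(testY)[idx], color='red', alpha=0.4)
plt.savefig("正規化＿予測＆正解")
plt.show()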
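One thing to be aware of in the code above: the test set is standardized with its own mean and standard deviation, so training and test features end up on slightly different scales. A common alternative is to compute the statistics on the training data only and reuse them for the test data. The following is a minimal sketch of that idea using scikit-learn's `StandardScaler` on synthetic arrays; `X_train` and `X_test` are made-up stand-ins, not the listing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices (assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_test = rng.normal(loc=55.0, scale=10.0, size=(20, 3))

# Learn the mean and standard deviation from the training data only...
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)

# ...and apply the same shift and scale to the test data.
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0))  # close to 0 for every column
print(X_test_std.mean(axis=0))   # generally not 0, because train statistics are used
```

`StandardScaler` uses the population standard deviation, the same convention as `np.std`, so the training features come out identical to the lambda-based standardization in the post.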

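Let's look at the results.

First, the partial regression coefficients of the fitted model.

[Figure: partial regression coefficients of the fitted model]

Because the explanatory variables were standardized before fitting, each coefficient is the change in rent for a one-standard-deviation increase in that variable: about +10,486 yen for the number of rooms and about -4,866 yen for building age. The signs and magnitudes look plausible.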
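Since the model was fit on standardized explanatory variables, a coefficient can be turned into a per-unit effect (for example yen per year of building age) by dividing it by that variable's standard deviation. Here is a small sketch of that conversion on synthetic data; the numbers are made up and `age_years` and `rent` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: rent falls by roughly 800 yen per year of building age (assumption).
rng = np.random.default_rng(0)
age_years = rng.uniform(0, 40, size=200)
rent = 90000 - 800 * age_years + rng.normal(0, 500, size=200)

X_raw = age_years.reshape(-1, 1)
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Coefficient on the standardized variable: yen per one standard deviation of age.
coef_std = LinearRegression().fit(X_std, rent).coef_[0]

# Coefficient on the raw variable: yen per year of age.
coef_raw = LinearRegression().fit(X_raw, rent).coef_[0]

# Dividing by the standard deviation recovers the per-year effect.
coef_per_year = coef_std / X_raw.std(axis=0)[0]

print(coef_std, coef_raw, coef_per_year)
```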

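As for stations, rent falls for listings at Musashi-Sakai and the stations west of it.

Next, I overlay the actual and predicted rents for the test data.

The horizontal axis is the listing index (test listings sorted by actual rent), the vertical axis is rent, the blue line is the actual rent, and the red line is the predicted rent.

[Figure: actual vs. predicted rent for the test data, multiple regression]

The predictions are roughly on target, but the model misses on especially cheap and especially expensive listings.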

Regression model by Random Forest

Next I build a prediction model with Random Forest, an ensemble learning method from machine learning.
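To make "ensemble" concrete: a random forest regressor trains many decision trees and averages their predictions. The sketch below checks this on synthetic data by averaging the individual trees by hand and comparing with the forest's own output. The data is made up and is not the rent data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption, not the listings).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Predict with every individual tree, then average across trees.
tree_preds = np.array([tree.predict(X) for tree in forest.estimators_])
manual_mean = tree_preds.mean(axis=0)

# The forest's prediction matches the average of its trees.
print(np.allclose(manual_mean, forest.predict(X)))
```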

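This too was implemented with scikit-learn.

I ran a grid search over the following parameters.

Number of decision trees (n_estimators): [5, 10, 30, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]

Maximum number of explanatory variables considered at each split (max_features): ['auto']. For RandomForestRegressor, 'auto' means all explanatory variables (equivalent to 1.0), not sqrt(number of explanatory variables); the 'auto' option was removed in scikit-learn 1.3.

Minimum number of samples required to split a node (min_samples_split): [2, 3, 5, 10]

Maximum depth of each decision tree (max_depth): [None (no limit), 3, 5, 10]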
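The grid above is fairly large, so it helps to count the combinations before running it. A quick sketch with scikit-learn's `ParameterGrid`, listing only the parameters that take more than one value:

```python
from sklearn.model_selection import ParameterGrid

# The multi-valued parameters from the grid described above.
parameters = {
    'n_estimators': [5, 10, 30, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    'min_samples_split': [2, 3, 5, 10],
    'max_depth': [None, 3, 5, 10],
}

# 14 * 4 * 4 combinations
n_candidates = len(ParameterGrid(parameters))
print(n_candidates)
```

This gives 224 candidates, and GridSearchCV fits each one once per cross-validation fold, so the search can take a while with up to 1000 trees per forest.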

The source code is below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('input_bukken_chuoline.csv', sep = '\t',encoding='utf-16')

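# Drop the unneeded index column
df.drop(['Unnamed: 0'], axis=1, inplace=True)

# Parameters for the grid search
parameters = {
    'n_estimators'      : [5, 10, 30, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    # 1.0 = use all features, which is what 'auto' meant for RandomForestRegressor;
    # 'auto' itself was removed in scikit-learn 1.3
    'max_features'      : [1.0],
    'random_state'      : [0],
    'min_samples_split' : [2, 3, 5, 10],
    'max_depth'         : [None, 3, 5, 10]
}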

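# Root mean square error (RMSE)
rmse = np.zeros(1)

# Use the first 80% of rows for training and the last 20% for testing
dfTrain = df[0:int(len(df.index) * 0.8)]
dfTest = df[int(len(df.index) * 0.8):len(df.index)]

# Define the model: grid search with cross-validation over the random forest
clf = GridSearchCV(RandomForestRegressor(), parameters)

# Explanatory variables: every column except "賃料+管理費", "間取りK", "専有面積"
explanatoryVar = dfTrain.drop(["賃料+管理費","間取りK", "専有面積"], axis=1)
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
X = explanatoryVar.to_numpy()

# Response variable: "賃料+管理費" (rent + management fee)
Y = dfTrain['賃料+管理費'].to_numpy()

# Fit the prediction model
clf.fit(X, Y)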

print(clf.best_estimator_)

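# Explanatory variables for the test set
explanatoryVarTest = dfTest.drop(["賃料+管理費","間取りK", "専有面積"], axis=1)
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
testX = explanatoryVarTest.to_numpy()

# Actual values for the test set
ansY = dfTest['賃料+管理費'].to_numpy()

# Predictions for the test set (uses the best estimator found by the grid search)
testY = clf.predict(testX)

# Compute RMSE
rmse[0] = np.sqrt(np.mean(np.square(np.array(testY - ansY))))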

# Visualize the results: sort listings by actual rent, then overlay the predictions
idx = np.argsort(np.array(ansY))
plt.plot(np.arange(0,len(ansY)),np.array(ansY)[idx], color='blue')
plt.plot(np.arange(0,len(testY)),np.array(testY)[idx], color='red', alpha=0.4)
plt.savefig("randomForest_正規化＿予測＆正解")
plt.show()
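After a grid search finishes, it is usually worth looking at which parameters won, the test RMSE, and which variables the forest relied on. The sketch below shows one way to do that on synthetic data with a deliberately tiny grid so that it runs quickly. The data, `small_grid`, and the `cv=3` setting are assumptions for illustration, not the values used in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the listings (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 3))
y = 5.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=120)
X_train, X_test, y_train, y_test = X[:96], X[96:], y[:96], y[96:]

# A small grid so the example finishes in a few seconds.
small_grid = {'n_estimators': [10, 30], 'max_depth': [None, 3]}
search = GridSearchCV(RandomForestRegressor(random_state=0), small_grid, cv=3)
search.fit(X_train, y_train)

# Best parameter combination found by cross-validation.
print(search.best_params_)

# RMSE on the held-out rows, computed the same way as in the post.
demo_rmse = np.sqrt(np.mean(np.square(search.predict(X_test) - y_test)))
print(demo_rmse)

# Relative importance of each explanatory variable in the best forest.
importances = search.best_estimator_.feature_importances_
print(importances)
```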