My random forest classifier worked fine until I added new features. Now, whenever I try to run it, I keep getting the following error:
\Anaconda2\lib\site-packages\sklearn\utils\validation.pyc in _assert_all_finite(X)
56 and not np.isfinite(X).all()):
57 raise ValueError("Input contains NaN, infinity"
---> 58 " or a value too large for %r." % X.dtype)
59
60
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
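train and test are both pandas DataFrame objects read from CSV files. I am trying to add more features to predict the target variable better, but every time I try to fit the model I end up with the error above. I did try to remove NaN and infinite values, but I still get the same error.
Here is my code: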
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
def features(df):
    df["num_photos"] = df["photos"].apply(len)
    df["num_features"] = df["features"].apply(len)
    df["year_created"] = df["created"].dt.year
    df["month_created"] = df["created"].dt.month
    df["day_created"] = df["created"].dt.day
    df["desc_len"] = df["description"].apply(lambda x: len(x.split(" ")))
    # New features begin here
    df["pricePerBed"] = df['price'] / df['bedrooms']
    df["pricePerBath"] = df['price'] / df['bathrooms']
    df["pricePerRoom"] = df['price'] / (df['bedrooms'] + df['bathrooms'])
    df["bedPerBath"] = df['bedrooms'] / df['bathrooms']
    df["bedBathDiff"] = df['bedrooms'] - df['bathrooms']
    df["bedBathSum"] = df["bedrooms"] + df['bathrooms']
    df["bedsPerc"] = df["bedrooms"] / (df['bedrooms'] + df['bathrooms'])
    df = df.replace([np.inf, -np.inf], np.nan)
    df = df.fillna(1)
    return df
features(train)
features(test)
key_features = ["bathrooms", "bedrooms", "latitude", "longitude", "year_created",
"month_created", "day_created", "price", "num_photos", "num_features", "desc_len",
"pricePerBed",
"pricePerBath",
"pricePerRoom",
#"bedPerBath",
"bedBathDiff",
"bedBathSum"]
X = train[key_features]
y = train["interest_level"]
X.fillna(1) #I tried getting rid of NaN
X.isnull().any()
X.isnull().any() returned True for the bedPerBath variable, so I left it out; all the other columns returned False. However, when I try to fit the estimator, I still get the ValueError.
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size = 0.3)
X_train.isnull().any()
clfRF = RandomForestClassifier(n_estimators = 1000)
clfRF.fit(X_train, y_train)
#CV
y_cv_pred = clfRF.predict_proba(X_cv)
log_loss(y_cv, y_cv_pred)
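I noticed that the error message says the value is too large for dtype('float32'), whereas my values are mostly float64. Could that be causing the error? If so, why?
Thanks.
Best answer
Try: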
import numpy as np
X_train, X_cv, y_train, y_cv = train_test_split(np.nan_to_num(X), y, test_size = 0.3)
clfRF = RandomForestClassifier(n_estimators = 1000)
clfRF.fit(X_train, y_train)
#CV
y_cv_pred = clfRF.predict_proba(X_cv)
log_loss(y_cv, y_cv_pred)
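
If the error persists, it may be worth checking two things that the snippet above does not cover. First, `replace` and `fillna` return new objects, so calling `features(train)` or `X.fillna(1)` without assigning the result leaves the original data untouched, and `isnull()` only flags NaN, not infinities. Second, `np.nan_to_num` maps infinities to the largest float64, which can overflow again once the data is cast to float32, the dtype named in the traceback. The following is a minimal sketch on made-up data, not the asker's dataset; `clean_features` is a hypothetical helper introduced only for illustration.

```python
import numpy as np
import pandas as pd

def clean_features(df):
    """Hypothetical helper: return a NEW frame with +/-inf and NaN set to 1."""
    return df.replace([np.inf, -np.inf], np.nan).fillna(1)

# Made-up data: the zero in "bedrooms" gives inf after the division.
train = pd.DataFrame({"price": [3000.0, 2500.0], "bedrooms": [2, 0]})
train["pricePerBed"] = train["price"] / train["bedrooms"]

# isnull() flags NaN only, so the inf slips past it.
missed_by_isnull = not bool(train["pricePerBed"].isnull().any())
has_inf = bool(np.isinf(train["pricePerBed"]).any())

clean_features(train)             # return value discarded: train is unchanged
still_inf = bool(np.isinf(train["pricePerBed"]).any())

train = clean_features(train)     # return value kept: inf is gone
now_finite = bool(np.isfinite(train["pricePerBed"]).all())

# nan_to_num turns inf into the largest float64, which overflows float32.
with np.errstate(over="ignore"):
    as_f32 = np.nan_to_num(np.array([np.inf])).astype(np.float32)
overflows_float32 = bool(np.isinf(as_f32).any())

print(missed_by_isnull, has_inf, still_inf, now_finite, overflows_float32)
```

Applied to the question's code, this would suggest writing `train = features(train)`, `test = features(test)` and `X = X.fillna(1)`, and checking the columns with `np.isfinite` rather than `isnull()` alone.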