Question
I am currently working on detecting outliers in my dataset using Isolation Forest in Python, and I did not completely understand the example and explanation given in the scikit-learn documentation.
Is it possible to use Isolation Forest to detect outliers in my dataset, which has 258 rows and 10 columns?
Do I need a separate dataset to train the model? If yes, does that training dataset need to be free of outliers?
Here is my code:
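import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# X_test was not defined in the original snippet; generated here the same way as X_train (assumption)
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
clf = IsolationForest(max_samples=100, random_state=rng, contamination='auto')
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
print(len(y_pred_train))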
I tried loading my own dataset into X_train, but that does not seem to work.
Recommended answer
The short answer is "No": you do not need a separate dataset. You fit the model and predict outliers on the same data.
IsolationForest is an unsupervised learning algorithm intended to clean outliers out of your data (see the docs for more). In a usual machine learning setting, you would run it to clean your training dataset. As far as your toy example is concerned:
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]  # two clusters of 100 points each
# The behaviour="new" argument needed by older scikit-learn releases has since been removed
clf = IsolationForest(max_samples=100, random_state=rng, contamination=0.1)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_train
array([ 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1,
1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, -1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, -1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, -1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
where 1 represents inliers and -1 represents outliers. As specified by the contamination parameter, the fraction of outliers is 0.1.
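One way to check that fraction is to count the labels directly. This is a minimal sketch that rebuilds the toy data from the answer; the name outlier_fraction is introduced here for illustration only, and the exact value may differ slightly from 0.1 if several points share the same score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

clf = IsolationForest(max_samples=100, random_state=rng, contamination=0.1)
y_pred_train = clf.fit(X_train).predict(X_train)

# Share of training points labelled -1 (outlier); expected to sit close to contamination
outlier_fraction = np.mean(y_pred_train == -1)
print(outlier_fraction)
```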
Finally, you would remove the outliers like this:
X_train_cleaned = X_train[y_pred_train == 1]  # boolean mask keeps only the inliers
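The same pattern should carry over to a dataset of the size mentioned in the question. The sketch below is an illustration only: the 258 x 10 array is random stand-in data (the real dataset is not shown), contamination=0.1 is an assumed value rather than a recommendation, and the names X_data, labels, and X_cleaned are hypothetical. fit_predict fits the model and labels the same rows in one call.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in for the real dataset: 258 rows, 10 numeric columns (assumed)
rng = np.random.RandomState(0)
X_data = rng.randn(258, 10)

# contamination=0.1 is an assumption; pick a value that matches your data
clf = IsolationForest(contamination=0.1, random_state=0)
labels = clf.fit_predict(X_data)  # 1 = inlier, -1 = outlier

X_cleaned = X_data[labels == 1]   # keep only the rows labelled as inliers
print(X_data.shape, X_cleaned.shape)
```

With real data, X_data would be replaced by your own numeric array, for example the values of a pandas DataFrame.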