Why are Keras results worse than Random Forest or k-NN?

Question

I'm learning deep learning with Keras and trying to compare the results (accuracy) with machine learning algorithms from sklearn (i.e. random forest, k_neighbors).

It seems that with Keras I'm getting the worst results. I'm working on a simple classification problem, the iris dataset. My Keras code looks like this:

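import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Dense

# load iris into a DataFrame: four feature columns plus the target
samples = datasets.load_iris()
X = samples.data
y = samples.target
df = pd.DataFrame(data=X)
df.columns = samples.feature_names
df['Target'] = y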

# prepare data
X = df[df.columns[:-1]]
y = df[df.columns[-1]]

# one-hot encode the labels
encoder = LabelEncoder()
y1 = encoder.fit_transform(y)
y = pd.get_dummies(y1).values

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# build model
model = Sequential()
model.add(Dense(1000, activation='tanh', input_shape = ((df.shape[1]-1),)))
model.add(Dense(500, activation='tanh'))
model.add(Dense(250, activation='tanh'))
model.add(Dense(125, activation='tanh'))
model.add(Dense(64, activation='tanh'))
model.add(Dense(32, activation='tanh'))
model.add(Dense(9, activation='tanh'))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train)
score, acc = model.evaluate(X_test, y_test, verbose=0)

#results:
#score = 0.77
#acc = 0.711

I have tried adding layers, changing the number of units per layer, and changing the activation function (to relu), but the results never seem to exceed 0.85.

With sklearn random forest or k_neighbors I'm getting results (on the same dataset) above 0.95.
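
For reference, baselines of this kind can be sketched as follows. The question does not show the sklearn code, so the default hyperparameters and the fixed random_state below are assumptions made only so the sketch is repeatable:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Same 70/30 split ratio as in the Keras code; the seed is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Default hyperparameters (assumed; the question does not state them).
classifiers = (RandomForestClassifier(random_state=0), KNeighborsClassifier())

scores = {}
for clf in classifiers:
    clf.fit(X_train, y_train)
    scores[type(clf).__name__] = clf.score(X_test, y_test)

for name, acc in scores.items():
    print(f"{name}: test accuracy = {acc:.3f}")
```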

  1. What am I missing?

With sklearn I made little effort and got good results, while with Keras I made many changes and still did not match the sklearn results. Why is that?

How can I get the same results with Keras?

Recommended answer

In a nutshell, you need:

  1. ReLU activation
  2. A simpler model
  3. Data normalization
  4. More epochs

In detail:

The first issue here is that nowadays we never use activation='tanh' for the intermediate network layers. In such problems, we practically always use activation='relu'.
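
One commonly cited reason for preferring ReLU in hidden layers is that tanh saturates: its gradient shrinks toward zero as the input grows, while the ReLU gradient stays at 1 for any positive input. A small NumPy sketch of that effect (the input values are arbitrary, chosen only for illustration):

```python
import numpy as np

# Arbitrary pre-activation values, chosen only for illustration.
z = np.array([0.5, 2.0, 5.0])

# The derivative of tanh(z) is 1 - tanh(z)^2; it decays quickly as |z| grows.
tanh_grad = 1.0 - np.tanh(z) ** 2

# The derivative of ReLU is 1 for positive inputs and 0 for negative ones.
relu_grad = (z > 0).astype(float)

print("tanh gradient:", tanh_grad)
print("relu gradient:", relu_grad)
```

In a deep stack of tanh layers these small factors multiply during backpropagation, which can slow training considerably.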

The second issue is that you have built quite a large Keras model, and it may well be that with only about 100 iris samples in your training set you have too little data to train such a large model effectively. Try drastically reducing both the number of layers and the number of nodes per layer. Start simpler.

Large neural networks really thrive when we have lots of data, but with small datasets, like here, their expressiveness and flexibility may become a liability instead, compared with simpler algorithms like RF or k-nn.
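
To put "large" into numbers, here is a rough sketch that counts the trainable parameters of a stack of Dense layers (weights plus biases). The helper dense_param_count is hypothetical, written only for this illustration; the layer sizes are those of the model in the question and of the two-hidden-layer model of 150 units each used later in this answer:

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 4 iris features in, 3 classes out; hidden sizes taken from the question.
original = [4, 1000, 500, 250, 125, 64, 32, 9, 3]
# Hidden sizes of the simplified model in this answer.
simplified = [4, 150, 150, 3]

n_train = 105  # 70% of the 150 iris samples (test_size=0.3)

for name, sizes in [("original", original), ("simplified", simplified)]:
    n_params = dense_param_count(sizes)
    print(f"{name}: {n_params} parameters, "
          f"{n_params / n_train:.0f} per training sample")
```

Under these assumptions the original stack has about 670,000 trainable parameters, against about 24,000 for the simplified one, with 105 training samples in both cases.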

The third issue is that, in contrast to tree-based models like Random Forests, neural networks generally require normalizing the data, which you don't do. Truth is that knn also requires normalized data, but here, since all iris features are on the same scale, the lack of normalization does not hurt its performance in this special case.
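
The claim about iris features sharing a scale is easy to check. A minimal sketch that prints the range of each feature as shipped with sklearn:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

# All four features are lengths in centimetres, so their ranges are comparable.
for name, lo, hi in zip(iris.feature_names, X.min(axis=0), X.max(axis=0)):
    print(f"{name}: min={lo:.1f}, max={hi:.1f}")
```

Every value falls within a single-digit range of centimetres, so no feature dominates a distance computation. A neural network can still benefit from centering and rescaling these inputs.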

Last but not least, you seem to run your Keras model for only one epoch (the default value if you don't specify anything in model.fit); this is somewhat equivalent to building a random forest with a single tree (which, BTW, is still much better than a single decision tree).
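
A back-of-the-envelope sketch of what one epoch means here, assuming the 70/30 split from the question and the Keras default batch_size of 32 (the question does not override it):

```python
import math

n_train = 105      # 70% of the 150 iris samples (test_size=0.3)
batch_size = 32    # Keras default when model.fit is given no batch_size

# Each epoch passes over the training set once, one batch per weight update.
updates_per_epoch = math.ceil(n_train / batch_size)

for epochs in (1, 100):
    print(f"epochs={epochs}: {epochs * updates_per_epoch} weight updates")
```

So a single epoch amounts to only 4 gradient updates, while 100 epochs give 400.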

All in all, with the following changes in your code:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

model = Sequential()
model.add(Dense(150, activation='relu', input_shape = ((df.shape[1]-1),)))
model.add(Dense(150, activation='relu'))
model.add(Dense(y.shape[1], activation='softmax'))

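# compile exactly as in the question, then train for more epochs
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100)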

and everything else as is, we get:

score, acc = model.evaluate(X_test, y_test, verbose=0)
acc
# 0.9333333373069763

We can do better: use slightly more training data and stratify the split, i.e.

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.20, # a few more samples for training
                                                    stratify=y)
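
To see what stratification buys, the class counts on each side of the split can be inspected. This sketch uses the integer labels from load_iris rather than the one-hot matrix above, purely to keep the counting simple:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # y holds integer class labels 0, 1, 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y)

# With stratification the 50/50/50 class balance is preserved in both parts.
print("train counts per class:", np.bincount(y_train))
print("test counts per class: ", np.bincount(y_test))
```

Without stratify, a 30-sample test set can easily end up with an uneven class mix, which makes the reported accuracy noisier.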

And with the same model and training epochs you can get a perfect accuracy of 1.0 in the test set:

score, acc = model.evaluate(X_test, y_test, verbose=0)
acc
# 1.0

(Details might differ due to some randomness imposed by default in such experiments.)
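
One source of that randomness is the train/test split itself. A minimal sketch of pinning it down with random_state follows; the helper split and the seed value 42 are hypothetical, chosen only for illustration. Keras weight initialization and batch shuffling add further randomness that this sketch does not address.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

def split(seed):
    """Stratified 80/20 split; seed=None gives a different split on each call."""
    return train_test_split(X, y, test_size=0.20, stratify=y, random_state=seed)

# The seed value 42 is arbitrary, chosen only for illustration.
_, X_test_a, _, _ = split(42)
_, X_test_b, _, _ = split(42)
print("same test set with a fixed seed:", np.array_equal(X_test_a, X_test_b))
```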

