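I am currently learning the ropes of modelling with WEKA using the free UCI breast cancer .arff file, and with the help of various posts here I have been able to tweak its accuracy to somewhere between 63% and 73%. I am using WEKA 3.7.10 on a Windows 7 Starter machine.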


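I used attribute selection to reduce the number of variables, using InfoGainAttributeEval together with Ranker. I chose the top five, with the following results: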

    Evaluator:    weka.attributeSelection.InfoGainAttributeEval
    Search:       weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
    Relation:     breast-cancer
    Instances:    286
    Attributes:   10
                 age
                 menopause
                 tumor-size
                 inv-nodes
                 node-caps
                 deg-malig
                 breast
                 breast-quad
                 irradiat
                 Class
    Evaluation mode:    10-fold cross-validation



    === Attribute selection 10 fold cross-validation (stratified), seed: 1 ===

    average merit      average rank  attribute
    0.078 +- 0.011     1.3 +- 0.64    6 deg-malig
    0.071 +- 0.01      1.9 +- 0.3     4 inv-nodes
    0.061 +- 0.008     3   +- 0.77    3 tumor-size
    0.051 +- 0.007     3.8 +- 0.4     5 node-caps
    0.026 +- 0.006     5   +- 0       9 irradiat
    0.012 +- 0.003     6.4 +- 0.49    1 age
    0.01  +- 0.003     6.6 +- 0.49    8 breast-quad
    0.003 +- 0.001     8.5 +- 0.5     7 breast
    0.003 +- 0.002     8.5 +- 0.5     2 menopause
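
For readers unfamiliar with the "merit" column: information gain measures how much knowing an attribute's value reduces the entropy of the class. The following is a minimal sketch of that quantity for nominal attributes, assuming entropy is measured in bits. The function names and the toy data are hypothetical, and this is not WEKA's own implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(values, labels):
    """H(class) - H(class | attribute) for one nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the class entropy inside each attribute value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data, not taken from the breast-cancer file.
labels = ["no", "no", "no", "yes", "yes", "yes"]
informative = ["1", "1", "1", "3", "3", "3"]      # separates the classes
uninformative = ["left", "right", "left", "right", "left", "right"]

print(info_gain(informative, labels))
print(info_gain(uninformative, labels))
```

An attribute that splits the classes cleanly scores close to the full class entropy, while one that barely changes the class mix scores close to zero, which is the situation of the lowest-ranked attributes in the listing.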

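After removing the lower-ranked variables, I went on to build my model. I chose the Multilayer Perceptron because it is a required algorithm in the journal paper my study is based on.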


I used 0.1 for both the learning rate and the momentum, following the suggestion of Bernhard Pfahringer, and powers of two (1, 2, 4, 8, and so on) for the number of hidden nodes and epochs.

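After several attempts with this approach, I noticed a pattern: using 2 for the hidden layer together with a power-of-two epoch count (512, 1024, 2048, ...) gave increasing accuracy. For example, 2 hidden nodes with 1024 epochs, and so on.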
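
The search described here amounts to a small grid over two settings. The sketch below only builds the option strings for such a grid. The flag layout is copied from the scheme line reported further down, while the upper bound on hidden nodes, the epoch range, and the helper name `mlp_options` are assumptions made for illustration.

```python
from itertools import product

# Powers of two; both ranges are assumed, the post only says "and so on".
hidden_nodes = [2 ** k for k in range(0, 5)]    # 1, 2, 4, 8, 16
epochs = [2 ** k for k in range(9, 15)]         # 512 ... 16384

def mlp_options(hidden, n_epochs, rate=0.1, momentum=0.1):
    """Option string in the same layout as the reported scheme line."""
    return (f"-L {rate} -M {momentum} -N {n_epochs} "
            f"-V 0 -S 0 -E 20 -H {hidden}")

grid = [mlp_options(h, n) for h, n in product(hidden_nodes, epochs)]

print(len(grid))                # 30 combinations
print(mlp_options(2, 16384))    # -L 0.1 -M 0.1 -N 16384 -V 0 -S 0 -E 20 -H 2
```

Each string would then be evaluated with the same cross-validation setup so that the runs are comparable.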

I got varied results, but the highest accuracy so far came from the following run (using 2 hidden nodes and 16384 epochs):

    Scheme:       weka.classifiers.functions.MultilayerPerceptron -L 0.1 -M 0.1 -N 16384 -V 0 -S 0 -E 20 -H 2
    Relation:     breast-cancer-weka.filters.unsupervised.attribute.Remove-R1-2,7-8
    Instances:    286
    Attributes:   6
                  tumor-size
                  inv-nodes
                  node-caps
                  deg-malig
                  irradiat
                  Class
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    Sigmoid Node 0
        Inputs    Weights
        Threshold    -2.4467109489840375
        Node 2    2.960926490700117
        Node 3    1.5276384018358489
    Sigmoid Node 1
        Inputs    Weights
        Threshold    2.446710948984037
        Node 2    -2.9609264907001167
        Node 3    -1.5276384018358493
    Sigmoid Node 2
        Inputs    Weights
        Threshold    0.8594931368555995
        Attrib tumor-size=0-4    -0.6809394102558067
        Attrib tumor-size=5-9    -0.7999278705976403
        Attrib tumor-size=10-14    -0.5139914771540879
        Attrib tumor-size=15-19    2.3071396030112834
        Attrib tumor-size=20-24    -6.316868254289899
        Attrib tumor-size=25-29    5.535754474315768
        Attrib tumor-size=30-34    -12.31495416708197
        Attrib tumor-size=35-39    2.165860489861981
        Attrib tumor-size=40-44    10.740913335424047
        Attrib tumor-size=45-49    9.102261927484186
        Attrib tumor-size=50-54    -17.072392893550735
        Attrib tumor-size=55-59    0.043056333044031
        Attrib inv-nodes=0-2    9.578867366884618
        Attrib inv-nodes=3-5    1.3248317047328586
        Attrib inv-nodes=6-8    -5.081199984305494
        Attrib inv-nodes=9-11    -8.604844224457239
        Attrib inv-nodes=12-14    2.2330604430275907
        Attrib inv-nodes=15-17    -2.8692154868988355
        Attrib inv-nodes=18-20    0.04225234708199947
        Attrib inv-nodes=21-23    0.017664071511846485
        Attrib inv-nodes=24-26    -0.9992481277256989
        Attrib inv-nodes=27-29    -0.02737484354173595
        Attrib inv-nodes=30-32    -0.04607516719307534
        Attrib inv-nodes=33-35    -0.038969156415242706
        Attrib inv-nodes=36-39    0.03338452826774849
        Attrib node-caps    6.764954936579671
        Attrib deg-malig=1    -5.037151186065571
        Attrib deg-malig=2    12.469858109768378
        Attrib deg-malig=3    -8.382625277311769
        Attrib irradiat    8.302010702287868
    Sigmoid Node 3
        Inputs    Weights
        Threshold    -0.7428771456532647
        Attrib tumor-size=0-4    3.5709673152321555
        Attrib tumor-size=5-9    3.563713261511895
        Attrib tumor-size=10-14    7.86118954430952
        Attrib tumor-size=15-19    2.8762105204084167
        Attrib tumor-size=20-24    4.60168522637948
        Attrib tumor-size=25-29    -5.849391383398816
        Attrib tumor-size=30-34    -1.6805815971562046
        Attrib tumor-size=35-39    -12.022394228003419
        Attrib tumor-size=40-44    11.922229608392747
        Attrib tumor-size=45-49    -1.9939414047194557
        Attrib tumor-size=50-54    -5.9801974214306215
        Attrib tumor-size=55-59    -0.04909236196295539
        Attrib inv-nodes=0-2    5.569516359775502
        Attrib inv-nodes=3-5    -7.871275549119543
        Attrib inv-nodes=6-8    3.405277467966008
        Attrib inv-nodes=9-11    -0.3253699778307026
        Attrib inv-nodes=12-14    1.244234346055825
        Attrib inv-nodes=15-17    1.179311225120216
        Attrib inv-nodes=18-20    0.03495291263409073
        Attrib inv-nodes=21-23    0.0043299366591334695
        Attrib inv-nodes=24-26    0.6595250300030937
        Attrib inv-nodes=27-29    -0.02503529326219822
        Attrib inv-nodes=30-32    0.041787638417097844
        Attrib inv-nodes=33-35    0.008416652090130837
        Attrib inv-nodes=36-39    -0.014551878794926747
        Attrib node-caps    4.7997880904143955
        Attrib deg-malig=1    1.6752746955482163
        Attrib deg-malig=2    6.130488722916935
        Attrib deg-malig=3    -6.989852429736567
        Attrib irradiat    8.716254786514295
    Class no-recurrence-events
        Input
        Node 0
    Class recurrence-events
        Input
        Node 1


    Time taken to build model: 27.05 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances         210               73.4266 %
    Incorrectly Classified Instances        76               26.5734 %
    Kappa statistic                          0.2864
    Mean absolute error                      0.3312
    Root mean squared error                  0.4494
    Relative absolute error                 79.1456 %
    Root relative squared error             98.3197 %
    Coverage of cases (0.95 level)          98.951  %
    Mean rel. region size (0.95 level)      97.7273 %
    Total Number of Instances              286

    === Detailed Accuracy By Class ===

                     TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                     0.891    0.635    0.768      0.891    0.825      0.300    0.633     0.748     no-recurrence-events
                     0.365    0.109    0.585      0.365    0.449      0.300    0.633     0.510     recurrence-events
    Weighted Avg.    0.734    0.479    0.714      0.734    0.713      0.300    0.633     0.677

    === Confusion Matrix ===

       a   b   <-- classified as
     179  22 |   a = no-recurrence-events
      54  31 |   b = recurrence-events
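
As a sanity check, the headline figures can be recomputed from the confusion matrix alone. The sketch below uses the four counts from the matrix above and the standard definition of Cohen's kappa; the variable names are mine.

```python
# Rows are actual classes, columns are predicted classes.
matrix = [[179, 22],   # actual no-recurrence-events
          [54, 31]]    # actual recurrence-events

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

# Accuracy of always predicting the larger class.
row_totals = [sum(row) for row in matrix]
baseline = max(row_totals) / total

# Cohen's kappa: agreement beyond what the marginals give by chance.
col_totals = [sum(col) for col in zip(*matrix)]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

# Share of actual recurrence cases that were detected.
recall_recurrence = matrix[1][1] / row_totals[1]

print(f"accuracy          {accuracy:.4f}")           # 0.7343
print(f"majority baseline {baseline:.4f}")           # 0.7028
print(f"kappa             {kappa:.4f}")              # 0.2864
print(f"recurrence recall {recall_recurrence:.4f}")  # 0.3647
```

The model is therefore only about three percentage points above always predicting no-recurrence-events, and it catches 31 of the 85 recurrence cases.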


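My question is: what else can I do to raise the accuracy on this data to at least 90%?
Do I have to apply filtering, or use a different pattern of MLP input parameters?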

I plan to use another data set (about 50 variables and 100,000 instances) once I have learned how to work with this one.

Best answer

There is obviously no single good answer to a question like this, but here are some more or less general hints for working with an MLP:


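- First of all, why are you removing features when the data set is this small? Feature selection matters for high-dimensional problems and/or computationally expensive models. Neither applies to the breast cancer data and an MLP.
- Iteration count is the worst possible stopping criterion for an MLP. You should stop training when the validation error starts to increase, not after a fixed number of iterations.
- I do not know which cost function you are using, but the most important part is regularization, because MLPs are prone to overfitting. Some form of Tikhonov (L2) regularization is the minimum requirement.
- Using more than one hidden layer is completely redundant for this problem. In particular, training multiple hidden layers in an MLP is often infeasible because of the vanishing gradient phenomenon.
- To get rid of the learning algorithm's own parameters, I would also suggest dropping the naive algorithm (plain backpropagation) and using at least resilient propagation (Rprop), which has been shown to work very well in many applications.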
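
To make three of these hints concrete, here is a minimal sketch that combines an L2 penalty, early stopping on a held-out validation split, and sign-based Rprop updates (the iRprop- variant). It is a sketch under several assumptions: a single sigmoid unit stands in for a full MLP to keep the code short, the data are synthetic, and the penalty strength, patience and step factors are commonly used defaults rather than tuned values. None of it is WEKA code. In the scheme line from the question, `-V 0` appears to mean that no validation split was held out, which would leave the epoch count `-N` as the only stopping rule.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() in range
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(w, data, l2):
    """Mean cross-entropy plus 0.5 * l2 * ||w||^2, and its gradient.
    For brevity the bias weight is penalized as well."""
    n = len(data)
    loss, grad = 0.0, [0.0] * len(w)
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        loss -= (y * math.log(p) + (1 - y) * math.log(1.0 - p)) / n
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj / n
    loss += 0.5 * l2 * sum(wi * wi for wi in w)
    return loss, [g + l2 * wi for g, wi in zip(grad, w)]

def rprop_step(w, grad, prev_grad, steps, up=1.2, down=0.5,
               step_min=1e-6, step_max=1.0):
    """One iRprop- update in place: only the sign of each gradient is used."""
    for j in range(len(w)):
        agreement = grad[j] * prev_grad[j]
        if agreement > 0:      # same sign as last time: grow the step
            steps[j] = min(steps[j] * up, step_max)
        elif agreement < 0:    # sign flipped: shrink the step, skip this update
            steps[j] = max(steps[j] * down, step_min)
            grad[j] = 0.0
        if grad[j] > 0:
            w[j] -= steps[j]
        elif grad[j] < 0:
            w[j] += steps[j]
        prev_grad[j] = grad[j]

def train(train_set, val_set, l2=0.01, patience=20, max_epochs=2000):
    dim = len(train_set[0][0])
    w, prev_grad, steps = [0.0] * dim, [0.0] * dim, [0.1] * dim
    best_w, best_val, best_epoch = list(w), float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        _, grad = loss_and_grad(w, train_set, l2)
        rprop_step(w, grad, prev_grad, steps)
        val_loss, _ = loss_and_grad(w, val_set, 0.0)  # no penalty when validating
        if val_loss < best_val:
            best_w, best_val, best_epoch = list(w), val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss has not improved for `patience` epochs
    return best_w, best_val, best_epoch

# Synthetic, noisy two-feature problem; x[0] is a constant bias input.
rng = random.Random(0)

def make_point():
    x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    y = 1 if x1 + 0.5 * x2 + rng.gauss(0, 0.5) > 0 else 0
    return [1.0, x1, x2], y

data = [make_point() for _ in range(200)]
train_set, val_set = data[:140], data[140:]

best_w, best_val, best_epoch = train(train_set, val_set)
print(f"best validation loss {best_val:.4f} at epoch {best_epoch}")
```

The weights that are returned are the ones from the best validation epoch, not the last epoch, so the number of epochs no longer needs to be tuned by hand. Rprop also removes the learning rate and momentum from the list of settings, because each weight adapts its own step size.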
