java - WEKA-generated models do not seem to predict class and distribution given the attribute index

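Overview

I am using the WEKA API 3.7.10 (developer version) to run my pre-built .model files.

I built 25 models: five outcome variables for each of five algorithms.

J48 decision tree
Alternating decision tree
Random forest
LogitBoost
Random subspace

I am having problems with J48, Random subspace and Random forest.

Necessary files

The following is the ARFF representation of my data after creation: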

@relation WekaData

@attribute ageDiagNum numeric
@attribute raceGroup {Black,Other,Unknown,White}
@attribute stage3 {0,I,IIA,IIB,IIIA,IIIB,IIIC,IIINOS,IV,'UNK Stage'}
@attribute m3 {M0,M1,MX}
@attribute reasonNoCancerSurg {'Not performed, patient died prior to recommended surgery','Not recommended','Not recommended, contraindicated due to other conditions','Recommended but not performed, patient refused','Recommended but not performed, unknown reason','Recommended, unknown if performed','Surgery performed','Unknown; death certificate or autopsy only case'}
@attribute ext2 {00,05,10,11,13,14,15,16,17,18,20,21,23,24,25,26,27,28,30,31,33,34,35,36,37,38,40,50,60,70,80,85,99}
@attribute time2 {}
@attribute time4 {}
@attribute time6 {}
@attribute time8 {}
@attribute time10 {}

@data
65,White,IIA,MX,'Not recommended, contraindicated due to other conditions',14,?,?,?,?,?

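I need to obtain the binary attributes time2 to time10 from their respective models.

Below is the code snippet I use to get predictions from all the model files: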

private static Map<String, Object> predict(Instances instances,
        Classifier classifier, int attributeIndex) {
    Map<String, Object> map = new LinkedHashMap<String, Object>();
    int instanceIndex = 0; // do not change, equal to row 1
    double[] percentage = { 0 };
    double outcomeValue = 0;
    AbstractOutput abstractOutput = null;

    if(classifier.getClass() == RandomForest.class || classifier.getClass() == RandomSubSpace.class) {
        // has problems predicting time2 to time10
        instances.setClassIndex(5);
    } else {
        // works as intended in LogitBoost and ADTree
        instances.setClassIndex(attributeIndex);
    }

    try {
        outcomeValue = classifier.classifyInstance(instances.instance(0));
        percentage = classifier.distributionForInstance(instances
                .instance(instanceIndex));
    } catch (Exception e) {
        e.printStackTrace();
    }

    map.put("Class", outcomeValue);

    if (percentage.length > 0) {
        double percentageRaw = 0;
        if (outcomeValue == new Double(1)) {
            percentageRaw = percentage[1];
        } else {
            percentageRaw = 1 - percentage[0];
        }
        map.put("Percentage", percentageRaw);
    } else {
        // because J48 returns an error if percentage[i] because it's empty
        map.put("Percentage", new Double(0));
    }

    return map;
}
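
The class and probability lookup at the end of predict() can be sketched without WEKA. This is a minimal illustration, assuming the distribution array holds one probability per class label and that the predicted class is the index of the largest entry; the class name DistributionSummary and the methods argMax and summarize are hypothetical helpers, not part of the WEKA API. It also returns a sentinel instead of failing when the array is empty.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of WEKA: summarizes a class distribution.
public class DistributionSummary {

    /** Returns the index of the largest probability, or -1 for an empty array. */
    public static int argMax(double[] distribution) {
        int best = -1;
        for (int i = 0; i < distribution.length; i++) {
            if (best < 0 || distribution[i] > distribution[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Builds a "Class"/"Percentage" map, guarding against an empty distribution. */
    public static Map<String, Object> summarize(double[] distribution) {
        Map<String, Object> map = new LinkedHashMap<String, Object>();
        int predicted = argMax(distribution);
        map.put("Class", (double) predicted);
        // Report 0.0 when no class could be predicted (empty distribution).
        map.put("Percentage", predicted < 0 ? 0.0 : distribution[predicted]);
        return map;
    }

    public static void main(String[] args) {
        System.out.println(summarize(new double[] {0.25, 0.75})); // {Class=1.0, Percentage=0.75}
        System.out.println(summarize(new double[0]));             // {Class=-1.0, Percentage=0.0}
    }
}
```

Reading the probability at the predicted index avoids the separate percentage[1] and 1 - percentage[0] branches, which only agree when there are exactly two classes.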

Below are the models I use to predict the time2 outcome, so we will be using index 6:

instances.setClassIndex(5);

ADTree model for time2 prediction
J48 model for time2 prediction
RandomForest model for time2 prediction
LogitBoost model for time2 prediction
RandomSubSpace model for time2 prediction

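Problem

As mentioned above, LogitBoost and ADTree have no problems with this straightforward approach, unlike the other three, and I followed the "Use WEKA in your Java code" tutorial.
[Solved] With my adjustments, RandomForest and RandomSubSpace throw an
ArrayIndexOutOfBoundsException when told to predict time2 to time10.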

java.lang.ArrayIndexOutOfBoundsException: 0
    at weka.classifiers.meta.Bagging.distributionForInstance(Bagging.java:586)
    at weka.classifiers.trees.RandomForest.distributionForInstance(RandomForest.java:602)
    at weka.classifiers.AbstractClassifier.classifyInstance(AbstractClassifier.java:70)

The stack trace points to this line as the root of the error:

outcomeValue = classifier.classifyInstance(instances.instance(0));

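Solution: During ARFF file creation, I made some copy-paste errors when assigning the values in FastVector<String>() to the FastVector<Attribute>() object for the binary variables time2 to time10. All ten of my RandomForest and RandomSubSpace models now work fine!

[Solved] The J48 decision tree now has a new problem. Instead of giving any prediction, it now returns an error: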

java.lang.ArrayIndexOutOfBoundsException: 11
    at weka.core.DenseInstance.value(DenseInstance.java:332)
    at weka.core.AbstractInstance.isMissing(AbstractInstance.java:315)
    at weka.classifiers.trees.j48.C45Split.whichSubset(C45Split.java:494)
    at weka.classifiers.trees.j48.ClassifierTree.getProbs(ClassifierTree.java:670)
    at weka.classifiers.trees.j48.ClassifierTree.classifyInstance(ClassifierTree.java:231)
    at weka.classifiers.trees.J48.classifyInstance(J48.java:266)

It traces back to the line

outcomeValue = classifier.classifyInstance(instances.instance(0));

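Solution: Actually, I ran the program with J48 on a whim and it worked, giving both the outcome variable and the associated distribution.

I hope somebody can help me solve this problem. I really don't know what's wrong with this code, because I have checked the Javadocs and examples online, and the constant predictions still persist.

(I am currently checking the main program of the WEKA GUI, but please help me out here :-) )

Best answer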

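So far I have only looked at the RandomForest problem. It happens because the Bagging class
takes the number of distinct classes from the data instance itself rather than from the model.
You say in your text that time2 to time10 are binary, but you do not say so in your ARFF file,
so the Bagging class does not know how many classes there are.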
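
To make that failure mode concrete, here is a simplified stand-in, not WEKA's actual Bagging code. It assumes an ensemble that sizes its result array from the number of class labels declared in the data header and then adds up the members' distributions; the class name EmptyClassAttributeDemo and the method averageVotes are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration, not WEKA source: an ensemble that averages member votes.
public class EmptyClassAttributeDemo {

    /**
     * Averages the members' class distributions. The result array is sized
     * from the number of labels declared in the data header, not from
     * anything stored in the model.
     */
    public static double[] averageVotes(double[][] memberDistributions, int numDeclaredClasses) {
        double[] sums = new double[numDeclaredClasses];
        for (double[] member : memberDistributions) {
            for (int j = 0; j < member.length; j++) {
                sums[j] += member[j]; // fails at j = 0 when no labels are declared
            }
        }
        for (int j = 0; j < sums.length; j++) {
            sums[j] /= memberDistributions.length;
        }
        return sums;
    }

    public static void main(String[] args) {
        double[][] votes = { {1.0, 0.0}, {0.5, 0.5} };

        // Header declares two labels, as in "@attribute time2 {0,1}".
        System.out.println(Arrays.toString(averageVotes(votes, 2)));

        // Header declares no labels, as in "@attribute time2 {}".
        try {
            averageVotes(votes, 0);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Failed: " + e);
        }
    }
}
```

With an empty label list the array has length zero, so the first write at index 0 is already out of bounds, which is consistent with the index 0 reported in the stack trace.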

So you only need to specify in your ARFF file that time2 is binary, for example:
@attribute time2 {0,1}

Then you will no longer get any exception.
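
A quick way to catch this kind of header mistake before loading a model is to scan the ARFF header for nominal attributes with an empty value list. The following is a minimal sketch using only the Java standard library; the class name ArffHeaderCheck and the method findEmptyNominals are hypothetical, and the pattern assumes unquoted attribute names without spaces, one declaration per line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of WEKA: flags "@attribute name {}" declarations.
public class ArffHeaderCheck {

    // Matches "@attribute <name> { }" where the braces hold only whitespace.
    private static final Pattern EMPTY_NOMINAL = Pattern.compile(
            "^\\s*@attribute\\s+(\\S+)\\s*\\{\\s*\\}\\s*$", Pattern.CASE_INSENSITIVE);

    /** Returns the names of nominal attributes declared with no values. */
    public static List<String> findEmptyNominals(List<String> headerLines) {
        List<String> names = new ArrayList<String>();
        for (String line : headerLines) {
            Matcher m = EMPTY_NOMINAL.matcher(line);
            if (m.matches()) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
                "@relation WekaData",
                "@attribute ageDiagNum numeric",
                "@attribute m3 {M0,M1,MX}",
                "@attribute time2 {}",
                "@attribute time4 {0,1}");
        System.out.println(findEmptyNominals(header)); // [time2]
    }
}
```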

I have not looked into the J48 problem, because it is probably the same issue with the ARFF definition.

Test code:

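import java.util.Arrays;

import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// The class name is arbitrary; it only wraps main() so the snippet compiles.
public class J48ModelTest {
  public static void main(String[] argv) {
    try {
      // Load the serialized J48 model.
      Classifier cls = (Classifier) weka.core.SerializationHelper.read("bosom.100k.2.j48.MODEL");
      J48 c = (J48) cls;

      // Load the ARFF data and mark time2 (zero-based index 6) as the class attribute.
      DataSource source = new DataSource("data.arff");
      Instances data = source.getDataSet();
      data.setClassIndex(6);

      // Predict the class of the first row and print its class distribution.
      double outcomeValue = c.classifyInstance(data.instance(0));
      System.out.println("outcome " + outcomeValue);
      double[] p = c.distributionForInstance(data.instance(0));
      System.out.println(Arrays.toString(p));
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}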

Regarding java - WEKA-generated models do not seem to predict class and distribution given the attribute index, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21808033/