I am using a Pandas DataFrame with train_test_split from sklearn's cross_validation module.

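import numpy as np
import pandas
import sklearn.cross_validation  # needed for the sklearn.cross_validation.* calls below

# 300 rows: feature 'a' is random noise; label 'c' is 100 ones followed by 200 zeros
d = pandas.DataFrame({'a': np.random.randn(300),
                      'c': np.concatenate([np.ones(100), np.zeros(200)])})
(X, y) = (d['a'], d['c'])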

This works:
X_train_and_cv, X_test,y_train_and_cv,y_test = sklearn.cross_validation.train_test_split(X,y,test_size=0.2,random_state=0)
X_train, X_cv,y_train,y_cv = sklearn.cross_validation.train_test_split(X_train_and_cv,y_train_and_cv,test_size=0.2,random_state=0)

Why doesn't this work?
X_train_and_cv, X_test,y_train_and_cv,y_test = sklearn.cross_validation.train_test_split(X,y,test_size=0.2,random_state=0,stratify=y)
X_train, X_cv,y_train,y_cv = sklearn.cross_validation.train_test_split(X_train_and_cv,y_train_and_cv,test_size=0.2,random_state=0,stratify=y)

in _is_valid_list_like(self, key, axis)
   1536         l = len(ax)
   1537         if len(arr) and (arr.max() >= l or arr.min() < -l):
-> 1538             raise IndexError("positional indexers are out-of-bounds")
   1539
   1540         return True

IndexError: positional indexers are out-of-bounds

Best answer

TL;DR: Your second call to train_test_split passes a stratify array whose length differs from the y you are splitting. Use stratify=y_train_and_cv.

First, a side note: cross_validation (0.17.1 docs) was deprecated in 0.18 and removed in 0.20, so you should use model_selection.train_test_split (0.18.1) instead. I'll import train_test_split itself to keep the calls below shorter:

# Same as this in older versions:
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
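
If the same script has to run on both old and new scikit-learn installs, a guarded import is one option. This is only a sketch and assumes the two functions are interchangeable for the calls used here:

```python
# Prefer the current location; fall back to the pre-0.18 module if it is missing.
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    # Only reached on old scikit-learn releases that lack model_selection.
    from sklearn.cross_validation import train_test_split
```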

This is fine:
X_train_and_cv, X_test,y_train_and_cv,y_test = train_test_split(X,y,
                                                                test_size=0.2,
                                                                random_state=0,
                                                                stratify=y)

This is not, because the data being split is y_train_and_cv (len=240) while stratify=y (len=300):
X_train, X_cv,y_train,y_cv = train_test_split(X_train_and_cv,
                                              y_train_and_cv,
                                              test_size=0.2,
                                              random_state=0,
                                              stratify=y)
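
To see the mismatch concretely, the sketch below rebuilds the question's data, prints both lengths, and then attempts the faulty call. Which exception is raised depends on the scikit-learn version (the question shows an IndexError), so catching both ValueError and IndexError is an assumption made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Same shape of data as in the question: 100 ones followed by 200 zeros.
d = pd.DataFrame({'a': np.random.randn(300),
                  'c': np.concatenate([np.ones(100), np.zeros(200)])})
X, y = d['a'], d['c']

X_train_and_cv, X_test, y_train_and_cv, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The full label vector and the retained 80% no longer have the same length.
print(len(y), len(y_train_and_cv))

caught = None
try:
    # Faulty call: stratify has 300 entries, the arrays being split have 240.
    train_test_split(X_train_and_cv, y_train_and_cv,
                     test_size=0.2, random_state=0, stratify=y)
except (ValueError, IndexError) as exc:
    caught = exc

print(type(caught).__name__, caught)
```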

Replace the second call with:
X_train, X_cv,y_train,y_cv = train_test_split(X_train_and_cv,
                                              y_train_and_cv,
                                              test_size=0.2,
                                              random_state=0,
                                              stratify=y_train_and_cv)
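
Putting it together, here is a minimal end-to-end sketch of the corrected two-stage split, again assuming the question's data layout and the model_selection import. It also checks that each subset keeps roughly the original one-third share of the positive class, which is what stratify is for:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

d = pd.DataFrame({'a': np.random.randn(300),
                  'c': np.concatenate([np.ones(100), np.zeros(200)])})
X, y = d['a'], d['c']

# Stage 1: hold out 20% as the test set, stratified on the full labels.
X_train_and_cv, X_test, y_train_and_cv, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Stage 2: split the remaining 80% again, stratified on ITS OWN labels.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_train_and_cv, y_train_and_cv,
    test_size=0.2, random_state=0, stratify=y_train_and_cv)

# Report the size and positive-class share of each subset.
for name, part in [('train', y_train), ('cv', y_cv), ('test', y_test)]:
    print(name, len(part), round(part.mean(), 3))
```

The rule of thumb is that stratify should always be the label array that corresponds row for row to the data passed in that same call.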

Source: the Stack Overflow question "python - IndexError: positional indexers are out-of-bounds stratify sklearn test_train_split": https://stackoverflow.com/questions/40645522/
