Given a dataset with 10,000 observations, 50 features, and one label, and assuming a train/test split of 75%/25%, what are the shapes of X_train, y_train, X_test, and y_test? Should they be
X_train: (2500, 50)
y_train: (2500, )
X_test: (7500, 50)
y_test: (7500, )
or
X_train: (7500, 50)
y_train: (7500, )
X_test: (2500, 50)
y_test: (2500, )
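Best answer
The second option is correct: with a 75%/25% train/test split, the training set gets 7,500 rows and the test set gets 2,500. You can check this yourself with train_test_split from sklearn: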
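import numpy as np
from sklearn.model_selection import train_test_split

n = 10000  # number of observations
p = 50     # number of features
X = np.random.random((n, p))    # feature matrix, shape (n, p)
y = np.random.randint(0, 2, n)  # binary labels, shape (n,)
test = 0.25  # fraction of rows held out for testing

# train_test_split returns X_train, X_test, y_train, y_test, in that order
d = {}
d["X_train"], d["X_test"], d["y_train"], d["y_test"] = train_test_split(X, y, test_size=test)
for split in d:
    print(split, d[split].shape)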
Output:
X_train (7500, 50)
X_test (2500, 50)
y_train (7500,)
y_test (2500,)
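The arithmetic behind these shapes can also be sketched without sklearn. The helper below, split_shapes, is a hypothetical function written for illustration and is not part of any library. It assumes the test count is the test fraction times the number of rows, rounded up, with the training set taking the remaining rows. For 10,000 rows and a test fraction of 0.25 no rounding is involved.

```python
import math

def split_shapes(n_samples, n_features, test_size=0.25):
    """Return the expected shapes of X_train, X_test, y_train, y_test."""
    # Assumption: the test count is rounded up and the training set gets the rest.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return {
        "X_train": (n_train, n_features),
        "X_test": (n_test, n_features),
        "y_train": (n_train,),
        "y_test": (n_test,),
    }

for name, shape in split_shapes(10000, 50).items():
    print(name, shape)
```

The feature count (50) is unchanged by the split; only the number of rows is divided. The label arrays are one-dimensional, so their shapes have a single entry.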
Regarding python - Python machine learning labels and features, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46015464/