Problem description
I have a dataset with a unique identifier and other features. It looks like this:
That will take only the columns of the DataFrame that are not ID or Response as the X values, and take Response as the y values.
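For illustration, a split of that kind might look like the sketch below. The four-row frame is made-up sample data that only borrows the column names from this question, and `train_test_split` is imported from `sklearn.model_selection`; treat it as a sketch under those assumptions rather than the exact code from the answer.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "ID": ["123-456", "234-567", "345-678", "456-789"],
    "LenA": [50, 46, 87, 60],
    "TypeA": ["M", "S", "M", "S"],
    "LenB": [55, 49, 70, 58],
    "TypeB": ["M", "S", "M", "S"],
    "Diff": [5, 3, 17, 2],
    "Score": [0.8, 0.9, 0.7, 0.6],
    "Response": [0, 1, 0, 1],
})

# Features: every column except ID and Response. Target: Response.
X = df.loc[:, ~df.columns.isin(["ID", "Response"])]
y = df["Response"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```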
But you still will not be able to use DecisionTreeClassifier with this data, because it contains strings. You need to convert every column holding categorical data (here TypeA and TypeB) to a numerical representation. In my opinion, the best way to do this for sklearn is with LabelEncoder, which converts the categorical string labels ['M', 'S'] into the integer codes [0, 1] that DecisionTreeClassifier can work with. If you need an example, take a look at "Passing categorical data to sklearn decision tree".
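As a small sketch of that encoding step: `LabelEncoder` sorts the distinct labels and assigns integer codes starting at 0. The `type_a` list below is hypothetical sample data standing in for the TypeA column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values standing in for the TypeA column.
type_a = ["M", "S", "S", "M"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(type_a)

print(encoder.classes_.tolist())  # ['M', 'S']
print(encoded.tolist())           # [0, 1, 1, 0]

# The original labels can be recovered from the codes.
decoded = encoder.inverse_transform(encoded).tolist()
```

Each categorical column would need its own fitted encoder if you want to decode the values later.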
Update
Per your comment, I now understand that you need to map back to the ID. In this case you can use pandas to your advantage. Set ID as the index of your data and then do the split; that way the ID value is retained for every row of your train and test data. Let's assume your data are already in a pandas DataFrame.
    from sklearn.model_selection import train_test_split

    # Keep ID as the index so it travels with each row through the split
    df = df.set_index('ID')
    X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['Response'])], df.Response)
    print(X_train)

             LenA TypeA  LenB TypeB  Diff  Score
    ID
    345-678    87     M    70     M    17    0.7
    234-567    46     S    49     S     3    0.9
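To show how the index can then carry the IDs through to the predictions, here is a minimal end-to-end sketch. The sample rows, the `Response` values, the 50/50 split, and the `Predicted` series name are all assumptions made for illustration; only the column names come from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "ID": ["123-456", "234-567", "345-678", "456-789"],
    "LenA": [50, 46, 87, 60],
    "TypeA": ["M", "S", "M", "S"],
    "LenB": [55, 49, 70, 58],
    "TypeB": ["M", "S", "M", "S"],
    "Diff": [5, 3, 17, 2],
    "Score": [0.8, 0.9, 0.7, 0.6],
    "Response": [0, 1, 0, 1],
}).set_index("ID")

# Encode the string columns as integers so the tree can consume them.
for col in ["TypeA", "TypeB"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns="Response")
y = df["Response"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# X_test.index still holds the IDs, so each prediction can be labeled with its ID.
predictions = pd.Series(clf.predict(X_test), index=X_test.index, name="Predicted")
print(predictions)
```

Because the predictions are returned in the same row order as `X_test`, reusing `X_test.index` is enough to line each prediction up with its ID.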