基于列的sklearn分层抽样

基于列的sklearn分层抽样

本文介绍了基于列的sklearn分层抽样的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个相当大的CSV文件,其中包含我读入熊猫数据框的亚马逊评论数据.我想将数据分割为80-20(火车测试),但同时要确保分割的数据按比例表示一列(类别)的值,即火车中同时存在所有不同类别的评论并按比例测试数据.

I have a fairly large CSV file containing amazon review data which I read into a pandas data frame. I want to split the data 80-20(train-test) but while doing so I want to ensure that the split data is proportionally representing the values of one column (Categories), i.e all the different category of reviews are present both in train and test data proportionally.

数据如下:

**ReviewerID**       **ReviewText**        **Categories**       **ProductId**

1212                   good product         Mobile               14444425
1233                   will buy again       drugs                324532
5432                   not recomended       dvd                  789654123

我使用以下代码进行操作:

Im using the following code to do so:

import pandas as pd
Meta = pd.read_csv('C:\\Users\\xyz\\Desktop\\WM Project\\Joined.csv')
import numpy as np
from sklearn.cross_validation import train_test_split

train, test = train_test_split(Meta.categories, test_size = 0.2, stratify=y)

出现以下错误

NameError: name 'y' is not defined

由于我是python的新手,所以我无法弄清楚自己在做错什么,或者该代码是否会根据列类别进行分层.当我从训练测试拆分中删除分层选项以及类别列时,它似乎工作正常.

As I'm relatively new to python I cant figure out what I'm doing wrong or whether this code will stratify based on column categories. It seems to work fine when i remove the stratify option as well as the categories column from train-test split.

任何帮助将不胜感激.

推荐答案

    >>> import pandas as pd
    >>> Meta = pd.read_csv('C:\\Users\\*****\\Downloads\\so\\Book1.csv')
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> y = Meta.pop('Categories')
    >>> Meta
        ReviewerID      ReviewText  ProductId
        0        1212    good product   14444425
        1        1233  will buy again     324532
        2        5432  not recomended  789654123
    >>> y
        0    Mobile
        1     drugs
        2       dvd
        Name: Categories, dtype: object
    >>> X = Meta
    >>> X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.33, random_state=42, stratify=y)
    >>> X_test
        ReviewerID    ReviewText  ProductId
        0        1212  good product   14444425

这篇关于基于列的sklearn分层抽样的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-23 16:11