This article covers how to create multiple new columns from a pipe-delimited column in pandas.

Problem description

I have a pandas dataframe with a pipe-delimited column, called Parts, that holds an arbitrary number of elements. The number of elements in these pipe-strings varies from 0 to over 10. The number of unique elements across all pipe-strings is not much smaller than the number of rows, which makes it impossible for me to specify them all manually when creating new columns.

For each row, I want to create a new column that acts as an indicator variable for each element of the pipe-delimited list. For instance, the row

... 'Parts' ...

... '12|34|56'

should be converted to

... 'Part_12' 'Part_34' 'Part_56' ...

... 1 1 1 ...

Because there are a lot of unique parts, these columns are obviously going to be sparse: mostly zeros, since each row contains only a small fraction of the unique parts.

I haven't found any approach that doesn't require manually specifying the columns (for instance, Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries). I've also looked at pandas' melt, but I don't think that's the appropriate tool.

The way I know how to solve it would be to pipe the raw CSV to another Python script and process it character by character, but I need to work within my existing script, since I will be processing hundreds of CSVs in this manner.

Here is a better illustration of the data:

ID    YEAR  AMT      PARTZ
1202  2007  99.34
9321  1988  1012.99  2031|8942
2342  2012  381.22   1939|8321|Amx3

Recommended answer

You can use str.get_dummies, which splits on '|' by default, together with add_prefix:

df.Parts.str.get_dummies().add_prefix('Part_')

Output:

   Part_12  Part_34  Part_56
0        1        1        1
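
To show how this might fit the sample data from the question, here is a minimal sketch. It assumes the column is named PARTZ as in the illustration, that a row with no parts is stored as a missing value, and that the indicator columns should replace the original column; the variable names indicators, result, and sparse_indicators are made up for this example.

```python
import pandas as pd

# Sample rows from the question; the first row has no parts (missing value).
df = pd.DataFrame({
    "ID": [1202, 9321, 2342],
    "YEAR": [2007, 1988, 2012],
    "AMT": [99.34, 1012.99, 381.22],
    "PARTZ": [None, "2031|8942", "1939|8321|Amx3"],
})

# str.get_dummies splits on "|" by default; sep is spelled out for clarity.
# Rows with a missing value get 0 in every indicator column.
indicators = df["PARTZ"].str.get_dummies(sep="|").add_prefix("Part_")

# Attach the indicator columns and drop the original pipe-delimited column.
result = df.drop(columns="PARTZ").join(indicators)

# Optional: since most entries are 0, the indicators can be stored sparsely.
sparse_indicators = indicators.astype(pd.SparseDtype("int64", 0))

print(result)
```

Because no column names are listed by hand, the same lines can be applied to each CSV in turn inside an existing script.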

Edit, in response to the comments, to count duplicates:

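import pandas as pd
df = pd.DataFrame({'Parts': ['12|34|56|12']}, index=[0])
# groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
pd.get_dummies(df.Parts.str.split('|', expand=True).stack()).groupby(level=0).sum().add_prefix('Part_')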

Output:

   Part_12  Part_34  Part_56
0        2        1        1
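
One thing to watch with the counting approach: depending on the pandas version, stack may drop rows that have no parts at all, so they can disappear from the result. The sketch below is one possible way to keep them, by reindexing against the original index; the three-row frame and the names stacked and counts are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: one row without parts, one with a duplicate, one single part.
df = pd.DataFrame({"Parts": [None, "12|34|56|12", "34"]})

# One part per row, indexed by (original row, position within the string).
stacked = df["Parts"].str.split("|", expand=True).stack()

counts = (
    pd.get_dummies(stacked)              # one indicator column per distinct part
    .groupby(level=0).sum()              # add up indicators per original row
    .reindex(df.index, fill_value=0)     # restore rows that had no parts
    .add_prefix("Part_")
)

print(counts)
```

The reindex step is a no-op when every row is already present, so it should be safe to leave in either way.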

