Question
I wrote a web scraper that pulls information from a table of products and builds a dataframe. The data table has a DESCRIPTION column containing a comma-separated string of attributes describing each product. I want to create a column in the dataframe for every unique attribute and populate each row in that column with the attribute's substring. An example df is below.
PRODUCTS DATE DESCRIPTION
Product A 2016-9-12 Steel, Red, High Hardness
Product B 2016-9-11 Blue, Lightweight, Steel
Product C 2016-9-12 Red
I figure the first step is to split the description into a list.
In: df2 = df['DESCRIPTION'].str.split(',')
Out:
DESCRIPTION
['Steel', 'Red', 'High Hardness']
['Blue', 'Lightweight', 'Steel']
['Red']
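A caveat on the split above: splitting on a bare comma keeps the space that follows each comma, so 'Red' and ' Red' would later count as different attributes. The following is a minimal sketch, assuming the example frame is rebuilt by hand as `df`, that compares the plain split with a split on a comma plus any trailing whitespace. The names `naive` and `clean` are illustrative only.

```python
import pandas as pd

# Hand-built copy of the example frame from the question.
df = pd.DataFrame({
    "PRODUCTS": ["Product A", "Product B", "Product C"],
    "DATE": ["2016-9-12", "2016-9-11", "2016-9-12"],
    "DESCRIPTION": [
        "Steel, Red, High Hardness",
        "Blue, Lightweight, Steel",
        "Red",
    ],
})

# Splitting on a bare comma keeps the leading space of each later item.
naive = df["DESCRIPTION"].str.split(",")

# Splitting on a comma followed by optional whitespace drops that space.
# A multi-character pattern is treated as a regular expression by pandas.
clean = df["DESCRIPTION"].str.split(r",\s*")

print(list(naive[0]))
print(list(clean[0]))
```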
My desired output looks like the table below. The column names are not particularly important.
PRODUCTS   DATE       STEEL_COL  RED_COL  HIGH HARDNESS_COL  BLUE_COL  LIGHTWEIGHT_COL
Product A  2016-9-12  Steel      Red      High Hardness
Product B  2016-9-11  Steel                                  Blue      Lightweight
Product C  2016-9-12             Red
I believe the columns can be set up using a pivot, but I'm not sure of the most Pythonic way to populate the columns after establishing them. Any help is appreciated.
Thank you very much for the answers. I selected @MaxU's response as correct since it seems slightly more flexible, but @piRSquared's gets a very similar result and may even be considered the more Pythonic approach. I tested both versions and both do what I needed. Thanks!
Recommended answer
You can build up a sparse matrix:
In [27]: df
Out[27]:
PRODUCTS DATE DESCRIPTION
0 Product A 2016-9-12 Steel, Red, High Hardness
1 Product B 2016-9-11 Blue, Lightweight, Steel
2 Product C 2016-9-12 Red
In [28]: (df.set_index(['PRODUCTS','DATE'])
....: .DESCRIPTION.str.split(r',\s*', expand=True)
....: .stack()
....: .reset_index()
....: .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value=0, aggfunc='size')
....: )
Out[28]:
0                    Blue  High Hardness  Lightweight  Red  Steel
PRODUCTS  DATE
Product A 2016-9-12     0              1            0    1      1
Product B 2016-9-11     1              0            1    0      1
Product C 2016-9-12     0              0            0    1      0
In [29]: (df.set_index(['PRODUCTS','DATE'])
....: .DESCRIPTION.str.split(r',\s*', expand=True)
....: .stack()
....: .reset_index()
....: .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value='', aggfunc='size')
....: )
Out[29]:
0                    Blue  High Hardness  Lightweight  Red  Steel
PRODUCTS  DATE
Product A 2016-9-12                    1                 1      1
Product B 2016-9-11     1                           1           1
Product C 2016-9-12                                      1
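The tables above mark each attribute with a 1, whereas the question asked for the attribute's own text in each cell. The following is a minimal sketch of one way to get there. It takes a different route from the pivot_table chain, using `Series.str.get_dummies`, and it assumes every separator is exactly a comma followed by one space. The names `flags`, `labels` and `result`, and the `_COL` suffix, are illustrative only.

```python
import pandas as pd

# Hand-built copy of the example frame from the question.
df = pd.DataFrame({
    "PRODUCTS": ["Product A", "Product B", "Product C"],
    "DATE": ["2016-9-12", "2016-9-11", "2016-9-12"],
    "DESCRIPTION": [
        "Steel, Red, High Hardness",
        "Blue, Lightweight, Steel",
        "Red",
    ],
})

# One indicator column per unique attribute.
# Assumption: items are always separated by ", " (comma plus one space).
flags = df["DESCRIPTION"].str.get_dummies(sep=", ")

# Swap each set flag for the column's own name and each unset flag for "".
labels = flags.apply(
    lambda col: col.astype(bool).map({True: col.name, False: ""})
)

# Attach the attribute columns to the identifying columns.
result = df[["PRODUCTS", "DATE"]].join(labels.add_suffix("_COL"))
print(result)
```

If the separators are less regular (for example, no space after some commas), normalizing DESCRIPTION first, or keeping the regex split from the answer above, would likely be safer than relying on a fixed `sep`.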