Problem description
Note: this question is indeed a duplicate of "Split pandas dataframe string entry to separate rows", but the answer provided here is more generic and informative, so, with all due respect, I chose not to delete the thread.
I have a dataset with the following format:
id | value | ...
--------|-------|------
a | 156 | ...
b,c | 457 | ...
e,g,f,h | 346 | ...
... | ... | ...
and I would like to normalize it by duplicating all values for each id:
id | value | ...
--------|-------|------
a | 156 | ...
b | 457 | ...
c | 457 | ...
e | 346 | ...
g | 346 | ...
f | 346 | ...
h | 346 | ...
... | ... | ...
What I'm doing is applying the split-apply-combine principle of pandas using .groupby, which yields a tuple for each group: (groupby value, pd.DataFrame()).
I created a column to group by that simply counts the ids in each row:
df['count_ids'] = df['id'].str.split(',').str.len()
id | value | count_ids
--------|-------|------
a | 156 | 1
b,c | 457 | 2
e,g,f,h | 346 | 4
... | ... | ...
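To make the (groupby value, DataFrame) tuples concrete, here is a minimal sketch that rebuilds the three sample rows and iterates over the groups. The sample frame and the `groups` dictionary are assumptions made for illustration; they are not part of the original post.

```python
import pandas as pd

# Sample data rebuilt from the tables in the question
df = pd.DataFrame({"id": ["a", "b,c", "e,g,f,h"], "value": [156, 457, 346]})
df["count_ids"] = df["id"].str.split(",").str.len()

# Iterating a GroupBy object yields (group key, sub-DataFrame) pairs
groups = {key: group for key, group in df.groupby("count_ids")}

for key, group in groups.items():
    print(key, group["id"].tolist())
```

Each sub-DataFrame holds only the rows that share the same number of ids, which is what the row-duplication step then works on.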
This is how I duplicate the rows:
pd.concat([group] * count_ids)
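As a rough sketch of the same duplication idea without the groupby loop, each row can be repeated by its id count with `Index.repeat`, and the `id` column can then be overwritten with the flattened ids. The names `parts`, `counts` and `repeated` are hypothetical, and the sample frame is rebuilt from the tables above.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "b,c", "e,g,f,h"], "value": [156, 457, 346]})

parts = df["id"].str.split(",")   # one list of ids per row
counts = parts.str.len()          # number of ids per row

# Repeat every row as many times as it has ids
repeated = df.loc[df.index.repeat(counts.to_numpy())].copy()

# Replace the comma-separated strings with the individual ids, in row order
repeated["id"] = [single_id for ids in parts for single_id in ids]

print(repeated)
```

This keeps the original row labels, so the duplicated rows can still be traced back to the row they came from.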
I'm slowly making progress, but it is really complex, and I would appreciate any best practice or recommendation you can share for this type of problem.
Recommended answer
Try this:
In [44]: df
Out[44]:
id value
0 a 156
1 b,c 457
2 e,g,f,h 346
In [45]: (df['id'].str.split(',', expand=True)
....: .stack()
....: .reset_index(level=0)
....: .set_index('level_0')
....: .rename(columns={0:'id'})
....: .join(df.drop(columns='id'), how='left')
....: )
Out[45]:
id value
0 a 156
1 b 457
1 c 457
2 e 346
2 g 346
2 f 346
2 h 346
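The chain above can also be wrapped in a small reusable function. This is only a sketch: `split_rows` is a hypothetical helper name, the sample frame is rebuilt from `In [44]`, and the extra `.dropna()` is an assumption added so that the `None` padding produced by `expand=True` is discarded even on pandas versions where `stack()` keeps missing values.

```python
import pandas as pd

def split_rows(df, column, sep=","):
    """Hypothetical helper: return one row per separated item in `column`."""
    pieces = (
        df[column].str.split(sep, expand=True)
        .stack()                           # long Series indexed by (row, position)
        .dropna()                          # discard padding from shorter rows
        .reset_index(level=1, drop=True)   # keep only the original row label
        .rename(column)
    )
    # Join the other columns back on the row label, restoring column order
    return df.drop(columns=column).join(pieces)[df.columns.tolist()]

df = pd.DataFrame({"id": ["a", "b,c", "e,g,f,h"], "value": [156, 457, 346]})
result = split_rows(df, "id")
print(result)
```

Dropping the inner index level directly replaces the `reset_index(level=0)` / `set_index('level_0')` / `rename(columns=...)` steps, since only the original row label is needed for the join.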
Explanation:
In [48]: df['id'].str.split(',', expand=True).stack()
Out[48]:
0 0 a
1 0 b
1 c
2 0 e
1 g
2 f
3 h
dtype: object
In [49]: df['id'].str.split(',', expand=True).stack().reset_index(level=0)
Out[49]:
level_0 0
0 0 a
0 1 b
1 1 c
0 2 e
1 2 g
2 2 f
3 2 h
In [50]: df['id'].str.split(',', expand=True).stack().reset_index(level=0).set_index('level_0')
Out[50]:
0
level_0
0 a
1 b
1 c
2 e
2 g
2 f
2 h
In [51]: df['id'].str.split(',', expand=True).stack().reset_index(level=0).set_index('level_0').rename(columns={0:'id'})
Out[51]:
id
level_0
0 a
1 b
1 c
2 e
2 g
2 f
2 h
In [52]: df.drop(columns='id')
Out[52]:
value
0 156
1 457
2 346
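For comparison, pandas 0.25 and later provide `DataFrame.explode`, which can produce the same result in one step. This is a different technique from the stack/join chain explained above, shown here as a sketch on the same sample data; the name `exploded` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "b,c", "e,g,f,h"], "value": [156, 457, 346]})

# Turn each id string into a list, then emit one row per list element
exploded = df.assign(id=df["id"].str.split(",")).explode("id")

print(exploded)
```

As with the join-based version, the original row labels are repeated in the index; calling `reset_index(drop=True)` afterwards would give a fresh 0..n-1 index if that is preferred.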