Problem description
I have a dataset like this:
import pandas as pd

df = pd.DataFrame(
    {
        "333-0": [123, 123, 123],
        "5985-0.0": [1, 2, 3],
        "5985-0.1": [1, 2, 3],
        "5985-0.2": [1, 2, 3],
    },
    index=[0, 1, 2],
)
Here, we have three columns ["5985-0.0", "5985-0.1", "5985-0.2"] that represent the first, second and third float readings of thing 5985-0, i.e. the .x suffix is an array index.
I'd like to take multiple columns and collapse them into a single column 5985-0 containing some kind of list of floats, which I can do like this:
srccols = ["5985-0.0", "5985-0.1", "5985-0.2"]
# build one tuple per row from the source columns
df["5985-0"] = df[srccols].apply(tuple, axis=1)
# drop the now-redundant source columns
df.drop(srccols, axis=1)
   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
I can then store the result as an SQL table with an array column.
However, apply(tuple) is very slow. Is there a faster, more idiomatic pandas way to combine multiple columns into one?
(First person to say "normalized" gets a downvote.)
Recommended answer
My Choice
Assuming I know the columns
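thing = '5985-0'
cols = ['5985-0.0', '5985-0.1', '5985-0.2']
k = len(cols)
v = df.values
# pull each source column out of the underlying array as a plain list
l = [v[:, df.columns.get_loc(c)].tolist() for c in cols]
# zip the column lists into one tuple per row
s = pd.Series(list(zip(*l)), name=thing)
df.drop(columns=cols).join(s)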
   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
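If the column list is not known up front, it could be derived from the column names. The following is only a minimal sketch, assuming every reading column is named `<thing>.<n>` with an integer `n`; the helper `reading_columns` is a hypothetical name, not part of pandas.

```python
import pandas as pd


def reading_columns(df, thing):
    """Hypothetical helper: columns named '<thing>.<n>', sorted by integer n."""
    prefix = thing + "."
    found = [
        c for c in df.columns
        if c.startswith(prefix) and c[len(prefix):].isdigit()
    ]
    return sorted(found, key=lambda c: int(c[len(prefix):]))


# columns deliberately out of order to show the sort by index suffix
df = pd.DataFrame({
    "333-0": [123, 123, 123],
    "5985-0.1": [10, 20, 30],
    "5985-0.0": [1, 2, 3],
    "5985-0.2": [100, 200, 300],
})

thing = "5985-0"
cols = reading_columns(df, thing)
s = pd.Series(df[cols].values.tolist(), name=thing)
result = df.drop(columns=cols).join(s)
```

Sorting on the integer suffix rather than the raw string keeps `.10` after `.9` when there are more than ten readings.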
Base Case
Using filter, join, and apply(tuple, 1)
thing = '5985-0'
# select every column whose name contains the thing's id
d = df.filter(like=thing)
s = d.apply(tuple, axis=1).rename(thing)
cols = d.columns
df.drop(columns=cols).join(s)
   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
Option 2
Using filter, join, and pd.Series
thing = '5985-0'
d = df.filter(like=thing)
# values.tolist() yields one Python list per row (lists, not tuples)
s = pd.Series(d.values.tolist(), name=thing)
cols = d.columns
df.drop(columns=cols).join(s)
   333-0     5985-0
0    123  [1, 1, 1]
1    123  [2, 2, 2]
2    123  [3, 3, 3]
Option 3
Using filter, join, pd.Series, and zip
thing = '5985-0'
d = df.filter(like=thing)
# transpose so zip receives one array per column and emits one tuple per row
s = pd.Series(list(zip(*d.values.T)), name=thing)
cols = d.columns
print(df.drop(columns=cols).join(s))
   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
Timing
Large Data Set
df = pd.concat([df] * 10000, ignore_index=True)
%%timeit
thing = '5985-0'
d = df.filter(like=thing)
s = d.apply(tuple, 1).rename(thing)
cols = d.columns
df.drop(columns=cols).join(s)
1 loop, best of 3: 350 ms per loop
%%timeit
thing = '5985-0'
cols = ['5985-0.0', '5985-0.1', '5985-0.2']
k = len(cols)
v = df.values
l = [v[:, df.columns.get_loc(c)].tolist() for c in cols]
s = pd.Series(list(zip(*l)), name=thing)
df.drop(columns=cols).join(s)
100 loops, best of 3: 4.06 ms per loop
%%timeit
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(d.values.tolist(), name=thing)
cols = d.columns
df.drop(columns=cols).join(s)
100 loops, best of 3: 4.56 ms per loop
%%timeit
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(list(zip(*d.values.T)), name=thing)
cols = d.columns
df.drop(columns=cols).join(s)
100 loops, best of 3: 6.89 ms per loop
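The %%timeit cells above only run inside IPython or Jupyter. For anyone who wants to rerun the comparison as a plain script, here is a rough sketch using the standard-library timeit module. The function names (base_case, my_choice, option_2, option_3) are just labels I made up for the four snippets, and the numbers it prints will depend on your machine and pandas version.

```python
import timeit

import pandas as pd

small = pd.DataFrame({
    "333-0": [123, 123, 123],
    "5985-0.0": [1, 2, 3],
    "5985-0.1": [1, 2, 3],
    "5985-0.2": [1, 2, 3],
})
df = pd.concat([small] * 10000, ignore_index=True)

thing = "5985-0"
cols = ["5985-0.0", "5985-0.1", "5985-0.2"]


def base_case(df):
    d = df.filter(like=thing)
    s = d.apply(tuple, axis=1).rename(thing)
    return df.drop(columns=d.columns).join(s)


def my_choice(df):
    v = df.values
    l = [v[:, df.columns.get_loc(c)].tolist() for c in cols]
    s = pd.Series(list(zip(*l)), name=thing)
    return df.drop(columns=cols).join(s)


def option_2(df):
    d = df.filter(like=thing)
    s = pd.Series(d.values.tolist(), name=thing)
    return df.drop(columns=d.columns).join(s)


def option_3(df):
    d = df.filter(like=thing)
    s = pd.Series(list(zip(*d.values.T)), name=thing)
    return df.drop(columns=d.columns).join(s)


for f in (base_case, my_choice, option_2, option_3):
    # best of 3 single runs, mirroring the "best of 3" figures above
    best = min(timeit.repeat(lambda: f(df), number=1, repeat=3))
    print(f"{f.__name__}: {best * 1e3:.2f} ms")
```

Note that option_2 returns lists while the other three return tuples, so pick whichever your SQL driver expects for an array column.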