Fast way to combine multiple columns of type float into one column of type array(float)

Question

I have a dataset like this:

import pandas as pd

df = pd.DataFrame({
    "333-0": [123, 123, 123],
    "5985-0.0": [1, 2, 3],
    "5985-0.1": [1, 2, 3],
    "5985-0.2": [1, 2, 3],
}, index=[0, 1, 2])

Here, the three columns ["5985-0.0", "5985-0.1", "5985-0.2"] hold the first, second and third float readings of thing 5985-0; that is, the .x suffix is an array index.

I'd like to take multiple columns and collapse them into a single column 5985-0 containing some kind of list of float, which I can do like this:

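srccols = ["5985-0.0", "5985-0.1", "5985-0.2"]
# build one tuple per row from the source columns
df["5985-0"] = df[srccols].apply(tuple, axis=1)
# drop (not dropna) the source columns now that they are collapsed
df.drop(srccols, axis=1)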

   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)

which I can then store as an SQL table with an array column.

However, apply(tuple) is very slow. Is there a faster, more idiomatic pandas way to combine multiple columns into one?

(First person to say "normalized" gets a downvote).

Recommended answer

My Choice
Assuming I know the columns

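thing = '5985-0'
cols = ['5985-0.0', '5985-0.1', '5985-0.2']
v = df.values
# pull each source column out of the underlying array as a plain Python list
l = [v[:, df.columns.get_loc(c)].tolist() for c in cols]
# zip the column lists row-wise into one tuple per row
s = pd.Series(list(zip(*l)), name=thing)
# drop the source columns and attach the collapsed column (aligned on the index)
df.drop(columns=cols).join(s)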

   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
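
If several things in the frame follow the same '<thing>.<i>' naming, the same idea can be applied to all of them in one pass. The following is only a sketch under that naming assumption: the helper name collapse_indexed_columns is hypothetical and not part of the original answer, and it passes index=df.index so the result also lines up when the frame does not have a default integer index.

```python
import pandas as pd


def collapse_indexed_columns(df):
    """Collapse columns named '<thing>.<i>' into one tuple column per <thing>.

    Hypothetical helper: assumes the text after the last '.' is an integer
    array index, as in the question's column names.
    """
    groups = {}
    for col in df.columns:
        thing, sep, idx = col.rpartition(".")
        if sep and idx.isdigit():
            groups.setdefault(thing, []).append((int(idx), col))

    # Remove every source column that belongs to some group.
    source_cols = [c for members in groups.values() for _, c in members]
    out = df.drop(columns=source_cols)

    for thing, members in groups.items():
        # Order the columns by their numeric suffix, not lexicographically.
        cols = [c for _, c in sorted(members)]
        rows = list(zip(*(df[c].tolist() for c in cols)))
        out[thing] = pd.Series(rows, index=df.index)
    return out


df = pd.DataFrame({
    "333-0": [123, 123, 123],
    "5985-0.0": [1, 2, 3],
    "5985-0.1": [1, 2, 3],
    "5985-0.2": [1, 2, 3],
})
result = collapse_indexed_columns(df)
print(result)
```

Columns without a numeric '.<i>' suffix, such as 333-0, are left untouched.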

Base Case
Using filter, join, and apply(tuple, 1)

thing = '5985-0'
# keep only the columns whose name contains the thing id
d = df.filter(like=thing)
s = d.apply(tuple, 1).rename(thing)
cols = d.columns
df.drop(columns=cols).join(s)

   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)
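
One caveat with filter(like=...): it keeps every column whose name contains the given substring, so an unrelated column can slip in. The sketch below illustrates this with a made-up extra column, 15985-0.0, which is not in the original data, and shows an anchored regex as one possible stricter alternative.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "333-0": [123, 123],
    "5985-0.0": [1.0, 2.0],
    "5985-0.1": [1.5, 2.5],
    "15985-0.0": [9.0, 9.0],  # hypothetical unrelated column
})

thing = "5985-0"

# Substring match: also picks up "15985-0.0".
loose = df.filter(like=thing)

# Anchored match: the thing id, a literal dot, then digits only.
strict = df.filter(regex=rf"^{re.escape(thing)}\.\d+$")

print(list(loose.columns))
print(list(strict.columns))
```

With column names as regular as those in the question, like= is probably sufficient; the regex form only matters when ids can be substrings of one another.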

Option 2
Using filter, join, pd.Series

thing = '5985-0'
d = df.filter(like=thing)
# tolist() on the 2-D array yields one Python list per row
s = pd.Series(d.values.tolist(), name=thing)
cols = d.columns
df.drop(columns=cols).join(s)

   333-0     5985-0
0    123  [1, 1, 1]
1    123  [2, 2, 2]
2    123  [3, 3, 3]
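
Note that this option produces lists rather than tuples, and that the element types differ between the options: tolist() converts to built-in Python scalars, while zipping over the NumPy array keeps NumPy scalars. Whether that matters depends on what the database layer accepts, which is an assumption to check for your own driver. A small sketch on made-up float data:

```python
import pandas as pd

d = pd.DataFrame({"5985-0.0": [1.0, 2.0], "5985-0.1": [1.5, 2.5]})

# Option 2 style: one list per row, elements are built-in floats.
as_lists = d.values.tolist()

# Option 3 style: one tuple per row, elements are NumPy scalars.
as_tuples_np = list(zip(*d.values.T))

# If tuples of built-in floats are wanted, convert the lists.
as_tuples_py = [tuple(row) for row in d.values.tolist()]

print(type(as_lists[0][0]).__name__)
print(type(as_tuples_np[0][0]).__name__)
print(type(as_tuples_py[0][0]).__name__)
```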

Option 3
Using filter, join, pd.Series, and zip

thing = '5985-0'
d = df.filter(like=thing)
# transpose to get one array per column, then zip them back into row tuples
s = pd.Series(list(zip(*d.values.T)), name=thing)
cols = d.columns
df.drop(columns=cols).join(s)

   333-0     5985-0
0    123  (1, 1, 1)
1    123  (2, 2, 2)
2    123  (3, 3, 3)


Timing
Large Data Set

df = pd.concat([df] * 10000, ignore_index=True)

%%timeit
# Base Case: filter, join, apply(tuple, 1)
thing = '5985-0'
d = df.filter(like=thing)
s = d.apply(tuple, 1).rename(thing)
cols = d.columns
df.drop(columns=cols).join(s)
1 loop, best of 3: 350 ms per loop

%%timeit
# My Choice: known columns, tolist per column, zip
thing = '5985-0'
cols = ['5985-0.0', '5985-0.1', '5985-0.2']
v = df.values
l = [v[:, df.columns.get_loc(c)].tolist() for c in cols]
s = pd.Series(list(zip(*l)), name=thing)
df.drop(columns=cols).join(s)
100 loops, best of 3: 4.06 ms per loop

%%timeit
# Option 2: filter, join, pd.Series
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(d.values.tolist(), name=thing)
cols = d.columns
df.drop(columns=cols).join(s)
100 loops, best of 3: 4.56 ms per loop

%%timeit
# Option 3: filter, join, pd.Series, zip
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(list(zip(*d.values.T)), name=thing)
cols = d.columns
df.drop(columns=cols).join(s)
100 loops, best of 3: 6.89 ms per loop
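
The %%timeit cells above need IPython. To repeat the comparison as a plain script, something like the following sketch with the standard-library timeit module should work. The function names via_apply, via_tolist and via_zip are made up for this example, and the numbers will differ by machine and pandas version, so none are quoted here.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "333-0": [123, 123, 123],
    "5985-0.0": [1, 2, 3],
    "5985-0.1": [1, 2, 3],
    "5985-0.2": [1, 2, 3],
})
big = pd.concat([base] * 10000, ignore_index=True)
thing = "5985-0"


def via_apply(df):
    # Base case: row-wise apply(tuple).
    d = df.filter(like=thing)
    s = d.apply(tuple, axis=1).rename(thing)
    return df.drop(columns=d.columns).join(s)


def via_tolist(df):
    # Option 2: one list per row from the underlying array.
    d = df.filter(like=thing)
    s = pd.Series(d.values.tolist(), name=thing, index=df.index)
    return df.drop(columns=d.columns).join(s)


def via_zip(df):
    # Option 3: zip the column arrays into row tuples.
    d = df.filter(like=thing)
    s = pd.Series(list(zip(*d.values.T)), name=thing, index=df.index)
    return df.drop(columns=d.columns).join(s)


repeats = 3
for fn in (via_apply, via_tolist, via_zip):
    seconds = timeit.timeit(lambda: fn(big), number=repeats) / repeats
    print(f"{fn.__name__}: {seconds * 1000:.2f} ms per call")
```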
