Summary:
I have a time series of codes (a large dataset) that mark the start and end of actions, and I want to arrange them into a Gantt chart, so I need to regroup them into Task (name), Start (time) and Finish (time) columns. So far I can only do this very slowly, iterating over every row with a for loop :(
(I have been trying groupby and pivot, but I just don't understand them well enough yet to make them do what I want.)
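(For reference, the target shape is the three-column frame that plotly's create_gantt expects; here is a minimal sketch with made-up rows and dates, purely to illustrate the Task/Start/Finish layout I am after:)
import pandas as pd
# hypothetical example rows, only to show the target layout
target = pd.DataFrame(
    [["a", "1970-01-02", "1970-01-05"],
     ["b", "1970-01-01", "1970-01-08"]],
    columns=["Task", "Start", "Finish"])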
Key
I have a "key" dictionary/df with start codes, end codes and action labels. Simplified example:
import pandas as pd
code_key_cols = ["start_code", "end_code", "label"]
code_key = [[1, 2, "a"],
[3, 4, "b"],
[5, 6, "c"],
[7, 8, "d"]]
code_df = pd.DataFrame(code_key, columns=code_key_cols)
Out[]:
   start_code  end_code label
0           1         2     a
1           3         4     b
2           5         6     c
3           7         8     d
Data
Then I have a bunch of data, which is simply a time series of these codes.
I want to organize it into a form from which I can plot a Gantt chart.
For the plot to work, the tasks need to go into Start and Finish columns.
(The dummy data created here just mimics the behaviour of the real data, where the same action type never runs twice in parallel.)
from random import shuffle

data = []
for i in range(3000):
    start_codes = list(code_df.iloc[:, 0])
    end_codes = list(code_df.iloc[:, 1])
    shuffle(start_codes)
    shuffle(end_codes)
    data.extend(start_codes)
    data.extend(end_codes)

data_cols = ["code", "time"]
data_df = pd.DataFrame()
data_df['code'] = data
data_df['time'] = pd.date_range(start="19700101", periods=len(data))
print(data_df.head())
   code       time
0     3 1970-01-01
1     1 1970-01-02
2     7 1970-01-03
3     5 1970-01-04
4     2 1970-01-05
My attempt:
I can do it, but only in a very slow way, iterating row by row! I'm sure pandas has a more efficient way of doing this. How would you do it?
This is how I do it at the moment, but it takes 13 s on a 12K-row df :(
import numpy as np

lst = []
# for every start/end code pair, walk through the whole time series and
# pair each start event with the next matching end event
for _, code_row in code_df.iterrows():
    begin = True
    task = np.nan
    start = np.nan
    finish = np.nan
    for _, data_row in data_df.iterrows():
        if begin:
            if code_row['start_code'] == data_row['code']:
                task = code_row.label
                start = data_row.time
                begin = False
        else:
            if code_row['end_code'] == data_row['code']:
                finish = data_row.time
                begin = True
                lst.append([task, start, finish])

df3 = pd.DataFrame(data=lst, columns=["Task", 'Start', 'Finish'])
Output
For context, here is the goal: I plot the Gantt chart with the code below (with the range of i changed from 3000 to 10 to keep it simple).
import plotly.figure_factory as ff
import plotly.io as pio
pio.renderers.default = "browser"
fig = ff.create_gantt(df3, group_tasks=True)
fig.show()
By the way, if you have read this far, thank you very much for your time! :)
Accepted answer
Hope this helps. This should give you the same output:
# we'll create a new dataframe out of two slices on data_df (resulting in two new dataframes), namely those rows belonging to start_code and those belonging to end_code.
# next, sort the slices on code and time such that our slices match in order (this builds on the concurrent assumption you stated)
# drop unwanted columns and rename others as desired
# reset indices as otherwise pd.concat tries to adhere to the old indices
# merge the labels from code_df
df3_new = pd.concat([
        data_df[data_df.code.isin(code_df.start_code)]
            .sort_values(['code', 'time'])
            .reset_index(drop=True)
            .rename(columns={'time': 'Start'}),
        data_df[data_df.code.isin(code_df.end_code)]
            .sort_values(['code', 'time'])
            .reset_index(drop=True)
            .rename(columns={'time': 'Finish'})
            .drop('code', axis=1)
    ], axis=1) \
    .merge(code_df, how='left', left_on='code', right_on='start_code') \
    .drop(['code', 'start_code', 'end_code'], axis=1) \
    .rename(columns={'label': 'Task'})
# which yields the same outcome (for the given set at least)
df3.equals(df3_new.loc[:, ['Task','Start', 'Finish']])
True
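If you want to plot straight from df3_new, selecting the columns in that order should be enough. A small sketch, simply reusing the plotting call from the question:
import plotly.figure_factory as ff
# plot directly from the reshaped frame, keeping the column order create_gantt expects
fig = ff.create_gantt(df3_new.loc[:, ['Task', 'Start', 'Finish']], group_tasks=True)
fig.show()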
with the following average performance on the given set:
12.5 ms ± 435 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
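For what it's worth, here is a minimal sketch of how such a timing could be reproduced outside IPython with the standard timeit module; build_df3_new is just a hypothetical wrapper around the pipeline above so that timeit can call it repeatedly:
import timeit

def build_df3_new():
    # same slice/sort/concat/merge pipeline as above, wrapped in a function
    starts = (data_df[data_df.code.isin(code_df.start_code)]
              .sort_values(['code', 'time'])
              .reset_index(drop=True)
              .rename(columns={'time': 'Start'}))
    ends = (data_df[data_df.code.isin(code_df.end_code)]
            .sort_values(['code', 'time'])
            .reset_index(drop=True)
            .rename(columns={'time': 'Finish'})
            .drop('code', axis=1))
    return (pd.concat([starts, ends], axis=1)
            .merge(code_df, how='left', left_on='code', right_on='start_code')
            .drop(['code', 'start_code', 'end_code'], axis=1)
            .rename(columns={'label': 'Task'}))

# average seconds per call over 100 runs
print(timeit.timeit(build_df3_new, number=100) / 100)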