Problem description
I'm currently working with some data that I receive from an engineering plant. The data comes out (roughly) as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.random.randint(1, 25, size=5),
                   'on/off': np.random.randint(0, 2, size=5),
                   'Time': pd.date_range(start='01/01/2019', periods=5, freq='5s')})
print(df)
   ID  on/off                 Time
0  17       0  2019-01-01 00:00:00
1  21       0  2019-01-01 00:00:05
2  12       1  2019-01-01 00:00:10
3  12       1  2019-01-01 00:00:15
4  12       0  2019-01-01 00:00:20
The 0 and 1 in the on/off column indicate whether a machine is on or off (0 = on, 1 = off).
Currently, I use the following line of beautiful code to get the difference between each row's timestamp and the previous row's:
df['Time Difference'] = df['Time'] - df['Time'].shift()
print(df)
   ID  on/off                 Time  Time Difference
0  17       0  2019-01-01 00:00:00              NaT
1  21       0  2019-01-01 00:00:05         00:00:05
2  12       1  2019-01-01 00:00:10         00:00:05
3  12       1  2019-01-01 00:00:15         00:00:05
4  12       0  2019-01-01 00:00:20         00:00:05
Now, as this dataframe is quite verbose (each week I'll receive about 150k rows), what would be the best way to sum the amount of time a machine is off (where df['on/off'] == 1) until the next 0 comes along? So in the above example for the 1st of January 2019, the machine with ID 12 didn't run for 15 seconds until it resumed at 00:00:20.
I'm really quite perplexed by this! I can group this by ID, on/off, dates, or timestamps, but I'm not sure how to work on rolling bits of data within a single column. Any help or advice is much appreciated.
DN.
Recommended answer
Here's an approach that works for a simple example of one machine that switches between on and off during the course of one day. It works regardless of whether the machine is in the on or the off state in the first row.
df = pd.DataFrame({'ID': [12, 12, 12, 12, 12],
'on/off': [0,0,1,0,1],
'Time': ['2019-01-01 00:00:00', '2019-01-01 00:00:05', '2019-01-01 00:00:10','2019-01-01 00:00:15','2019-01-01 00:00:20']
})
   ID  on/off                 Time
0  12       0  2019-01-01 00:00:00
1  12       0  2019-01-01 00:00:05
2  12       1  2019-01-01 00:00:10
3  12       0  2019-01-01 00:00:15
4  12       1  2019-01-01 00:00:20
- First I made sure the 'Time' column dtype is datetime64:
df['Time'] = pd.to_datetime(df['Time'])
- Then I get the indices of all rows where the state changed (either from off to on, or from on to off):
s = df[df['on/off'].shift(1) != df['on/off']].index
df = df.loc[s]
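To see why that comparison picks out the change points, it can help to look at the boolean mask on its own. The snippet below is a minimal sketch using the on/off values from the example; the names state and changed are introduced here purely for illustration. The first row always counts as a change, because shift(1) leaves NaN in that position and NaN never compares equal to a real value.

```python
import pandas as pd

# on/off values from the example above
state = pd.Series([0, 0, 1, 0, 1], name='on/off')

# True wherever the value differs from the row before it
changed = state.shift(1) != state

print(changed.tolist())
print(state[changed].index.tolist())
```

The surviving index labels are the same ones that appear in the filtered dataframe of the walkthrough.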
- Then I create a column called 'time shift', which shows the timestamp of the previous row where the power state changed:
df['time shift'] = df['Time'].shift(1)
At this point the dataframe looks like this:
   ID  on/off                 Time           time shift
0  12       0  2019-01-01 00:00:00                  NaT
2  12       1  2019-01-01 00:00:10  2019-01-01 00:00:00
3  12       0  2019-01-01 00:00:15  2019-01-01 00:00:10
4  12       1  2019-01-01 00:00:20  2019-01-01 00:00:15
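- Now, since we want to count the duration that the machine was off, I look at only the row indices where the state became on. Note that this treats 1 as on and 0 as off, the reverse of the convention stated in the question (0 = on, 1 = off); under the question's convention, filter on 0 instead: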
r = df[df['on/off'] == 1].index
df = df.loc[r]
At this point, the dataframe looks as shown below. Notice that the 'time shift' column shows the point at which the machine most recently turned off, and the 'Time' column shows the timestamp when it turned back on. The difference between these two columns gives the length of each period that the machine was off during the day:
   ID  on/off                 Time           time shift
2  12       1  2019-01-01 00:00:10  2019-01-01 00:00:00
4  12       1  2019-01-01 00:00:20  2019-01-01 00:00:15
- The following line calculates the total off-time by summing the durations of each period that the machine was in its off state:
(df['Time'] - df['time shift']).sum()
Output:
Timedelta('0 days 00:00:15')
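The steps above can be gathered into one function and applied to each machine in turn, which is closer to the multi-ID data in the question. What follows is a sketch under stated assumptions, not part of the original answer: the helper total_off_time and its off_value parameter are hypothetical names, the columns are assumed to be 'ID', 'on/off' and 'Time', and the rows for machine 7 are made-up sample data. The default off_value=0 reproduces the walkthrough; passing off_value=1 follows the convention stated in the question.

```python
import pandas as pd

def total_off_time(frame, off_value=0):
    """Sum the completed periods one machine spent in `off_value`.

    Hypothetical helper: off_value=0 mirrors the walkthrough above,
    off_value=1 matches the question's convention (0 = on, 1 = off).
    """
    frame = frame.sort_values('Time')
    # keep only the rows where the state differs from the previous row
    changes = frame[frame['on/off'].shift(1) != frame['on/off']]
    # time elapsed since the previous state change (NaT on the first row)
    elapsed = changes['Time'] - changes['Time'].shift(1)
    # a period in `off_value` ends on a row whose state is anything else
    ended_off = changes['on/off'] != off_value
    # sum() skips the NaT that belongs to the first row
    return elapsed[ended_off].sum()

df = pd.DataFrame({
    'ID': [12, 12, 12, 12, 12, 7, 7, 7],
    'on/off': [0, 0, 1, 0, 1, 1, 0, 1],
    'Time': pd.to_datetime([
        '2019-01-01 00:00:00', '2019-01-01 00:00:05', '2019-01-01 00:00:10',
        '2019-01-01 00:00:15', '2019-01-01 00:00:20',
        '2019-01-01 00:00:00', '2019-01-01 00:00:05', '2019-01-01 00:00:30',
    ]),
})

# one total per machine, computed group by group
per_machine = {machine_id: total_off_time(group)
               for machine_id, group in df.groupby('ID')}
print(per_machine)
```

One limitation to keep in mind: a period that is still open at the last row of a machine's data has no closing timestamp, so it is not counted.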
Some additional context on how the Pandas .shift() method works:
Shift takes all the values in a column and moves them forward or back by a given number of rows, while the index stays in place. .shift(1) moves each value down by one row, so every row ends up holding the value from the row before it. .shift(-1) moves each value up by one row, so every row ends up holding the value from the row after it. Put another way, .shift(1) lets you look at the value of a column at the previous row index, and .shift(-1) lets you look at the value at the next row index, relative to a given row. It's a handy way to compare a column's values across different rows without resorting to for-loops.
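A small illustration of both directions, on a made-up three-value Series (the names s and demo are just for this sketch):

```python
import pandas as pd

s = pd.Series([10, 20, 30], name='value')

demo = pd.DataFrame({
    'value': s,
    'previous': s.shift(1),   # value from the row above; the first row has none
    'next': s.shift(-1),      # value from the row below; the last row has none
})
print(demo)
```

The positions with no neighbour are filled with NaN, which is why the shifted integer columns come back as floats.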