I have a DataFrame of conversations with timestamps, like this:

timestamp   userID  textBlob    new_id
2018-10-05 23:07:02 01  a large text blob...
2018-10-05 23:07:13 01  a large text blob...
2018-10-05 23:07:23 01  a large text blob...
2018-10-05 23:07:36 01  a large text blob...
2018-10-05 23:08:02 01  a large text blob...
2018-10-05 23:09:16 01  a large text blob...
2018-10-05 23:09:21 01  a large text blob...
2018-10-05 23:09:39 01  a large text blob...
2018-10-05 23:09:47 01  a large text blob...
2018-10-05 23:10:01 01  a large text blob...
2018-10-05 23:10:11 01  a large text blob...
2018-10-05 23:10:23 01  restart
2018-10-05 23:10:59 01  a large text blob...
2018-10-05 23:11:03 01  a large text blob...
2018-10-08 23:11:32 02  a large text blob...
2018-10-08 23:12:58 02  a large text blob...
2018-10-08 23:13:16 02  a large text blob...
2018-10-08 23:14:04 02  a large text blob...
2018-10-08 03:38:36 02  a large text blob...
2018-10-08 03:38:42 02  a large text blob...
2018-10-08 03:38:52 02  a large text blob...
2018-10-08 03:38:57 02  a large text blob...
2018-10-08 03:39:10 02  a large text blob...
2018-10-08 03:39:27 02  Restart
2018-10-08 03:40:47 02  a large text blob...
2018-10-08 03:40:54 02  a large text blob...
2018-10-08 03:41:02 02  a large text blob...
2018-10-08 03:41:12 02  a large text blob...
2018-10-08 03:41:32 02  a large text blob...
2018-10-08 03:41:39 02  a large text blob...
2018-10-08 03:42:20 02  a large text blob...
2018-10-08 03:44:58 02  a large text blob...
2018-10-08 03:45:54 02  a large text blob...
2018-10-08 03:46:06 02  a large text blob...
2018-10-08 05:06:42 03  a large text blob...
2018-10-08 05:06:53 03  a large text blob...
2018-10-08 05:08:49 03  a large text blob...
2018-10-08 05:08:58 03  a large text blob...
2018-10-08 05:58:18 04  a large text blob...
2018-10-08 05:58:26 04  a large text blob...
2018-10-08 05:58:37 04  a large text blob...
2018-10-08 05:58:58 04  a large text blob...
2018-10-08 06:00:31 04  a large text blob...
2018-10-08 06:01:00 04  a large text blob...
2018-10-08 06:01:14 04  a large text blob...
2018-10-08 06:02:03 04  a large text blob...
2018-10-08 06:02:03 04  a large text blob...
2018-10-08 06:06:03 04  a large text blob...
2018-10-08 06:10:00 04  a large text blob...
2018-10-08 09:07:03 04  a large text blob...
2018-10-08 09:09:03 04  a large text blob...
2018-10-09 10:01:00 04  a large text blob...
2018-10-09 10:02:00 04  a large text blob...
2018-10-09 10:03:00 04  a large text blob...
2018-10-09 10:09:00 04  a large text blob...
2018-10-09 10:09:00 05  a large text blob...

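I want to label each conversation in the DataFrame with an id. The problem is that a user can have several conversations (that is, one userID can be associated with multiple textBlob rows), so I want to add a new_id column that identifies the conversations in the DataFrame above.
To do this, I want to start a new new_id based on three conditions:
A gap of more than 10 minutes between messages
The appearance of a keyword (restart)
The user has no more text blobs
The expected output looks like this: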
timestamp   userID  textBlob    new_id
2018-10-05 23:07:02 01  a large text blob...    001
2018-10-05 23:07:13 01  a large text blob...    001
2018-10-05 23:07:23 01  a large text blob...    001
2018-10-05 23:07:36 01  a large text blob...    001
2018-10-05 23:08:02 01  a large text blob...    001
2018-10-05 23:09:16 01  a large text blob...    001
2018-10-05 23:09:21 01  a large text blob...    001
2018-10-05 23:09:39 01  a large text blob...    001
2018-10-05 23:09:47 01  a large text blob...    001
2018-10-05 23:10:01 01  a large text blob...    001
2018-10-05 23:10:11 01  a large text blob...    001
2018-10-05 23:10:23 01  restart                 001   ---- (The word restart appeared so a new id is created ↓)
2018-10-05 23:10:59 01  a large text blob...    002
2018-10-05 23:11:03 01  a large text blob...    002
2018-10-08 23:11:32 02  a large text blob...    002
2018-10-08 23:12:58 02  a large text blob...    002
2018-10-08 23:13:16 02  a large text blob...    002
2018-10-08 23:14:04 02  a large text blob...    002   --- (The conversation ends because the 10 minutes time threshold was exceeded)
2018-10-08 03:38:36 02  a large text blob...    003
2018-10-08 03:38:42 02  a large text blob...    003
2018-10-08 03:38:52 02  a large text blob...    003
2018-10-08 03:38:57 02  a large text blob...    003
2018-10-08 03:39:10 02  a large text blob...    003
2018-10-08 03:39:27 02  Restart                 003   ---- (The word restart appeared so a new id is created ↓)
2018-10-08 03:40:47 02  a large text blob...    004
2018-10-08 03:40:54 02  a large text blob...    004
2018-10-08 03:41:02 02  a large text blob...    004
2018-10-08 03:41:12 02  a large text blob...    004
2018-10-08 03:41:32 02  a large text blob...    004
2018-10-08 03:41:39 02  a large text blob...    004
2018-10-08 03:42:20 02  a large text blob...    004
2018-10-08 03:44:58 02  a large text blob...    004
2018-10-08 03:45:54 02  a large text blob...    004
2018-10-08 03:46:06 02  a large text blob...    004     ---- (The 10 minutes threshold is exceeded, so a new id is assigned ↓)
2018-10-08 05:06:42 03  a large text blob...    005
2018-10-08 05:06:53 03  a large text blob...    005
2018-10-08 05:08:49 03  a large text blob...    005
2018-10-08 05:08:58 03  a large text blob...    005     ---- (no more conversations from user id 03, thus a new id is assigned)
2018-10-08 05:58:18 04  a large text blob...    006
2018-10-08 05:58:26 04  a large text blob...    006
2018-10-08 05:58:37 04  a large text blob...    006
2018-10-08 05:58:58 04  a large text blob...    006
2018-10-08 06:00:31 04  a large text blob...    006
2018-10-08 06:01:00 04  a large text blob...    006
2018-10-08 06:01:14 04  a large text blob...    006
2018-10-08 06:02:03 04  a large text blob...    006     ---- (The 10 minutes threshold is exceeded, so a new id is assigned ↓)
2018-10-08 06:02:03 04  a large text blob...    007
2018-10-08 06:06:03 04  a large text blob...    007
2018-10-08 06:10:00 04  a large text blob...    007
2018-10-08 09:07:03 04  a large text blob...    007
2018-10-08 09:09:03 04  a large text blob...    007     ---- (The 10 minutes threshold is exceeded, so a new id is assigned ↓)
2018-10-09 10:01:00 04  a large text blob...    008
2018-10-09 10:02:00 04  a large text blob...    008
2018-10-09 10:03:00 04  a large text blob...    008
2018-10-09 10:09:00 04  a large text blob...    008     ---- (no more conversations from user id 04, thus a new id is assigned)
2018-10-09 10:09:00 05  a large text blob...    010

So far I have tried:
searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))

and
dif = df['timestamp'] - df['timestamp'].shift()
periods = dif > pd.Timedelta('10 min')
times = periods.cumsum().apply(lambda x: x+1)
df['time_id'] = times

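However, I also need to take the user id into account, and I end up with several separate columns. Is there a way to satisfy all three conditions and get the expected output?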

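Best answer

You're most of the way there. To put it all together, build a boolean mask for each condition, combine the masks with a logical OR, convert the result to int, and take its cumulative sum: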

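# True where more than 10 minutes have passed since the previous row
mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm')
# True where the user changes (diff() requires a numeric userID column)
mask2 = df['userID'].diff() != 0
# True on the row right after a "restart" message, in any letter case
mask3 = df['textBlob'].shift().str.lower() == 'restart'

# Each True starts a new conversation, so the running total is the id
df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()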

# Result:
print(df.to_string(index=False))

timestamp  userID              textBlob  new_id
2018-10-05 23:07:02       1  a_large_text_blob...       1
2018-10-05 23:07:13       1  a_large_text_blob...       1
2018-10-05 23:07:23       1  a_large_text_blob...       1
2018-10-05 23:07:36       1  a_large_text_blob...       1
2018-10-05 23:08:02       1  a_large_text_blob...       1
2018-10-05 23:09:16       1  a_large_text_blob...       1
2018-10-05 23:09:21       1  a_large_text_blob...       1
2018-10-05 23:09:39       1  a_large_text_blob...       1
2018-10-05 23:09:47       1  a_large_text_blob...       1
2018-10-05 23:10:01       1  a_large_text_blob...       1
2018-10-05 23:10:11       1  a_large_text_blob...       1
2018-10-05 23:10:23       1               restart       1
2018-10-05 23:10:59       1  a_large_text_blob...       2
2018-10-05 23:11:03       1  a_large_text_blob...       2
2018-10-08 03:11:32       2  a_large_text_blob...       3
2018-10-08 03:12:58       2  a_large_text_blob...       3
2018-10-08 03:13:16       2  a_large_text_blob...       3
2018-10-08 03:14:04       2  a_large_text_blob...       3
2018-10-08 03:38:36       2  a_large_text_blob...       4
2018-10-08 03:38:42       2  a_large_text_blob...       4
2018-10-08 03:38:52       2  a_large_text_blob...       4
2018-10-08 03:38:57       2  a_large_text_blob...       4
2018-10-08 03:39:10       2  a_large_text_blob...       4
2018-10-08 03:39:27       2               Restart       4
2018-10-08 03:40:47       2  a_large_text_blob...       5
2018-10-08 03:40:54       2  a_large_text_blob...       5
2018-10-08 03:41:02       2  a_large_text_blob...       5
2018-10-08 03:41:12       2  a_large_text_blob...       5
2018-10-08 03:41:32       2  a_large_text_blob...       5
2018-10-08 03:41:39       2  a_large_text_blob...       5
2018-10-08 03:42:20       2  a_large_text_blob...       5
2018-10-08 03:44:58       2  a_large_text_blob...       5
2018-10-08 03:45:54       2  a_large_text_blob...       5
2018-10-08 03:46:06       2  a_large_text_blob...       5
2018-10-08 05:06:42       3  a_large_text_blob...       6
2018-10-08 05:06:53       3  a_large_text_blob...       6
2018-10-08 05:08:49       3  a_large_text_blob...       6
2018-10-08 05:08:58       3  a_large_text_blob...       6
2018-10-08 05:58:18       4  a_large_text_blob...       7
2018-10-08 05:58:26       4  a_large_text_blob...       7
2018-10-08 05:58:37       4  a_large_text_blob...       7
2018-10-08 05:58:58       4  a_large_text_blob...       7
2018-10-08 06:00:31       4  a_large_text_blob...       7
2018-10-08 06:01:00       4  a_large_text_blob...       7
2018-10-08 06:01:14       4  a_large_text_blob...       7
2018-10-08 06:02:03       4  a_large_text_blob...       7
2018-10-08 06:02:03       4  a_large_text_blob...       7
2018-10-08 06:06:03       4  a_large_text_blob...       7
2018-10-08 06:10:00       4  a_large_text_blob...       7
2018-10-08 09:07:03       4  a_large_text_blob...       8
2018-10-08 09:09:03       4  a_large_text_blob...       8
2018-10-09 10:01:00       4  a_large_text_blob...       9
2018-10-09 10:02:00       4  a_large_text_blob...       9
2018-10-09 10:03:00       4  a_large_text_blob...       9
2018-10-09 10:09:00       4  a_large_text_blob...       9
2018-10-09 10:09:00       5  a_large_text_blob...      10
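
For anyone who wants to try this without the full dataset, here is a minimal self-contained sketch of the same mask-and-cumsum idea. The eight-row frame is a made-up miniature of the data above, and the `gap`, `new_user`, `after_restart` and `new_id_str` names are my own. It differs from the snippet above in two ways that are assumptions on my part: userID is kept as a string (so the user change is detected with `ne(shift())` rather than `diff()`), and the id is also zero-padded to match the `001` style in the question.

```python
import pandas as pd

# Hypothetical miniature of the data above; userID is deliberately a string.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-10-05 23:07:02",
        "2018-10-05 23:10:23",
        "2018-10-05 23:10:59",
        "2018-10-08 03:11:32",
        "2018-10-08 03:38:36",
        "2018-10-08 03:39:27",
        "2018-10-08 03:40:47",
        "2018-10-08 05:06:42",
    ]),
    "userID": ["01", "01", "01", "02", "02", "02", "02", "03"],
    "textBlob": [
        "a large text blob...", "restart", "a large text blob...",
        "a large text blob...", "a large text blob...", "Restart",
        "a large text blob...", "a large text blob...",
    ],
})

# Condition 1: more than 10 minutes since the previous row.
gap = df["timestamp"].diff() > pd.Timedelta(minutes=10)

# Condition 2: the user differs from the previous row (works for strings too).
new_user = df["userID"].ne(df["userID"].shift())

# Condition 3: the previous row was a "restart" message, in any letter case.
is_restart = df["textBlob"].str.strip().str.lower().eq("restart")
after_restart = is_restart.shift(fill_value=False)

# Every True marks the first row of a new conversation.
df["new_id"] = (gap | new_user | after_restart).astype(int).cumsum()

# Optional: zero-padded string version, as in the question's expected output.
df["new_id_str"] = df["new_id"].astype(str).str.zfill(3)

print(df.to_string(index=False))
```

The rows must already be sorted by user and then by timestamp for this to work, since every mask compares a row only with the one directly before it.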

Regarding "python - Trouble creating a new id column based on three conditions?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/53248280/
