I have some trajectories created from movements between clusters, for example:

user_id,trajectory
11011,[[86], [110], [110]]
2139671,[[89], [125]]
3945641,[[36], [73], [110], [110]]
10024312,[[123], [27], [97], [97], [97], [110]]
14270422,[[0], [110], [174]]
14283758,[[110], [184]]
14317445,[[50], [88]]
14331818,[[0], [22], [36], [131], [131]]
14334591,[[107], [19]]
14373703,[[35], [97], [97], [97], [17], [58]]

I want to split trajectories with multiple moves into individual segments, but I'm not sure how to do the split.
Example:
14373703,[[35], [97], [97], [97], [17], [58]]

into:
14373703,[[35,97], [97,97], [97,97], [97,17], [17,58]]
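The pairwise split itself is just a zip of the sequence against itself shifted by one; a minimal sketch on the flattened label sequence (note that six points yield five consecutive pairs, including the repeated [97, 97]):

```python
# Flattened cluster-label sequence for user 14373703
traj = [35, 97, 97, 97, 17, 58]

# Pair each label with its successor: n points -> n - 1 moves
pairs = [list(p) for p in zip(traj, traj[1:])]
print(pairs)  # [[35, 97], [97, 97], [97, 97], [97, 17], [17, 58]]
```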

The goal is to use these pairs as edges in NetworkX, analyze them as a graph, and identify dense movement (edges) between the individual clusters (nodes).
Here is the code I originally used to create the trajectories:
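As a sketch of that graph step, assuming trajectories already split into the pair format above (the sample dict below is hypothetical), the pairs can be counted and loaded into a weighted directed graph:

```python
from collections import Counter

import networkx as nx

# Hypothetical per-user edge lists, already in the split pair format
trajectories = {
    14373703: [[35, 97], [97, 97], [97, 97], [97, 17], [17, 58]],
    14270422: [[0, 110], [110, 174]],
}

# Count how often each cluster-to-cluster move occurs across all users
edge_counts = Counter(tuple(p) for pairs in trajectories.values() for p in pairs)

# Build a directed graph whose edge weights are those move counts
G = nx.DiGraph()
for (u, v), weight in edge_counts.items():
    G.add_edge(u, v, weight=weight)

# "Dense movement" = the heaviest edges
dense = sorted(G.edges(data='weight'), key=lambda e: e[2], reverse=True)
print(dense[0])  # (97, 97, 2)
```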
# Import Data
import numpy as np
import pandas as pd

data = pd.read_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_cluster_outputs.csv', delimiter=',', engine='python')
# print(len(data), "rows")

# Create Data Frame
df = pd.DataFrame(data, columns=['user_id', 'timestamp', 'latitude', 'longitude', 'cluster_labels'])

# Filter Data Frame by count of user_id
filtered = df.groupby('user_id').filter(lambda x: x['user_id'].count() > 1)
# filtered.to_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_final_filtered.csv', index=False, header=True)

# Get a list of unique user_id values
uniqueIds = np.unique(filtered['user_id'].values)

# Get the ordered (by timestamp) cluster labels for each user_id
output = [[uid, filtered.loc[filtered['user_id'] == uid].sort_values(by='timestamp')[['cluster_labels']].values.tolist()] for uid in uniqueIds]

# Save outputs as csv
outputs = pd.DataFrame(output)
# print(outputs)
headers = ['user_id', 'trajectory']
outputs.to_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_cluster_moves.csv', index=False, header=headers)

If splitting like this is possible, can it be done during processing rather than afterwards? I'd like to perform it at creation time, to eliminate any post-processing.

Best answer

I think you can use groupby with apply, passing a custom function that zips the column against itself shifted by one in a list comprehension to output the lists of pairs.
Note:
count returns the number of non-NaN values; to filter by the full group length including NaNs, len is better.
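A minimal illustration of that count-versus-len distinction on a toy Series:

```python
import numpy as np
import pandas as pd

# A group with one real value and one NaN
s = pd.Series([1.0, np.nan])

print(s.count())  # 1 -- NaN excluded
print(len(s))     # 2 -- NaN included
```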

# filtering and sorting
filtered = df.groupby('user_id').filter(lambda x: len(x['user_id'])>1)
filtered = filtered.sort_values(by='timestamp')

f = lambda x: [list(a) for a in zip(x[:-1], x[1:])]
df2 = filtered.groupby('user_id')['cluster_labels'].apply(f).reset_index()
print(df2)
    user_id                                     cluster_labels
0     11011                            [[86, 110], [110, 110]]
1   2139671                                        [[89, 125]]
2   3945641                  [[36, 73], [73, 110], [110, 110]]
3  10024312  [[123, 27], [27, 97], [97, 97], [97, 97], [97,...
4  14270422                             [[0, 110], [110, 174]]
5  14283758                                       [[110, 184]]
6  14373703  [[35, 97], [97, 97], [97, 97], [97, 17], [17, ...

A similar solution, with the filtering as the last step:
filtered = filtered.sort_values(by='timestamp')

f = lambda x: [list(a) for a in zip(x[:-1], x[1:])]
df2 = filtered.groupby('user_id')['cluster_labels'].apply(f).reset_index()
df2 = df2[df2['cluster_labels'].str.len() > 0]
print(df2)
    user_id                                     cluster_labels
1     11011                            [[86, 110], [110, 110]]
2   2139671                                        [[89, 125]]
3   3945641                  [[36, 73], [73, 110], [110, 110]]
4  10024312  [[123, 27], [27, 97], [97, 97], [97, 97], [97,...
5  14270422                             [[0, 110], [110, 174]]
6  14283758                                       [[110, 184]]
7  14373703  [[35, 97], [97, 97], [97, 97], [97, 17], [17, ...
