I'm trying to extract ranges of values in pandas when the range boundaries are not known in advance.
Suppose I have a df such as:
import pandas as pd

df = pd.DataFrame({
    'range': ['range1'] * 14 + ['range2'] * 8 + ['range3'] * 16,
    'pos1': [1, 2, 3, 4, 100, 101, 102, 104, 107, 108, 207, 208, 209, 210,
             10, 11, 12, 50, 51, 52, 54, 55,
             50, 51, 52, 53, 107, 108, 109, 110, 111, 112, 113, 800, 802, 803, 804, 805],
})
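As you can see, within each range the numbers always increase, and sometimes there is a large jump between consecutive numbers.
I only need to write the result to a file in the end, so it does not have to be a DataFrame. I would like the final output to look like this: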
range1 1 4
range1 100 108
range1 207 210
range2 10 12
range2 50 55
range3 50 53
range3 107 113
range3 800 805
I tried the following (it's ugly), but my output is missing all of range2, as well as the last interval of range1 and of range3.
ranges = []
tmp = []
for r1, r2, p1, p2 in zip(df['range'], df['range'][1:], df['pos1'], df['pos1'][1:]):
    if r1 == r2 and (p1+10 > p2):
        tmp.append(p1)
    elif r1 == r2 and (p1+10 < p2):
        tmp.append(p1)
        ranges.append((r1, tmp))
        tmp = []
with open('ranges.txt', 'w') as f:
    for x in ranges:
        f.write(x[0]+'\t'+str(min(x[1]))+'\t'+str(max(x[1]))+'\n')
Output:
range1 1 4
range1 100 108
range3 50 53
range3 107 113
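Best answer
I would do something like this (you should modify the print call to write to a file instead):
thresh = 10
# A gap larger than thresh within a range starts a new block; cumsum turns the flags into block ids.
s = df.groupby('range')['pos1'].diff().gt(thresh).cumsum()
for (r, g), d in df.groupby(['range', s])['pos1']:
    print(r, list(d))
Output: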
range1 [1, 2, 3, 4]
range1 [100, 101, 102, 104, 107, 108]
range1 [207, 208, 209, 210]
range2 [10, 11, 12]
range2 [50, 51, 52, 54, 55]
range3 [50, 51, 52, 53]
range3 [107, 108, 109, 110, 111, 112, 113]
range3 [800, 802, 803, 804, 805]
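To get from these lists to the three-column output the question asks for, one option is to aggregate each block to its minimum and maximum instead of printing the full list. The following is a minimal sketch, assuming the same `df` and threshold as above; the helper name `collapse_ranges` and the level name `block` are made up for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'range': ['range1'] * 14 + ['range2'] * 8 + ['range3'] * 16,
    'pos1': [1, 2, 3, 4, 100, 101, 102, 104, 107, 108, 207, 208, 209, 210,
             10, 11, 12, 50, 51, 52, 54, 55,
             50, 51, 52, 53, 107, 108, 109, 110, 111, 112, 113, 800, 802, 803, 804, 805],
})


def collapse_ranges(frame, thresh=10):
    """Hypothetical helper: one row per block, with the block's min and max."""
    # Flag every row whose gap to the previous row in the same range exceeds
    # thresh, then cumsum the flags so each block gets its own id.
    block = (frame.groupby('range')['pos1'].diff()
                  .gt(thresh).cumsum().rename('block'))
    return (frame.groupby(['range', block])['pos1']
                 .agg(['min', 'max'])
                 .droplevel('block')   # the block id is only a grouping aid
                 .reset_index())


result = collapse_ranges(df)

# Tab-separated lines in the layout the question asks for.
lines = [f"{r}\t{lo}\t{hi}" for r, lo, hi in result.itertuples(index=False, name=None)]
print("\n".join(lines))
```

To write the file directly, `result.to_csv('ranges.txt', sep='\t', header=False, index=False)` should produce the same lines. The threshold of 10 is carried over from the answer and may need adjusting for other data.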
Regarding python - Pandas groupby ranges of values when the ranges are unknown, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60309613/