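I am trying to extract ranges of nearby values in pandas when the interval boundaries are not known in advance.
Suppose I have a df such as: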

import pandas as pd

df = pd.DataFrame({'range': ['range1','range1','range1','range1','range1','range1','range1','range1','range1','range1','range1','range1','range1','range1',
                             'range2','range2','range2','range2','range2','range2','range2','range2',
                             'range3','range3','range3','range3','range3','range3','range3','range3','range3','range3','range3','range3','range3','range3','range3','range3'],
                   'pos1':[1,2,3,4,100,101,102,104,107,108,207,208,209,210,
                           10,11,12,50,51,52,54,55,
                           50,51,52,53,107,108,109,110,111,112,113,800,802,803,804,805]})


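As you can see, within each range the numbers always increase, and sometimes there is a large jump between consecutive numbers.
I only need to write the final output to a file, so it does not have to be a DataFrame. I would like the final output to look like this: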

range1    1    4
range1    100  108
range1    207  210
range2    10   12
range2    50   55
range3    50   53
range3    107  113
range3    800  805


I tried the following (it is ugly), but my output is missing all of range2 as well as the last interval of range1 and of range3:

ranges = []
tmp = []
for r1, r2, p1, p2 in zip(df['range'], df['range'][1:], df['pos1'], df['pos1'][1:]):
    if r1 == r2 and (p1+10 > p2):
        tmp.append(p1)
    elif r1 == r2 and (p1+10 < p2):
        tmp.append(p1)
        ranges.append((r1, tmp))
        tmp = []

f = open('ranges.txt', 'w')
for x in ranges:
    f.write(x[0]+'\t'+str(min(x[1]))+'\t'+str(max(x[1]))+'\n')


Output:

range1  1       4
range1  100     108
range3  50      53
range3  107     113

Best answer

I would do something like this (you should change the print call to write to a file instead):

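thresh = 10
# Within each range, flag rows whose gap to the previous row exceeds thresh,
# then take a running count of the flags so every run of close values shares one label.
s = df.groupby('range')['pos1'].diff().gt(thresh).cumsum()

# Group by range and run label; each group is one interval.
for (r,g), d in df.groupby(['range',s])['pos1']:
    print(r, list(d))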


Output:

range1 [1, 2, 3, 4]
range1 [100, 101, 102, 104, 107, 108]
range1 [207, 208, 209, 210]
range2 [10, 11, 12]
range2 [50, 51, 52, 54, 55]
range3 [50, 51, 52, 53]
range3 [107, 108, 109, 110, 111, 112, 113]
range3 [800, 802, 803, 804, 805]
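To get only the start and end of each run, written to a file as the question asks, the same grouping key can feed an aggregation instead of a loop over the full lists. The following is a minimal sketch rather than a definitive version: the names jump, run_id and intervals are made up for illustration, the threshold of 10 is the one used above, the file name ranges.txt comes from the question, and the keyword form of agg assumes pandas 0.25 or newer.

```python
import pandas as pd

df = pd.DataFrame({
    'range': ['range1'] * 14 + ['range2'] * 8 + ['range3'] * 16,
    'pos1': [1, 2, 3, 4, 100, 101, 102, 104, 107, 108, 207, 208, 209, 210,
             10, 11, 12, 50, 51, 52, 54, 55,
             50, 51, 52, 53, 107, 108, 109, 110, 111, 112, 113,
             800, 802, 803, 804, 805],
})

thresh = 10

# True where the gap to the previous row of the same range exceeds thresh.
# The first row of each range has a NaN diff, which compares as False.
jump = df.groupby('range')['pos1'].diff().gt(thresh)

# Running count of jumps: rows belonging to the same run share one label.
# The label is only unique together with 'range', so both are used as keys.
run_id = jump.cumsum().rename('run')

# Smallest and largest position of every (range, run) pair.
intervals = (df.groupby(['range', run_id])['pos1']
               .agg(start='min', end='max')
               .reset_index()
               .drop(columns='run'))

# One tab-separated line per interval.
with open('ranges.txt', 'w') as f:
    for row in intervals.itertuples(index=False):
        f.write(f'{row.range}\t{row.start}\t{row.end}\n')
```

Because the values increase within each range, min and max are the first and last value of every run. If that ordering were not guaranteed, sorting by range and pos1 before computing the diff would probably be needed.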

Regarding "python - Pandas groupby range of values when the ranges are unknown", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60309613/
