I have a dataframe where I need to group elements that are no more than 1 apart. For example, if this is my df:

        group_number  val
    0              1    5
    1              1    8
    2              1   12
    3              1   13
    4              1   22
    5              1   26
    6              1   31
    7              2    7
    8              2   16
    9              2   17
    10             2   19
    11             2   29
    12             2   33
    13             2   62

So I need to group by both group_number and val, putting rows together when their val values differ by no more than 1. In this example, rows 2 and 3 would be grouped together, and so would rows 8 and 9. I tried using diff and related functions, but I couldn't figure it out. Any help will be appreciated!

Solution

Using diff is the right approach: just combine it with gt and cumsum and you have your groups. The idea is to take a cumulative sum over the differences that are bigger than your threshold.
Differences larger than your threshold become True, while differences equal to or lower than your threshold become False. Cumulatively summing over the boolean values leaves the running count unchanged wherever the difference is at or below the threshold, so those rows get the same group number.

    max_distance = 1

    # Sort by val, take the difference to the previous val within each
    # group_number, flag gaps above the threshold, and count the flags.
    df["group_diff"] = df.sort_values("val")\
                         .groupby("group_number")["val"]\
                         .diff()\
                         .gt(max_distance)\
                         .cumsum()

    print(df)

        group_number  val  group_diff
    0              1    5           0
    1              1    8           1
    2              1   12           2
    3              1   13           2
    4              1   22           5
    5              1   26           6
    6              1   31           8
    7              2    7           0
    8              2   16           3
    9              2   17           3
    10             2   19           4
    11             2   29           7
    12             2   33           9
    13             2   62          10

You can now use groupby on group_number and group_diff and see the resulting groups with the following:

    grouped = df.groupby(["group_number", "group_diff"])
    print(grouped.groups)

    {(1, 0): Int64Index([0], dtype='int64'),
     (1, 1): Int64Index([1], dtype='int64'),
     (1, 2): Int64Index([2, 3], dtype='int64'),
     (1, 5): Int64Index([4], dtype='int64'),
     (1, 6): Int64Index([5], dtype='int64'),
     (1, 8): Int64Index([6], dtype='int64'),
     (2, 0): Int64Index([7], dtype='int64'),
     (2, 3): Int64Index([8, 9], dtype='int64'),
     (2, 4): Int64Index([10], dtype='int64'),
     (2, 7): Int64Index([11], dtype='int64'),
     (2, 9): Int64Index([12], dtype='int64'),
     (2, 10): Int64Index([13], dtype='int64')}

Thanks @jezrael for the hint of avoiding a new column to increase performance:

    # Pass the Series straight to groupby instead of storing it as a column.
    group_diff = df.sort_values("val")\
                   .groupby("group_number")["val"]\
                   .diff()\
                   .gt(max_distance)\
                   .cumsum()

    grouped = df.groupby(["group_number", group_diff])
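
One thing worth checking on your own data: in the answer's chain the diff is taken per group_number, but the cumsum runs once over the whole frame sorted by val. As far as I can tell, this means a large gap in one group that falls between two close values of another group bumps the counter and splits that pair. The sample data happens not to hit this case. Below is a minimal sketch of a variant that restarts the count inside each group_number. The helper name label_close_values, its parameters, the cluster_id column, and the small interleaved frame are all made up for illustration and are not part of the original answer.

```python
import pandas as pd


def label_close_values(df, max_distance, group_col="group_number", val_col="val"):
    """Hypothetical helper (not from the answer): cluster ids counted per group."""
    # Sort so that neighbouring values of the same group sit next to each other.
    ordered = df.sort_values([group_col, val_col])
    # True where the gap to the previous value in the same group exceeds the
    # threshold; the first row of each group has a NaN diff, which gives False.
    starts_new = ordered.groupby(group_col)[val_col].diff().gt(max_distance)
    # Running count restarted inside each group, so other groups cannot shift it.
    ids = starts_new.astype(int).groupby(ordered[group_col]).cumsum()
    # The result keeps df's index labels, so assigning it to a column aligns rows.
    return ids


# Sample data copied from the question.
df = pd.DataFrame({
    "group_number": [1] * 7 + [2] * 7,
    "val": [5, 8, 12, 13, 22, 26, 31, 7, 16, 17, 19, 29, 33, 62],
})
df["cluster_id"] = label_close_values(df, max_distance=1)

# Keep only the clusters that contain more than one row.
groups = df.groupby(["group_number", "cluster_id"]).groups
multi_row = {
    (int(g), int(c)): [int(i) for i in idx]
    for (g, c), idx in groups.items()
    if len(idx) > 1
}
print(multi_row)  # {(1, 2): [2, 3], (2, 1): [8, 9]}

# Made-up frame where group 2 has a big jump (5 -> 12.5) between 12 and 13 of group 1.
interleaved = pd.DataFrame({"group_number": [1, 1, 2, 2], "val": [12, 13, 5, 12.5]})
interleaved_ids = label_close_values(interleaved, max_distance=1)
print(interleaved_ids.sort_index().tolist())  # [0, 0, 0, 1]
```

The ids now count up from 0 within each group_number, so they differ from the group_diff values shown in the answer, but rows 2 and 3 and rows 8 and 9 still end up together. If your groups never interleave like this, the original one-liner gives the same grouping.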