我有一组数据点,这些数据点应该位于一个位点上并遵循某种模式,但是由于我需要一个整洁的位点,以便稍后进行分析,因此我想丢弃一些主要位点上的散点。蓝点越来越少地是我要查找的散点,而无需手动进行操作即可通过复杂的方式将其排除。

我当时在考虑使用类似Nearest Neighbors Regression的方法,但是我不确定这是否是最好的方法,或者我不太熟悉应该如何实施它才能给我合适的结果。顺便说一句,我想做的没有任何合适的过程。
数据的转置版本如下:

X=array([[ 0.87 , -0.01 ,  0.575,  1.212,  0.382,  0.418, -0.01 ,  0.474,
         0.432,  0.702,  0.574,  0.45 ,  0.334,  0.565,  0.414,  0.873,
         0.381,  1.103,  0.848,  0.503,  0.27 ,  0.416,  0.939,  1.211,
         1.106,  0.321,  0.709,  0.744,  0.309,  0.247,  0.47 , -0.107,
         0.925,  1.127,  0.833,  0.963,  0.385,  0.572,  0.437,  0.577,
         0.461,  0.474,  1.046,  0.892,  0.313,  1.009,  1.048,  0.349,
         1.189,  0.302,  0.278,  0.629,  0.36 ,  1.188,  0.273,  0.191,
        -0.068,  0.95 ,  1.044,  0.776,  0.726,  1.035,  0.817,  0.55 ,
         0.387,  0.476,  0.473,  0.863,  0.252,  0.664,  0.365,  0.244,
         0.238,  1.203,  0.339,  0.528,  0.326,  0.347,  0.385,  1.139,
         0.748,  0.879,  0.324,  0.265,  0.328,  0.815,  0.38 ,  0.884,
         0.571,  0.416,  0.485,  0.683,  0.496,  0.488,  1.204,  1.18 ,
         0.465,  0.34 ,  0.335,  0.447,  0.28 ,  1.02 ,  0.519,  0.335,
         1.037,  1.126,  0.323,  0.452,  0.201,  0.321,  0.285,  0.587,
         0.292,  0.228,  0.303,  0.844,  0.229,  1.077,  0.864,  0.515,
         0.071,  0.346,  0.255,  0.88 ,  0.24 ,  0.533,  0.725,  0.339,
         0.546,  0.841,  0.43 ,  0.568,  0.311,  0.401,  0.212,  0.691,
         0.565,  0.292,  0.295,  0.587,  0.545,  0.817,  0.324,  0.456,
         0.267,  0.226,  0.262,  0.338,  1.124,  0.373,  0.814,  1.241,
         0.661,  0.229,  0.416,  1.103,  0.226,  1.168,  0.616,  0.593,
         0.803,  1.124,  0.06 ,  0.573,  0.664,  0.882,  0.286,  0.139,
         1.095,  1.112,  1.167,  0.589,  0.3  ,  0.578,  0.727,  0.252,
         0.174,  0.317,  0.427,  1.184,  0.397,  0.43 ,  0.229,  0.261,
         0.632,  0.938,  0.576,  0.37 ,  0.497,  0.54 ,  0.306,  0.315,
         0.335,  0.24 ,  0.344,  0.93 ,  0.134,  0.4  ,  0.223,  1.224,
         1.187,  1.031,  0.25 ,  0.53 , -0.147,  0.087,  0.374,  0.496,
         0.441,  0.884,  0.971,  0.749,  0.432,  0.582,  0.198,  0.615,
         1.146,  0.475,  0.595,  0.304,  0.416,  0.645,  0.281,  0.576,
         1.139,  0.316,  0.892,  0.648,  0.826,  0.299,  0.381,  0.926,
         0.606],
       [-0.154, -0.392, -0.262,  0.214, -0.403, -0.363, -0.461, -0.326,
        -0.349, -0.21 , -0.286, -0.358, -0.436, -0.297, -0.394, -0.166,
        -0.389,  0.029, -0.124, -0.335, -0.419, -0.373, -0.121,  0.358,
         0.042, -0.408, -0.189, -0.213, -0.418, -0.479, -0.303, -0.645,
        -0.153,  0.098, -0.171, -0.066, -0.368, -0.273, -0.329, -0.295,
        -0.362, -0.305, -0.052, -0.171, -0.406, -0.102,  0.011, -0.375,
         0.126, -0.411, -0.42 , -0.27 , -0.407,  0.144, -0.419, -0.465,
        -0.036, -0.099,  0.007, -0.167, -0.205, -0.011, -0.151, -0.267,
        -0.368, -0.342, -0.299, -0.143, -0.42 , -0.232, -0.368, -0.417,
        -0.432,  0.171, -0.388, -0.319, -0.407, -0.379, -0.353,  0.043,
        -0.211, -0.14 , -0.373, -0.431, -0.383, -0.142, -0.345, -0.144,
        -0.302, -0.38 , -0.337, -0.2  , -0.321, -0.269,  0.406,  0.223,
        -0.322, -0.395, -0.379, -0.324, -0.424,  0.01 , -0.298, -0.386,
         0.018,  0.157, -0.384, -0.327, -0.442, -0.388, -0.387, -0.272,
        -0.397, -0.415, -0.388, -0.106, -0.504,  0.034, -0.153, -0.32 ,
        -0.271, -0.417, -0.417, -0.136, -0.447, -0.279, -0.225, -0.372,
        -0.316, -0.161, -0.331, -0.261, -0.409, -0.338, -0.437, -0.242,
        -0.328, -0.403, -0.433, -0.274, -0.331, -0.163, -0.361, -0.298,
        -0.392, -0.447, -0.429, -0.388,  0.11 , -0.348, -0.174,  0.244,
        -0.182, -0.424, -0.319,  0.088, -0.547,  0.189, -0.216, -0.228,
        -0.17 ,  0.125, -0.073, -0.266, -0.234, -0.108, -0.395, -0.395,
         0.131,  0.074,  0.514, -0.235, -0.389, -0.288, -0.22 , -0.416,
        -0.777, -0.358, -0.31 ,  0.817, -0.363, -0.328, -0.424, -0.416,
        -0.248, -0.093, -0.28 , -0.357, -0.348, -0.298, -0.384, -0.394,
        -0.362, -0.415, -0.349, -0.08 , -0.572, -0.07 , -0.423,  0.359,
         0.4  ,  0.099, -0.426, -0.252, -0.697, -0.508, -0.348, -0.254,
        -0.307, -0.116, -0.029, -0.201, -0.302, -0.25 , -0.44 , -0.233,
         0.274, -0.295, -0.223, -0.398, -0.298, -0.209, -0.389, -0.247,
         0.225, -0.395, -0.124, -0.237, -0.104, -0.361, -0.335, -0.083,
        -0.254]])

最佳答案

我想提供以下过程,但不一定是一个完美的答案,而是作为您进行升级或基于此开发类似产品的初创公司。

怎么了?该过程将这些点分别对应于它们的x值。对于每个组(箱),将计算平均y值,并丢弃偏差最大且超过某个预定义限制的点。然后再次计算平均值,依此类推。如果没有更多点要丢弃,则考虑下一个容器。您会在代码中找到注释(希望)能给出更清晰的解释。

这是代码:

def discard(X):
    """
    Group points together in x-bins; discard points for every bin which deviate more than dy from the average in an iterative procedure.
    """

    dx, dy = 0.1, 0.1   # dx: bin size; dy: max. deviation

    points = sorted(zip(X[0], X[1]), key=lambda p: p[0])    # sort the points respective to x

    xx = points[0][0]   # the smallest x-value
    xmax = points[-1][0]    # the greatest x-value

    while xx < xmax:    # loop over all bins
        loop = True
        while loop:
            tmp = [p for p in points if p[0] >= xx and p[0] < xx+dx]    # all points in the current bin
            try:
                av = sum([p[1] for p in tmp]) / len(tmp)    # the average y-value
            except ZeroDivisionError:    # no points within this bin, continue with next bin
                break
            dev = sorted([p for p in tmp if abs(p[1]-av) > dy], key=lambda p: abs(p[1]-av)) # all points which deviate more than dy from the average sorted by their deviation
            try:
                points.remove(dev[-1])  # discard the point with the greatest deviation
            except IndexError:
                loop = False    # if no point is deviating more than dy continue with the next bin
        xx += dx

    return [ [p[0] for p in points], [p[1] for p in points] ]



结果显然取决于dxdy的选择。以下是一些示例(蓝点分别被丢弃)。对于dx, dy = 0.1, 0.1



如您所见,由于图形的斜率较大(因此最好使用更大的dy),因此在右尾有很多点被丢弃。

对于dx, dy = 0.10, 0.15



由于使用了较大的dy,因此在这种情况下将丢弃较少的点。但是,重要的是要确保每个容器中都包含足够的点,否则该过程可能会失败并丢弃错误的点,如您在下一个图的左尾(对于dx, dy = 0.09, 0.15)所观察到的:



那么,从您知道使用哪个dx, dy呢?最好的解决方案可能是使它们保持可变。例如:

选择dx使得每个容器中都有一定数量的最小点数,以避免如上一个示例中的不良丢弃。
可以从曲线的斜率计算dy,即两个相邻bin的平均值之差除以它们bin中心的差。因此,更大的斜率导致更大的dy

此过程在某种程度上类似于最近的邻居算法,但是它不会例如检查每个点。这也是进行修改的空间:您可以选择n而不是选择dx最近的邻居,以使间隔[x-dx, x+dx]包含n点,并应用上述过程。

使用实际的最近邻居算法可能会出现问题,因为您仅考虑点沿y轴的偏差,因此强烈赞成x值接近参考点的点。

关于python - 从特征中排除分散点,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/25645459/

10-13 02:07