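I have a set of data points that are supposed to sit on a locus and follow a pattern, but some points scatter away from the main locus. I need a clean locus for later analysis, so I would like to discard those scattered points. The blue points are more or less the scatter I want to find and exclude in a systematic way, without picking them out by hand.
I was thinking of using something like Nearest Neighbors Regression, but I am not sure whether it is the best approach, and I am not familiar enough with it to know how to implement it so that it gives a proper result. By the way, I want to do this without any fitting procedure.
The transposed version of the data is as follows: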
X=array([[ 0.87 , -0.01 , 0.575, 1.212, 0.382, 0.418, -0.01 , 0.474,
0.432, 0.702, 0.574, 0.45 , 0.334, 0.565, 0.414, 0.873,
0.381, 1.103, 0.848, 0.503, 0.27 , 0.416, 0.939, 1.211,
1.106, 0.321, 0.709, 0.744, 0.309, 0.247, 0.47 , -0.107,
0.925, 1.127, 0.833, 0.963, 0.385, 0.572, 0.437, 0.577,
0.461, 0.474, 1.046, 0.892, 0.313, 1.009, 1.048, 0.349,
1.189, 0.302, 0.278, 0.629, 0.36 , 1.188, 0.273, 0.191,
-0.068, 0.95 , 1.044, 0.776, 0.726, 1.035, 0.817, 0.55 ,
0.387, 0.476, 0.473, 0.863, 0.252, 0.664, 0.365, 0.244,
0.238, 1.203, 0.339, 0.528, 0.326, 0.347, 0.385, 1.139,
0.748, 0.879, 0.324, 0.265, 0.328, 0.815, 0.38 , 0.884,
0.571, 0.416, 0.485, 0.683, 0.496, 0.488, 1.204, 1.18 ,
0.465, 0.34 , 0.335, 0.447, 0.28 , 1.02 , 0.519, 0.335,
1.037, 1.126, 0.323, 0.452, 0.201, 0.321, 0.285, 0.587,
0.292, 0.228, 0.303, 0.844, 0.229, 1.077, 0.864, 0.515,
0.071, 0.346, 0.255, 0.88 , 0.24 , 0.533, 0.725, 0.339,
0.546, 0.841, 0.43 , 0.568, 0.311, 0.401, 0.212, 0.691,
0.565, 0.292, 0.295, 0.587, 0.545, 0.817, 0.324, 0.456,
0.267, 0.226, 0.262, 0.338, 1.124, 0.373, 0.814, 1.241,
0.661, 0.229, 0.416, 1.103, 0.226, 1.168, 0.616, 0.593,
0.803, 1.124, 0.06 , 0.573, 0.664, 0.882, 0.286, 0.139,
1.095, 1.112, 1.167, 0.589, 0.3 , 0.578, 0.727, 0.252,
0.174, 0.317, 0.427, 1.184, 0.397, 0.43 , 0.229, 0.261,
0.632, 0.938, 0.576, 0.37 , 0.497, 0.54 , 0.306, 0.315,
0.335, 0.24 , 0.344, 0.93 , 0.134, 0.4 , 0.223, 1.224,
1.187, 1.031, 0.25 , 0.53 , -0.147, 0.087, 0.374, 0.496,
0.441, 0.884, 0.971, 0.749, 0.432, 0.582, 0.198, 0.615,
1.146, 0.475, 0.595, 0.304, 0.416, 0.645, 0.281, 0.576,
1.139, 0.316, 0.892, 0.648, 0.826, 0.299, 0.381, 0.926,
0.606],
[-0.154, -0.392, -0.262, 0.214, -0.403, -0.363, -0.461, -0.326,
-0.349, -0.21 , -0.286, -0.358, -0.436, -0.297, -0.394, -0.166,
-0.389, 0.029, -0.124, -0.335, -0.419, -0.373, -0.121, 0.358,
0.042, -0.408, -0.189, -0.213, -0.418, -0.479, -0.303, -0.645,
-0.153, 0.098, -0.171, -0.066, -0.368, -0.273, -0.329, -0.295,
-0.362, -0.305, -0.052, -0.171, -0.406, -0.102, 0.011, -0.375,
0.126, -0.411, -0.42 , -0.27 , -0.407, 0.144, -0.419, -0.465,
-0.036, -0.099, 0.007, -0.167, -0.205, -0.011, -0.151, -0.267,
-0.368, -0.342, -0.299, -0.143, -0.42 , -0.232, -0.368, -0.417,
-0.432, 0.171, -0.388, -0.319, -0.407, -0.379, -0.353, 0.043,
-0.211, -0.14 , -0.373, -0.431, -0.383, -0.142, -0.345, -0.144,
-0.302, -0.38 , -0.337, -0.2 , -0.321, -0.269, 0.406, 0.223,
-0.322, -0.395, -0.379, -0.324, -0.424, 0.01 , -0.298, -0.386,
0.018, 0.157, -0.384, -0.327, -0.442, -0.388, -0.387, -0.272,
-0.397, -0.415, -0.388, -0.106, -0.504, 0.034, -0.153, -0.32 ,
-0.271, -0.417, -0.417, -0.136, -0.447, -0.279, -0.225, -0.372,
-0.316, -0.161, -0.331, -0.261, -0.409, -0.338, -0.437, -0.242,
-0.328, -0.403, -0.433, -0.274, -0.331, -0.163, -0.361, -0.298,
-0.392, -0.447, -0.429, -0.388, 0.11 , -0.348, -0.174, 0.244,
-0.182, -0.424, -0.319, 0.088, -0.547, 0.189, -0.216, -0.228,
-0.17 , 0.125, -0.073, -0.266, -0.234, -0.108, -0.395, -0.395,
0.131, 0.074, 0.514, -0.235, -0.389, -0.288, -0.22 , -0.416,
-0.777, -0.358, -0.31 , 0.817, -0.363, -0.328, -0.424, -0.416,
-0.248, -0.093, -0.28 , -0.357, -0.348, -0.298, -0.384, -0.394,
-0.362, -0.415, -0.349, -0.08 , -0.572, -0.07 , -0.423, 0.359,
0.4 , 0.099, -0.426, -0.252, -0.697, -0.508, -0.348, -0.254,
-0.307, -0.116, -0.029, -0.201, -0.302, -0.25 , -0.44 , -0.233,
0.274, -0.295, -0.223, -0.398, -0.298, -0.209, -0.389, -0.247,
0.225, -0.395, -0.124, -0.237, -0.104, -0.361, -0.335, -0.083,
-0.254]])
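Best answer
I would like to offer the following procedure, not necessarily as a perfect answer, but as a starting point that you can improve or use to develop something similar.
What is happening? The procedure bins the points according to their x-values. For each group (bin), the average y-value is computed, and the point that deviates the most from it, and by more than some predefined limit, is discarded. Then the average is computed again, and so on. When there are no more points to discard, the next bin is considered. The comments in the code will (hopefully) give a clearer explanation.
Here is the code: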
def discard(X):
    """
    Group points together in x-bins; discard points for every bin which
    deviate more than dy from the average in an iterative procedure.
    """
    dx, dy = 0.1, 0.1  # dx: bin size; dy: max. deviation
    points = sorted(zip(X[0], X[1]), key=lambda p: p[0])  # sort the points by x
    xx = points[0][0]  # the smallest x-value
    xmax = points[-1][0]  # the greatest x-value
    while xx < xmax:  # loop over all bins
        loop = True
        while loop:
            tmp = [p for p in points if p[0] >= xx and p[0] < xx+dx]  # all points in the current bin
            try:
                av = sum([p[1] for p in tmp]) / len(tmp)  # the average y-value
            except ZeroDivisionError:  # no points within this bin, continue with next bin
                break
            # all points which deviate more than dy from the average, sorted by their deviation
            dev = sorted([p for p in tmp if abs(p[1]-av) > dy], key=lambda p: abs(p[1]-av))
            try:
                points.remove(dev[-1])  # discard the point with the greatest deviation
            except IndexError:
                loop = False  # if no point is deviating more than dy continue with the next bin
        xx += dx
    return [ [p[0] for p in points], [p[1] for p in points] ]
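The function above returns only the surviving points. If you also want to see which points were discarded (for example, to draw them in a different color), a variant that returns a boolean mask may be handier. The following is a minimal sketch of the same bin-wise procedure using NumPy; the name discard_mask, its keyword arguments, and the synthetic test data are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np

def discard_mask(x, y, dx=0.1, dy=0.1):
    """Return a boolean mask: True for points kept by the bin-wise procedure."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.ones(x.size, dtype=bool)
    lo = x.min()
    while lo < x.max():                          # loop over all bins
        in_bin = (x >= lo) & (x < lo + dx)
        while True:
            idx = np.flatnonzero(in_bin & keep)  # surviving points in this bin
            if idx.size == 0:                    # empty bin, go to the next one
                break
            dev = np.abs(y[idx] - y[idx].mean())
            worst = dev.argmax()
            if dev[worst] <= dy:                 # nothing deviates more than dy
                break
            keep[idx[worst]] = False             # drop the worst point, then re-average
        lo += dx
    return keep

# Synthetic stand-in for the real data (hypothetical): a gentle trend,
# small noise, and three points pushed well away from the locus.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = 0.5 * x - 0.4 + rng.normal(0.0, 0.01, x.size)
y[[20, 90, 150]] += 0.5
keep = discard_mask(x, y)
```

With the real data you would call discard_mask(X[0], X[1]); X[0][keep] and X[1][keep] are then the retained points, and ~keep selects the discarded ones.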
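The result obviously depends on the choice of dx and dy. Here are some examples (the blue points are the discarded ones in each case).
For dx, dy = 0.1, 0.1: as you can see, many points in the right tail are discarded because the graph is steeper there, so a larger dy would be preferable.
For dx, dy = 0.10, 0.15: because dy is larger, fewer points are discarded in this case. However, it is important to make sure that each bin contains enough points; otherwise the procedure may fail and discard the wrong points, as you can observe in the left tail of the next plot (for dx, dy = 0.09, 0.15):
So how do you know which dx, dy to use? The best solution is probably to keep them variable. For example, choose dx so that each bin holds a certain minimum number of points, which avoids bad discards like those in the previous example. dy could be calculated from the slope of the curve, that is, the difference between the averages of two neighboring bins divided by the difference between their bin centers, so that a greater slope leads to a greater dy.
This procedure is somewhat similar to a nearest neighbors algorithm, except that it does not, for example, run the check separately for each point. This also leaves room for modification: instead of choosing dx, you can choose n nearest neighbors, so that the interval [x-dx, x+dx] contains n points, and apply the procedure above. Using an actual nearest neighbor algorithm might be problematic, because you only consider the points' deviation along the y-axis and would therefore strongly favor points whose x-values are close to the reference point.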
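One possible way to make dx and dy variable is sketched below. It is only an illustration under stated assumptions, not a tested recipe: the bins are built to hold roughly n points each, the slope is estimated from neighboring bins with np.gradient, and the tolerance is taken as dy0 + k * |slope| * bin width. The name discard_adaptive, the parameters n, dy0 and k, the exact tolerance formula, and the use of the median as the bin level for the slope estimate (so that outliers do not distort it) are all choices made here, not something the procedure above prescribes. The sketch also assumes the x-values are spread out enough that neighboring bin centers differ.

```python
import numpy as np

def discard_adaptive(x, y, n=20, dy0=0.05, k=1.0):
    """Equal-count bins of about n points; tolerance grows with the local slope."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    # Split the x-sorted indices into bins of roughly n points each.
    bins = np.array_split(order, max(1, x.size // n))
    cx = np.array([x[b].mean() for b in bins])       # bin centers
    cy = np.array([np.median(y[b]) for b in bins])   # robust bin levels (assumption)
    # Slope between neighboring bins: difference of levels over difference of centers.
    slope = np.gradient(cy, cx) if len(bins) > 1 else np.zeros(1)
    keep = np.ones(x.size, dtype=bool)
    for b, s in zip(bins, slope):
        width = x[b].max() - x[b].min()
        dy = dy0 + k * abs(s) * width                # assumed tolerance formula
        while True:
            idx = b[keep[b]]                         # surviving points in this bin
            if idx.size == 0:
                break
            dev = np.abs(y[idx] - y[idx].mean())
            worst = dev.argmax()
            if dev[worst] <= dy:
                break
            keep[idx[worst]] = False                 # drop the worst point, then re-average
    return keep

# Synthetic stand-in (hypothetical): a curve whose slope increases to the right.
rng = np.random.default_rng(1)
xs = np.linspace(0.0, 1.2, 240)
ys = 0.5 * xs**2 - 0.4 + rng.normal(0.0, 0.005, xs.size)
ys[[10, 120, 230]] += 0.5
kept = discard_adaptive(xs, ys)
```

Because each bin holds the same number of points, sparse regions get wider bins instead of nearly empty ones, and the slope term keeps the steep right tail from being trimmed too aggressively. The values of n, dy0 and k would need tuning against the real data.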
On the topic of python - excluding scattered points from a locus, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25645459/