Problem description
I am running a PCA on some data using the sklearn library. I then draw a scatter plot of my PC1 and PC2 scores and add a 95% confidence ellipse to the same plot, using the answer to "PCA Hotelling's 95% Python" as my reference. I plot it with pyplot, as shown below (figure: PCA plot with confidence ellipse output).
As you can see, the code works and plots my data as expected; however, the labels overlap heavily. I would like to label only my outliers (the points outside the ellipse defined by the two parametric equations), as those are the only points I am really interested in.
Is there a way to first identify my outliers and then label only them?
Below is my code sample (adapted from the link above):
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

# Labels are taken from the third column of the pca_raw DataFrame
label_buff = pca_raw.iloc[:, 2]
labels = label_buff.tolist()

# Calculate ellipse bounds and plot with scores
theta = np.concatenate((np.linspace(-np.pi, np.pi, 50), np.linspace(np.pi, -np.pi, 50)))
circle = np.array((np.cos(theta), np.sin(theta)))
# Where c and d are PC1 and PC2 training score subset for constructing ellipse
sigma = np.cov(np.array((c, d)))
ed = np.sqrt(scipy.stats.chi2.ppf(0.95, 2))
ell = np.transpose(circle).dot(np.linalg.cholesky(sigma) * ed)
# c and d are reused here: from this point on they hold the 95% ellipse bounds
c, d = np.max(ell[:, 0]), np.max(ell[:, 1])
t = np.linspace(0, 2 * np.pi, 100)
ellipsecos = c * np.cos(t)
ellipsesin = d * np.sin(t)

# a and b are my PC1 and PC2 raw data scores
plt.scatter(a, b, color="orange")
for i, txt in enumerate(labels):
    plt.annotate(txt, (a[i], b[i]), textcoords='offset points', ha='right', va='bottom')
plt.plot(ellipsecos, ellipsesin, color='black')
plt.show()
What I tried: if ellipsecos and ellipsesin contain all the points defining the ellipse, then a and b would have to be greater than those points to lie outside the ellipse. However, I did not get the expected result, so I do not think I have established the outlier condition correctly. I am more familiar with the Cartesian form (where the ellipse equation can be evaluated to check whether a point is inside or outside the ellipse). If anyone could help me establish the outlier condition using the two parametric equations, that would be appreciated:
# where a and b are PC1 and PC2 scores calculated using sklearn library
for a, b in zip(a, b):
    color = 'red'  # non-outlier color
    if (a > ellipsecos.all() & (b > ellipsesin.all()) ):  # condition for being an outlier
        color = 'orange'  # outlier color
    plt.scatter(a, b, color=color)
plt.show()
Any help would be appreciated.
Recommended answer
The pca library may be useful, as it provides outlier detection using the Hotelling T2 and SPE/DmodX approaches.
An example is demonstrated here: https://stackoverflow.com/a/63043840/13730780. If you only want the outlier detection, you can use specific functions such as:
import pca
outliers_hot = pca.hotellingsT2(PCs, alpha=0.05)
outliers_spe = pca.spe_dmodx(PCs, n_std=2)
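If you would rather stay with the ellipse already drawn in the question, the Cartesian form mentioned there can be sketched directly. This is only an illustration under stated assumptions: the ellipse is centred at the origin and axis-aligned (which matches how ellipsecos and ellipsesin are built), and c and d are the semi-axes computed in the question. The helper names outside_ellipse and annotate_outliers and the demo values are hypothetical and come from neither the question nor the pca library.

```python
import numpy as np


def outside_ellipse(a, b, c, d):
    """Return a boolean mask that is True where the point (a, b) lies outside
    the origin-centred, axis-aligned ellipse with semi-axes c (PC1) and d (PC2).

    A point is outside when (a / c)**2 + (b / d)**2 > 1.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a / c) ** 2 + (b / d) ** 2 > 1.0


def annotate_outliers(ax, a, b, labels, mask):
    """Annotate only the points selected by mask on a Matplotlib Axes object."""
    for x, y, txt, is_outlier in zip(a, b, labels, mask):
        if is_outlier:
            ax.annotate(txt, (x, y), xytext=(3, 3),
                        textcoords="offset points", ha="left", va="bottom")


# Made-up scores and semi-axes, purely for illustration
a_demo = np.array([0.0, 1.0, 3.0, -0.5])
b_demo = np.array([0.0, 0.5, 0.2, 2.5])
mask_demo = outside_ellipse(a_demo, b_demo, c=2.0, d=1.0)
print(mask_demo)
```

With the question's variables this would be used as mask = outside_ellipse(a, b, c, d), followed by annotate_outliers(plt.gca(), a, b, labels, mask) in place of the loop that annotates every point. The same mask can also colour the scatter, for example plt.scatter(a, b, c=np.where(mask, "red", "orange")).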