Problem description
I am using H2O (H2O Flow, in particular) to do K-means clustering. I selected the "standardize" checkbox, which is supposed to standardize columns before computing distances. The model trained fine and I looked at the results, which include "within_cluster_sum_of_squares". My question: is "within_cluster_sum_of_squares" the distance BEFORE or AFTER standardization? I would expect it to be the distance after standardization, but the values I see are large, as if they were computed before standardization (I am not sure, though). Any ideas? Thanks.
When you select standardize for K-Means in Flow, it does standardize the columns before computing the distances (setting shown below).
So, to answer your question: "within_cluster_sum_of_squares" is the distance calculation computed after standardization is performed.
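To make the difference concrete, here is a minimal sketch in plain Python (not H2O's implementation) that computes a within-cluster sum of squares on the same toy data before and after standardizing the columns. The data, the cluster labels, and the helper names `standardize_columns` and `within_cluster_ss` are made up for illustration, and the sample standard deviation is an assumption; H2O's exact convention is not confirmed here.

```python
from statistics import mean, stdev

def standardize_columns(rows):
    """Center each column to mean 0 and scale it to sample standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]  # assumption: sample (n - 1) standard deviation
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def within_cluster_ss(rows, labels):
    """Sum of squared Euclidean distances from each row to its cluster mean."""
    total = 0.0
    for k in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == k]
        center = [mean(c) for c in zip(*members)]
        total += sum((x - c) ** 2 for r in members for x, c in zip(r, center))
    return total

# Hypothetical toy data: one large-scale column and one small-scale column.
rows = [[1000.0, 1.0], [1200.0, 2.0], [5000.0, 9.0], [5200.0, 10.0]]
labels = [0, 0, 1, 1]  # hypothetical cluster assignment

wcss_raw = within_cluster_ss(rows, labels)
wcss_std = within_cluster_ss(standardize_columns(rows), labels)
print(wcss_raw, wcss_std)
```

On raw data the large-scale column dominates the sum. On standardized data every column contributes on the same scale, but the value is still a sum over all rows and columns, so it grows with the size of the dataset.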
One reason your metric value may seem too big is that you may have expected the H2O-3 K-Means standardize option to perform normalization (e.g. normalize = x / ||x||) rather than standardization (e.g. standardize = (x - mean) / sd).
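The two transformations can be compared with a short sketch. This is plain Python, not H2O code; the function names `standardize` and `normalize` are illustrative, and the sample standard deviation is an assumption.

```python
import math
from statistics import mean, stdev

def standardize(column):
    """(x - mean) / sd: the result has mean 0 and standard deviation 1."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

def normalize(vector):
    """x / ||x||: the result has Euclidean length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

values = [3.0, 4.0, 12.0]  # hypothetical data
z = standardize(values)
u = normalize(values)
print(z)
print(u)
```

Normalized entries are never larger than 1 in absolute value, while standardized values are not bounded that way, so squared distances between standardized rows can be well above what one would expect from unit-length vectors.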
From the K-Means documentation, here is the overview of the standardization option:
standardize: Enable this option to standardize the numeric columns to have a mean of zero and unit variance. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option is enabled by default.
Note: If standardization is enabled, each column of numeric data is centered and scaled so that its mean is zero and its standard deviation is one before the algorithm is used. At the end of the process, the cluster centers on both the standardized scale (centers_std) and the de-standardized scale (centers) are displayed. To de-standardize the centers, the algorithm multiplies by the original standard deviation of the corresponding column and adds the original mean. Enabling standardization is mathematically equivalent to using h2o.scale in R with center = TRUE and scale = TRUE on the numeric columns. Therefore, there will be no discernible difference if standardization is enabled or not for K-Means, since H2O calculates unstandardized centroids.
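The de-standardization step described in the note (multiply by the original standard deviation, then add the original mean) can be sketched as follows. This is an illustration in plain Python, not H2O's code; `destandardize_center`, the toy data, and the choice of the first two rows as a cluster are all hypothetical.

```python
from statistics import mean, stdev

def destandardize_center(center_std, col_means, col_sds):
    """Map a center from the standardized scale back to the original units."""
    return [c * s + m for c, m, s in zip(center_std, col_means, col_sds)]

# Hypothetical toy data with two columns on very different scales.
rows = [[1000.0, 1.0], [1200.0, 2.0], [5000.0, 9.0], [5200.0, 10.0]]
cols = list(zip(*rows))
col_means = [mean(c) for c in cols]
col_sds = [stdev(c) for c in cols]  # assumption: sample standard deviation

# Standardize the rows, then treat the first two rows as one cluster.
std_rows = [[(x - m) / s for x, m, s in zip(r, col_means, col_sds)] for r in rows]
center_std = [mean(c) for c in zip(*std_rows[:2])]

# Back on the original scale, this should match the mean of the raw rows.
center = destandardize_center(center_std, col_means, col_sds)
print(center_std, center)
```

Because standardization is an affine map applied per column, the de-standardized center coincides with the mean of the cluster's rows in original units, here approximately [1100.0, 1.5].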