I am using PySpark to perform clustering on a dataset. To find the number of clusters, I run the clustering over a range of values (2, 20) and compute the WSSE (within-set sum of squared errors) for each value of k. Here I noticed something unusual. In my understanding, the WSSE decreases monotonically as you increase the number of clusters, but my results say otherwise. I am showing the WSSE for the first few values of k only.

Results from Spark

For k = 002 WSSE is 255318.793358
For k = 003 WSSE is 209788.479560
For k = 004 WSSE is 208498.351074
For k = 005 WSSE is 142573.272672
For k = 006 WSSE is 154419.027612
For k = 007 WSSE is 115092.404604
For k = 008 WSSE is 104753.205635
For k = 009 WSSE is 98000.985547
For k = 010 WSSE is 95134.137071

If you look at the WSSE values for k=5 and k=6, you will see that the WSSE has increased. I turned to sklearn to check whether I get similar results there. The code I used for Spark and sklearn is in the appendix section towards the end of the post, and I have tried to use the same parameters for the Spark and sklearn KMeans models. The following are the results from sklearn, and they are what I expected: monotonically decreasing.

Results from sklearn

For k = 002 WSSE is 245090.224247
For k = 003 WSSE is 201329.888159
For k = 004 WSSE is 166889.044195
For k = 005 WSSE is 142576.895154
For k = 006 WSSE is 123882.070776
For k = 007 WSSE is 112496.692455
For k = 008 WSSE is 102806.001664
For k = 009 WSSE is 95279.837212
For k = 010 WSSE is 89303.574467

I have no idea why the Spark WSSE values increase. I tried using different datasets and found similar behavior there as well. Am I going wrong somewhere? Any clues would be great.

APPENDIX

The dataset is located here.

Read the data and set up the variables:

```python
# get data
import pandas as pd

url = "https://raw.githubusercontent.com/vectosaurus/bb_lite/master/3.0%20data/adult_comp_cont.csv"

df_pandas = pd.read_csv(url)
df_spark = sqlContext.createDataFrame(df_pandas)
target_col = 'high_income'
numeric_cols = [i for i in df_pandas.columns if i != target_col]

k_min = 2    # 2 is inclusive
k_max = 21   # 21 is exclusive; will fit up to k = 20
max_iter = 1000
seed = 42
```

This is the code I used for getting the sklearn results:

```python
from sklearn.cluster import KMeans as KMeans_SKL
from sklearn.preprocessing import StandardScaler as StandardScaler_SKL

# standardize the numeric columns before clustering
ss = StandardScaler_SKL(with_std=True, with_mean=True)
ss.fit(df_pandas.loc[:, numeric_cols])
df_pandas_scaled = pd.DataFrame(ss.transform(df_pandas.loc[:, numeric_cols]))

wsse_collect = []

for i in range(k_min, k_max):
    km = KMeans_SKL(random_state=seed, max_iter=max_iter, n_clusters=i)
    _ = km.fit(df_pandas_scaled)
    wsse = km.inertia_
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse))
    wsse_collect.append(wsse)
```

This is the code I used for getting the Spark results:

```python
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.clustering import KMeans

standard_scaler_inpt_features = 'ss_features'
kmeans_input_features = 'features'
kmeans_prediction_col = 'prediction'

# assemble the numeric columns into a single vector column
assembler = VectorAssembler(inputCols=numeric_cols, outputCol=standard_scaler_inpt_features)
assembled_df = assembler.transform(df_spark)

# standardize, matching the sklearn StandardScaler settings
scaler = StandardScaler(inputCol=standard_scaler_inpt_features, outputCol=kmeans_input_features,
                        withStd=True, withMean=True)
scaler_model = scaler.fit(assembled_df)
scaled_data = scaler_model.transform(assembled_df)

wsse_collect_spark = []

for i in range(k_min, k_max):
    km = KMeans(featuresCol=kmeans_input_features, predictionCol=kmeans_prediction_col,
                k=i, maxIter=max_iter, seed=seed)
    km_fit = km.fit(scaled_data)
    wsse_spark = km_fit.computeCost(scaled_data)
    wsse_collect_spark.append(wsse_spark)
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse_spark))
```

UPDATE

Following @Michail N's answer, I changed the tol and maxIter values for the Spark KMeans model. I re-ran the code and saw the same behavior repeat. But since Michail mentioned that Spark MLlib in fact implements k-means||, I increased the number of initSteps by a factor of 50 and re-ran the process, which gave the results below.

For k = 002 WSSE is 255318.718684
For k = 003 WSSE is 212364.906298
For k = 004 WSSE is 185999.709027
For k = 005 WSSE is 168616.028321
For k = 006 WSSE is 123879.449228
For k = 007 WSSE is 113646.930680
For k = 008 WSSE is 102803.889178
For k = 009 WSSE is 97819.497501
For k = 010 WSSE is 99973.198132
For k = 011 WSSE is 89103.510831
For k = 012 WSSE is 84462.110744
For k = 013 WSSE is 78803.619605
For k = 014 WSSE is 82174.640611
For k = 015 WSSE is 79157.287447
For k = 016 WSSE is 75007.269644
For k = 017 WSSE is 71610.292172
For k = 018 WSSE is 68706.739299
For k = 019 WSSE is 65440.906151
For k = 020 WSSE is 66396.106118

With the larger initSteps, the increase in WSSE from k=5 to k=6 disappears. The behavior still persists in a few other places (for example from k=13 to k=14), but at least I now know where it is coming from.

Best answer

There is nothing wrong with the WSSE not decreasing monotonically. In theory the WSSE must decrease monotonically if the clustering is optimal, meaning the clustering with the best WSSE among all possible clusterings with k centers. The problem is that k-means is not necessarily able to find the optimal clustering for a given k. Its iterative process can converge from a random starting point to a local minimum, which may be good but is not optimal.

There are methods such as k-means++ and k-means|| whose centroid-selection variants are more likely to choose diverse, well-separated centroids and therefore lead more reliably to a good clustering, and Spark MLlib in fact implements k-means||. However, there is still an element of randomness in the selection, so an optimal clustering cannot be guaranteed. The random starting set of centroids chosen for k=6 perhaps led to a particularly suboptimal clustering, or the algorithm may have stopped early, before it reached its local optimum.

You can improve things by changing the KMeans parameters manually. The algorithm has a threshold, controlled via tol, for the minimum amount of cluster-centroid movement that is considered significant; a lower value means the k-means algorithm will let the centroids keep moving for longer. Increasing the maximum number of iterations with maxIter also prevents it from stopping too early, at the cost of possibly more computation.

So my advice is to re-run the clustering with

```python
...
# increase from default 20
max_iter = 40
# decrease from default 0.0001
tol = 0.00001

km = KMeans(featuresCol=kmeans_input_features, predictionCol=kmeans_prediction_col,
            k=i, maxIter=max_iter, seed=seed, tol=tol)
...
```
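Beyond raising maxIter and lowering tol, another way to smooth out the occasional bump is to fit several KMeans models per k with different seeds and keep the lowest cost, together with a larger initSteps for the k-means|| initialization. This is not part of the original answer; the following is only a minimal sketch that reuses the variable names from the question (scaled_data, kmeans_input_features, kmeans_prediction_col, k_min, k_max, seed), and n_restarts plus the specific parameter values are illustrative assumptions.

```python
from pyspark.ml.clustering import KMeans

n_restarts = 5   # assumed number of random restarts per k
init_steps = 5   # default is 2; more steps gives k-means|| a better initial set of centroids
max_iter = 50    # default is 20
tol = 1e-5       # default is 1e-4; lower lets centroids keep moving longer

best_wsse = []
for i in range(k_min, k_max):
    costs = []
    for s in range(n_restarts):
        km = KMeans(featuresCol=kmeans_input_features,
                    predictionCol=kmeans_prediction_col,
                    k=i, initMode='k-means||', initSteps=init_steps,
                    maxIter=max_iter, tol=tol, seed=seed + s)
        model = km.fit(scaled_data)
        # computeCost is available in Spark 2.x; in Spark 3 use
        # model.summary.trainingCost instead
        costs.append(model.computeCost(scaled_data))
    best_wsse.append(min(costs))
    print('For k = {i:03d} best WSSE over {n} runs is {wsse:10f}'
          .format(i=i, n=n_restarts, wsse=min(costs)))
```

Keeping the minimum over several restarts makes the reported WSSE curve much more likely to be monotone, because a single unlucky initialization for some k no longer determines the reported value.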