Clustering with DBSCAN: how do you train a model if you don't set the number of clusters in advance?

Problem description

I am using the built-in iris dataset from sklearn for clustering. In KMeans I set the number of clusters in advance, but DBSCAN has no such parameter. How do you train a model if you don't set the number of clusters in advance?

I tried:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    # %matplotlib inline
    from sklearn.cluster import DBSCAN, MeanShift
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split, KFold, cross_val_score
    from sklearn.metrics import accuracy_score, confusion_matrix

    iris = load_iris()
    X = iris.data
    y = iris.target
    dbscan = DBSCAN(eps=0.3, min_samples=10)
    dbscan.fit(X, y)

I am stuck!

Solution

DBSCAN is a clustering algorithm and, as such, it does not employ the labels y.
It is true that you can call its fit method as .fit(X, y) but, according to the docs:

    y : Ignored
        Not used, present here for API consistency by convention.

The other characteristic of DBSCAN is that, in contrast to algorithms such as KMeans, it does not take the number of clusters as an input; instead, it estimates their number by itself.

Having clarified that, let's adapt the documentation demo to the iris data:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn import metrics
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler

    # The true labels are loaded only to evaluate the clustering afterwards.
    X, labels_true = load_iris(return_X_y=True)
    X = StandardScaler().fit_transform(X)

    # Compute DBSCAN
    db = DBSCAN(eps=0.5, min_samples=5)  # default parameter values
    db.fit(X)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_

    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise_ = list(labels).count(-1)

    print('Estimated number of clusters: %d' % n_clusters_)
    print('Estimated number of noise points: %d' % n_noise_)
    print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
    print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
    print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
    print("Adjusted Rand Index: %0.3f"
          % metrics.adjusted_rand_score(labels_true, labels))
    print("Adjusted Mutual Information: %0.3f"
          % metrics.adjusted_mutual_info_score(labels_true, labels))
    print("Silhouette Coefficient: %0.3f"
          % metrics.silhouette_score(X, labels))

Result:

    Estimated number of clusters: 2
    Estimated number of noise points: 17
    Homogeneity: 0.560
    Completeness: 0.657
    V-measure: 0.604
    Adjusted Rand Index: 0.521
    Adjusted Mutual Information: 0.599
    Silhouette Coefficient: 0.486

Let's plot them:

    # Plot result (only the first two standardized features are drawn)
    import matplotlib.pyplot as plt

    # Black removed and is used for noise instead.
    unique_labels = set(labels)
    colors = [plt.cm.Spectral(each)
              for each in np.linspace(0, 1, len(unique_labels))]
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = [0, 0, 0, 1]

        class_member_mask = (labels == k)

        # Core samples of this cluster: large markers.
        xy = X[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                 markeredgecolor='k', markersize=14)

        # Non-core samples of this cluster: small markers.
        xy = X[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                 markeredgecolor='k', markersize=6)

    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.show()

That's it.

As with all clustering algorithms, the usual notions of supervised learning, such as train/test splits, prediction on unseen data, and cross-validation, do not hold here. Such unsupervised methods may be useful in an initial exploratory data analysis (EDA), to give us a general idea about our data. But, as you may have noticed already, the findings from such an analysis are not necessarily useful for supervised problems: here, despite the existence of 3 labels in our iris dataset, the algorithm uncovered only 2 clusters...

... which may of course change, depending on the model parameters. Experiment...
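To make the closing "Experiment..." concrete, here is a minimal sketch of one way to see how the estimated number of clusters and noise points moves with eps. It is not part of the original answer: the helper name summarize_dbscan and the grid of eps values are hypothetical and chosen only for illustration, and min_samples is simply left at the scikit-learn default of 5.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler


def summarize_dbscan(X, eps, min_samples=5):
    """Fit DBSCAN and return (number of clusters, number of noise points).

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    # Noise points carry the label -1 and do not count as a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# Same preprocessing as in the answer: standardized iris features.
X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Assumed grid of eps values, picked only to show the trend.
eps_grid = [0.3, 0.4, 0.5, 0.6, 0.8, 1.0]
results = {eps: summarize_dbscan(X, eps) for eps in eps_grid}

for eps, (n_clusters, n_noise) in results.items():
    print(f"eps={eps:.1f}: {n_clusters} clusters, {n_noise} noise points")
```

With min_samples fixed, a larger eps can only turn noise points into cluster members, never the reverse, so the noise count shrinks as eps grows. The number of clusters has no such guarantee: it can rise or fall as neighbourhoods merge, which is why it is worth inspecting both numbers rather than tuning for a target cluster count.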