Problem description
I am performing binary classification on a partially labeled dataset. I have a reliable estimate of its 1's, but not of its 0's.
From the sklearn KMeans documentation:
init : {‘k-means++’, ‘random’ or an ndarray}
Method for initialization, defaults to ‘k-means++’:
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
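As a minimal sketch of what the quoted parameter expects, the snippet below passes a fully specified array of initial centers to KMeans. The toy data `X` and the array `init_centers` are made up for illustration. `n_init=1` is set because explicit centers leave only one initialization to run.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): two well-separated blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# One row per cluster, one column per feature: shape (n_clusters, n_features).
init_centers = np.array([[0.0, 0.0], [5.0, 5.0]])

# n_init=1: with explicit centers there is only one initialization to run.
km = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(X)
labels = km.labels_
```

Both rows of `init_centers` must be supplied, which is why a single known centroid is not enough on its own.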
I would like to pass an ndarray, but I only have 1 reliable centroid, not 2.
Is there a way to maximize the entropy between the first K-1 centroids and the Kth? Alternatively, is there a way to manually initialize K-1 centroids and use k-means++ for the remaining one?
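One way to read the second option is to keep the known centroids fixed and draw the missing ones with the D² sampling step of k-means++, where each data point is picked with probability proportional to its squared distance from the nearest centroid chosen so far. The sketch below assumes that reading. `sample_remaining_centroids` is a hypothetical helper written for this example, not a scikit-learn function, and it assumes at least one data point does not coincide with a known centroid.

```python
import numpy as np

def sample_remaining_centroids(data, known_centroids, k=1, seed=None):
    """Draw k extra centroids with k-means++ style D^2 sampling.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    centroids = np.atleast_2d(np.asarray(known_centroids, dtype=float))
    new = []
    for _ in range(k):
        # Squared distance from every point to its nearest current centroid.
        diffs = data[:, None, :] - centroids[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Pick a data point with probability proportional to D^2.
        idx = rng.choice(len(data), p=d2 / d2.sum())
        new.append(data[idx])
        centroids = np.vstack([centroids, data[idx]])
    return np.array(new)

# Toy example (made-up data): one known centroid, one to be sampled.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
known = np.array([0.0, 0.0])
extra = sample_remaining_centroids(X, known, k=1, seed=0)
init = np.vstack([known, extra])  # shape (n_clusters, n_features)
```

Because the draw is random, distant points are favored but not guaranteed, which makes this less sensitive to a single outlier than always taking the farthest point.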
=======================================================
Related questions:
This question seeks to define K centroids with n-1 features. (I want to define K-1 centroids with n features.)
Here is a description of what I want, but one of the developers interpreted it as a bug, and it is "easily implement[able]".
Recommended answer
I'm reasonably confident this works as intended, but please correct me if you spot an error (cobbled together from GeeksforGeeks):
import numpy as np
import pandas as pd

def distance(p1, p2):
    # Squared Euclidean distance between two points
    return np.sum((p1 - p2)**2)

def find_remaining_centroid(data, known_centroids, k = 1):
    '''
    Find the k remaining centroids by farthest-point selection,
    a deterministic variant of K-means++ seeding
    inputs:
        data - Numpy array (or DataFrame) containing the feature space
        known_centroids - Numpy array containing the location of one or multiple known centroids
        k - remaining centroids to be found
    '''
    # Perform casting if necessary
    if isinstance(data, pd.DataFrame):
        data = np.array(data)
    n_points = data.shape[0]
    # Initialize centroids list with the known centroids
    if known_centroids.ndim > 1:
        centroids = [cent for cent in known_centroids]
    else:
        centroids = [np.array(known_centroids)]
    # Compute the remaining k centroids
    for c_id in range(k):
        ## distance of each data point from its nearest centroid
        dist = np.empty(n_points)
        for i in range(n_points):
            point = data[i, :]
            d = np.inf
            ## compute distance of 'point' from each of the previously
            ## selected centroids and store the minimum distance
            for j in range(len(centroids)):
                temp_dist = distance(point, centroids[j])
                d = min(d, temp_dist)
            dist[i] = d
        ## select data point with maximum distance as our next centroid
        next_centroid = data[np.argmax(dist), :]
        centroids.append(next_centroid)
    # Return only the newly found centroids (a list of k arrays)
    return centroids[-k:]
Usage:
# For finding a third centroid:
third_centroid = find_remaining_centroid(X_train, np.array([presence_seed, absence_seed]), k = 1)
# For finding the second centroid:
second_centroid = find_remaining_centroid(X_train, presence_seed, k = 1)
Where presence_seed and absence_seed are known centroid locations.
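To feed the result back into KMeans, the known seed and the newly found centroid can be stacked into an array of shape (n_clusters, n_features) and passed as `init`. The following is only a sketch under assumptions: `X_train` and `presence_seed` are replaced by made-up toy values, and a one-line vectorized farthest-point step stands in for the `find_remaining_centroid(..., k = 1)` call.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins (assumptions): X_train and presence_seed are made up here.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 3)),
    rng.normal(loc=4.0, scale=0.5, size=(50, 3)),
])
presence_seed = np.zeros(3)

# Compact stand-in for find_remaining_centroid(..., k = 1): the data point
# farthest (in squared Euclidean distance) from the known centroid.
sq_dist = ((X_train - presence_seed) ** 2).sum(axis=1)
second_centroid = X_train[np.argmax(sq_dist)]

# Stack into shape (n_clusters, n_features) and use as the initial centers.
init_centers = np.vstack([presence_seed, second_centroid])
km = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(X_train)
```

Note that `init` only sets the starting positions: KMeans still updates both centers while fitting, so the known centroid is a starting guess rather than a fixed constraint.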