问题描述
说,我已经对数据集进行了聚类,并且有10个聚类。这些群集是不重叠的。但是现在假设我更改了所有数据点的某些功能,然后再次进行聚类。现在,我还有10个群集。
如果再重复说3次,最后我将拥有50个群集。
每个群集都有与之相关的分数,该分数是根据其成分数据点计算得出的。
Say, I have done clustering on my dataset and have 10 clusters. These clusters are non-overlapping. But now assume I changed some feature in all my data points and do clustering again. Now I have 10 more clusters.If I repeat it say 3 more times, at the end I would have 50 clusters.Each cluster has a score associated with it that is calculated from its constituents data points.
这50个群集现在具有重叠的数据点。我想从这50个群集中选择所有可能的不重叠群集,但总得分最高。
These 50 clusters now have overlapping data points. I want to select all possible non-overlapping clusters out of these 50 clusters but with the highest total score.
一种方法是一种贪婪方法,我根据该方法对群集进行排序分数从最高到最小。然后选择得分最高的集群。然后,从那里继续选择具有非重叠数据点的群集,这些群集已经选择了群集。
One way is a greedy method where I sort the clusters based on the score from highest to smallest. Then select highest scoring cluster. Then from there keep selecting clusters that have non-overlapping data points with already selected clusters. But it doesn't seem to be optimal solution although it is fast.
示例:说我有5个具有以下分数的聚类:
Example: say I have 5 clusters with following scores:
C1 =(A,B,C,D,E,F)得分= 10
C1 = (A,B,C,D,E,F) Score = 10
C2 =(A,B,C)得分= 6
C2 = (A,B,C) Score = 6
C3 =(D,E,F)得分= 6
C3 = (D,E,F) Score = 6
C4 =(G,H, I,J)分数= 5
C4 = (G,H,I,J) Score = 5
C5 =(K,L)分数= 7
C5 = (K,L) Score = 7
贪婪的方法将返回{C1,C4,C5},总得分为10 + 5 + 7 = 22,而更好的选择是{C2,C3,C4,C5},总得分为6 + 6 + 5 + 7 = 24。
The greedy approach will return {C1, C4, C5} with a total score of 10+5+7=22, whereas better option is {C2, C3, C4, C5} with a total score of 6+6+5+7=24.
我正在寻找另一种方法,该方法可以提供比上述贪婪方法更好的解决方案。
I am looking for another method that can give an optimal solution or better solution than above mentioned greedy approach.
推荐答案
您可以使用运筹学技术来解决此问题。
You can solve this using operations research techniques.
像使用set-partitioning问题一样用$ p建模此问题
Model this problem like a set-partitioning problem with
Objective function: maximize score
Constraints: each data point is covered exactly once
然后使用MIP求解器或任何其他技术(例如Hill hiller,遗传算法等)对其进行求解。您的问题的规模很小,因此可以通过任何优化算法解决。我也在研究类似的问题,但在航空机组的调度领域。我的问题的规模是如此之大,以至于约4500个航班的航班时刻表(相当于您的数据点)的可能的机组时刻表(相当于您的集群)>成千上万个组合;)
and then solve it using a MIP solver or any other technique (such as Hill climber, Genetic algorithm etc). The scale of your problem is very small, hence solvable by any optimization algorithm. I am also working on a similar problem but in airline crew scheduling domain. The scale of my problem is so huge that the possible crew schedules (equivalent to your clusters) are >zillion combinations for a flight schedule of ~4500 flights (equivalent to your data points) ;)
我已经用python编写了您的示例,并且使用了Gurobi的MIP求解器,可免费用于学术用途。您也可以使用其他MIP求解器。
I have coded your example in python and I have used a MIP solver from Gurobi, available free of cost for academic use. You can use other MIP solvers too.
以下是python代码:
Here is the python code:
from gurobipy import *
import string
data_points = string.ascii_uppercase[:12]
clusters = []
clusters.append(string.ascii_uppercase[:6])
clusters.append(string.ascii_uppercase[:3])
clusters.append(string.ascii_uppercase[3:6])
clusters.append(string.ascii_uppercase[6:10])
clusters.append(string.ascii_uppercase[10:12])
matrix = {}
for dp in string.ascii_uppercase[:12]:
matrix[dp] = [0]*5
for i in range(0, len(clusters)):
for dp in clusters[i]:
matrix[dp][i] = 1
cost = [10, 6, 6, 5, 7]
# Gurobi MIP model
m = Model("Jitin's cluster optimization problem")
m.params.outputflag = 1
x = m.addVars(len(clusters), vtype=GRB.INTEGER, name='x')
indices = range(0, len(clusters))
coef_x = dict()
obj = 0.0
for i in indices:
coef_x[i] = cost[i]
obj += coef_x[i] * x[i]
m.setObjective(obj, GRB.MAXIMIZE)
flight_in_pairings = [[] for i in range(0, 4228)]
for dp,j in zip(data_points, range(0, len(data_points))):
m.addConstr(sum([x[i]*matrix[dp][i] for i in range(0, len(matrix[dp]))]) == 1, "C"+str(j))
m.optimize()
print('Final Obj:', m.objVal)
m.write('results.sol')
代码的输出:
# Solution for model Jitin's cluster optimization problem
# Objective value = 24
x[0] 0
x[1] 1
x[2] 1
x[3] 1
x[4] 1
这篇关于选择不重叠的最佳质量集群的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!