Problem description
Hopefully this can be done with Python! I used two clustering programs on the same data and now have a cluster file from each. I reformatted the files so that they look like this:
Cluster 0:
 Brucellaceae(10)
  Brucella(10)
   abortus(1)
   canis(1)
   ceti(1)
   inopinata(1)
   melitensis(1)
   microti(1)
   neotomae(1)
   ovis(1)
   pinnipedialis(1)
   suis(1)
Cluster 1:
 Streptomycetaceae(28)
  Streptomyces(28)
   achromogenes(1)
   albaduncus(1)
   anthocyanicus(1)
etc.
These files contain bacterial species info. So I have the cluster number (Cluster 0), then right below it the family (Brucellaceae) and the number of bacteria in that family (10). Under that are the genera found in that family (name followed by number, e.g. Brucella(10)) and finally the species in each genus (abortus(1), etc.).
My question: I have 2 files formatted in this way and want to write a program that will look for differences between the two. The only problem is that the two programs cluster in different ways, so two clusters may be the same even if the actual "Cluster Number" is different (so the contents of Cluster 1 in one file might match Cluster 43 in the other file, the only difference being the actual cluster number). So I need something that ignores the cluster number and focuses on the cluster contents.
Is there any way I could compare these 2 files to examine the differences? Is it even possible? Any ideas would be greatly appreciated!
Recommended answer
Given:
file1 = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(3)
  green(1)
   dot(1)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
file2 = '''Cluster 18:
 giant(2)
  red(2)
   brick(1)
   tomato(1)
Cluster 19:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
Is this what you need?
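def parse_file(open_file):
    # Collect one 'family.genus.species' string per species line.
    result = []
    for line in open_file:
        # Depth is the number of leading whitespace characters (one per level).
        indent_level = len(line) - len(line.lstrip())
        if indent_level == 0:
            # A 'Cluster N:' header starts a new cluster; reset the path.
            levels = ['', '', '']
        # Drop the trailing '(count)' to keep only the name.
        item = line.lstrip().split('(', 1)[0]
        levels[indent_level - 1] = item
        if indent_level == 3:
            result.append('.'.join(levels))
    return result

# Cluster numbers never end up in the paths, so the sets compare contents only.
data1 = set(parse_file(file1))
data2 = set(parse_file(file2))

differences = [
    ('common elements', data1 & data2),
    ('missing from file2', data1 - data2),
    ('missing from file1', data2 - data1),
]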
To see the differences:
for desc, items in differences:
    print(desc)
    print()
    for item in items:  # sets are unordered, so the item order may vary
        print('\t' + item)
    print()
This prints:
common elements

	giant.red.brick
	tiny.blue.candy
	tiny.blue.flower

missing from file2

	tiny.green.dot
	giant.red.apple

missing from file1

	giant.red.tomato
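
The comparison above looks at species paths across the whole file, so it does not say which cluster in one file corresponds to which cluster in the other. Here is a minimal sketch of one way to pair clusters by content while ignoring their numbers. It is not part of the original answer: the names parse_clusters, match_clusters, sample_a and sample_b are hypothetical, and it assumes the same layout as the sample data (one indent character per level, species at depth 3).

```python
from collections import defaultdict


def parse_clusters(lines):
    """Map each cluster label to a frozenset of 'family.genus.species' paths.

    Assumes one indent character per level:
    0 = cluster header, 1 = family, 2 = genus, 3 = species.
    """
    clusters = defaultdict(set)
    label = None
    levels = ['', '', '']
    for raw in lines:
        line = raw.rstrip('\n')
        if not line.strip():
            continue  # skip blank lines
        indent = len(line) - len(line.lstrip())
        name = line.strip().split('(', 1)[0]  # drop the '(count)' suffix
        if indent == 0:
            label = name.rstrip(':')
            levels = ['', '', '']
        elif 1 <= indent <= 3:
            levels[indent - 1] = name
            if indent == 3:
                clusters[label].add('.'.join(levels))
    return {label: frozenset(members) for label, members in clusters.items()}


def match_clusters(clusters1, clusters2):
    """Pair clusters whose contents are identical, ignoring their labels."""
    # Frozensets are hashable, so the contents can serve as dictionary keys.
    by_content = {members: label for label, members in clusters2.items()}
    matched, unmatched = [], []
    for label, members in clusters1.items():
        if members in by_content:
            matched.append((label, by_content[members]))
        else:
            unmatched.append(label)
    return matched, unmatched


# Hypothetical sample data in the same layout as file1 and file2 above.
sample_a = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(1)
  blue(1)
   candy(1)'''.split('\n')
sample_b = '''Cluster 7:
 tiny(1)
  blue(1)
   candy(1)
Cluster 8:
 giant(2)
  red(2)
   brick(1)
   tomato(1)'''.split('\n')

matched, unmatched = match_clusters(parse_clusters(sample_a),
                                    parse_clusters(sample_b))
print(matched)    # [('Cluster 1', 'Cluster 7')]
print(unmatched)  # ['Cluster 0']
```

Only exact matches are paired here. If two clusters in the second file had identical contents, only one of them would be kept in the lookup, and near matches (such as Cluster 0 and Cluster 8 in the sample) would need a similarity measure instead of equality.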