本文介绍了在Python中对文本进行聚类的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要对一些文本文档进行聚类,并且一直在研究各种选项.看起来LingPipe可以在不进行事先转换(到向量空间等)的情况下将纯文本聚类,但这是我见过的唯一宣称可以对字符串起作用的工具.

I need to cluster some text documents and have been researching various options. It looks like LingPipe can cluster plain text without prior conversion (to vector space etc), but it's the only tool I've seen that explicitly claims to work on strings.

有没有可以直接将文本聚类的Python工具?如果没有,处理此问题的最佳方法是什么?

Are there any Python tools that can cluster text directly? If not, what's the best way to handle this?

推荐答案

文本聚类的质量主要取决于两个因素:

The quality of text-clustering depends mainly on two factors:

  1. 要聚类的文档之间存在一些相似性概念.例如,通过tfidf-cosine-distance可以很容易地区分向量空间中有关体育和政治的新闻报道.根据这种度量,将产品评论分为好"或差"的难度要大得多.

  1. Some notion of similarity between the documents you want to cluster. For example, it's easy to distinguish between newsarticles about sports and politics in vector space via tfidf-cosine-distance. It's a lot harder to cluster product-reviews in "good" or "bad" based on this measure.

聚类方法本身.你知道会有多少个集群吗?好的,使用kmeans.您不关心准确性,但想显示一个不错的树形结构来导航搜索结果吗?使用某种层次的集群.

The clustering method itself. You know how many cluster there'll be? Ok, use kmeans. You don't care about accuracy but want to show a nice tree-structure for navigation of search-results? Use some kind of hierarchical clustering.

没有文本聚类解决方案,在任何情况下都可以很好地工作.因此,将某些群集软件开箱即用,将数据扔给它,可能还不够.

There is no text-clustering solution, that would work well under any circumstances. And therefore it's probably not enough to take some clustering software out of the box and throw your data at it.

话虽如此,这是我一段时间前使用的一些实验性代码来进行文本聚类.这些文档表示为归一化的tfidf向量,而相似度则用余弦距离来衡量.群集方法本身是 majorclust .

Having said that, here's some experimental code i used some time ago to play around with text-clustering. The documents are represented as normalized tfidf-vectors and the similarity is measured as cosine distance. The clustering method itself is majorclust.

import sys
from math import log, sqrt
from itertools import combinations

def cosine_distance(a, b):
    cos = 0.0
    a_tfidf = a["tfidf"]
    for token, tfidf in b["tfidf"].iteritems():
        if token in a_tfidf:
            cos += tfidf * a_tfidf[token]
    return cos

def normalize(features):
    norm = 1.0 / sqrt(sum(i**2 for i in features.itervalues()))
    for k, v in features.iteritems():
        features[k] = v * norm
    return features

def add_tfidf_to(documents):
    tokens = {}
    for id, doc in enumerate(documents):
        tf = {}
        doc["tfidf"] = {}
        doc_tokens = doc.get("tokens", [])
        for token in doc_tokens:
            tf[token] = tf.get(token, 0) + 1
        num_tokens = len(doc_tokens)
        if num_tokens > 0:
            for token, freq in tf.iteritems():
                tokens.setdefault(token, []).append((id, float(freq) / num_tokens))

    doc_count = float(len(documents))
    for token, docs in tokens.iteritems():
        idf = log(doc_count / len(docs))
        for id, tf in docs:
            tfidf = tf * idf
            if tfidf > 0:
                documents[id]["tfidf"][token] = tfidf

    for doc in documents:
        doc["tfidf"] = normalize(doc["tfidf"])

def choose_cluster(node, cluster_lookup, edges):
    new = cluster_lookup[node]
    if node in edges:
        seen, num_seen = {}, {}
        for target, weight in edges.get(node, []):
            seen[cluster_lookup[target]] = seen.get(
                cluster_lookup[target], 0.0) + weight
        for k, v in seen.iteritems():
            num_seen.setdefault(v, []).append(k)
        new = num_seen[max(num_seen)][0]
    return new

def majorclust(graph):
    cluster_lookup = dict((node, i) for i, node in enumerate(graph.nodes))

    count = 0
    movements = set()
    finished = False
    while not finished:
        finished = True
        for node in graph.nodes:
            new = choose_cluster(node, cluster_lookup, graph.edges)
            move = (node, cluster_lookup[node], new)
            if new != cluster_lookup[node] and move not in movements:
                movements.add(move)
                cluster_lookup[node] = new
                finished = False

    clusters = {}
    for k, v in cluster_lookup.iteritems():
        clusters.setdefault(v, []).append(k)

    return clusters.values()

def get_distance_graph(documents):
    class Graph(object):
        def __init__(self):
            self.edges = {}

        def add_edge(self, n1, n2, w):
            self.edges.setdefault(n1, []).append((n2, w))
            self.edges.setdefault(n2, []).append((n1, w))

    graph = Graph()
    doc_ids = range(len(documents))
    graph.nodes = set(doc_ids)
    for a, b in combinations(doc_ids, 2):
        graph.add_edge(a, b, cosine_distance(documents[a], documents[b]))
    return graph

def get_documents():
    texts = [
        "foo blub baz",
        "foo bar baz",
        "asdf bsdf csdf",
        "foo bab blub",
        "csdf hddf kjtz",
        "123 456 890",
        "321 890 456 foo",
        "123 890 uiop",
    ]
    return [{"text": text, "tokens": text.split()}
             for i, text in enumerate(texts)]

def main(args):
    documents = get_documents()
    add_tfidf_to(documents)
    dist_graph = get_distance_graph(documents)

    for cluster in majorclust(dist_graph):
        print "========="
        for doc_id in cluster:
            print documents[doc_id]["text"]

if __name__ == '__main__':
    main(sys.argv)

对于实际应用程序,您将使用适当的令牌生成器,使用整数而不是令牌字符串,并且不计算O(n ^ 2)距离矩阵...

For real applications, you would use a decent tokenizer, use integers instead of token-strings and don't calc a O(n^2) distance-matrix...

这篇关于在Python中对文本进行聚类的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-01 20:42