Is there a Python text-mining script to classify text that matches multiple categories?

Question

Classifying descriptions into categories

I have a problem that involves determining which category a text description falls under. The descriptions are entered by users and may contain keywords that match a specific category. Each category has its own set of keywords and phrases, and there are about 100 categories. For example, a description might read "Burlap aisle runner w/borders"; the category "Fabric" contains the keyword "Burlap", so the description would fall under "Fabric".

Text description | Category
Orange Burlap aisle runner w/borders | Fabric

However, a couple of exceptions make this categorization process more difficult.

First, some text descriptions contain keywords that match multiple categories. For example, a single description could fall under 20 of the 100 categories because the same keywords appear in several categories' keyword sets. This prevents the description from being categorized correctly.

For example, the description "Orange Burlap aisle runner w/borders" contains the keyword "Orange", which falls under the category "Fruit", and also the keyword "Burlap", which falls under "Fabric".

Text description | Category
Orange Burlap aisle runner w/borders | Fabric, Fruit

Second, some keywords in the text descriptions do not match any category directly. Again, this prevents the description from being categorized correctly.

For example, a text description that contains the keyword "mouse" does not match the category "Computer Accessory" directly.

Can anyone suggest an algorithm or Python library that can classify text descriptions without relying on direct keyword matches, and that assigns a single category instead of several?

I have broken both the text descriptions and the categories down into keywords, and then matched them against each other.

This is the code I used to match the text descriptions with the categories.

%LivyPy3.pyspark

# Map each keyword of each description to its category (None when unmatched).
entries['category'] = [[categories_list.get(k) for k in kws] for kws in entries['text_description']]

However, this script yields either multiple categories or no category at all.
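To make the two failure modes concrete, here is a minimal reproduction. It assumes that categories_list is a plain dict mapping a lowercase keyword to a category name and that each description is already a list of lowercase keywords; neither structure is shown in the question, so the sample data and the helper name match_categories are hypothetical.

```python
# Hypothetical data: the question does not show these structures.
categories_list = {
    "burlap": "Fabric",
    "orange": "Fruit",
}

descriptions = [
    ["orange", "burlap", "aisle", "runner"],  # keywords from two categories
    ["wireless", "mouse"],                    # no keyword in any category
]


def match_categories(keywords, lookup):
    """Same lookup as the one-liner: one result per keyword, None if unmatched."""
    return list(map(lookup.get, keywords))


raw = [match_categories(kws, categories_list) for kws in descriptions]

# Dropping None and duplicates makes the two failure modes explicit:
# more than one distinct category, or an empty list.
distinct = [sorted({c for c in row if c is not None}) for row in raw]

print(raw)
print(distinct)
```

The first description ends up with both "Fabric" and "Fruit", and the second with nothing, which is exactly the behaviour described above.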

Recommended answer

I suggest you look up word2vec (https://skymind.ai/wiki/word2vec). Turning words into vectors also lets you vectorize phrases and sentences, which brings more context to each word. Word2vec models produce better word associations.
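As a rough illustration of how word vectors could address both problems, the sketch below averages the vectors of a description's words and picks the category whose keyword vectors are closest by cosine similarity. The three-dimensional vectors are invented for the example and the names TOY_VECTORS, mean_vector and best_category are hypothetical; in practice the vectors would come from a trained word2vec model, and nothing here is the answerer's own implementation.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. Real vectors
# would come from a trained word2vec model and have far more dimensions.
TOY_VECTORS = {
    "burlap":   [0.9, 0.1, 0.0],
    "fabric":   [0.8, 0.2, 0.1],
    "runner":   [0.6, 0.1, 0.2],
    "orange":   [0.3, 0.8, 0.0],
    "fruit":    [0.0, 1.0, 0.1],
    "mouse":    [0.0, 0.1, 0.9],
    "computer": [0.1, 0.0, 1.0],
}


def mean_vector(words):
    """Average the vectors of the known words; None if no word is known."""
    known = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_category(description_words, category_words):
    """Return the category whose keyword centroid is closest to the description."""
    desc_vec = mean_vector(description_words)
    if desc_vec is None:
        return None
    scores = {}
    for name, words in category_words.items():
        cat_vec = mean_vector(words)
        if cat_vec is not None:
            scores[name] = cosine(desc_vec, cat_vec)
    return max(scores, key=scores.get) if scores else None


# Hypothetical category keyword lists.
CATEGORIES = {
    "Fabric": ["fabric", "burlap"],
    "Fruit": ["fruit", "orange"],
    "Computer Accessory": ["computer"],
}

print(best_category(["orange", "burlap", "runner"], CATEGORIES))
print(best_category(["mouse"], CATEGORIES))
```

With these toy vectors the first description resolves to a single category, "Fabric", even though it contains "orange", and "mouse" resolves to "Computer Accessory" without being one of that category's keywords. Whether real embeddings behave this well depends on the model and the vocabulary.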

I would also search Google Scholar for papers matching NLP AND word2vec AND NIPS AND categorization. That search yielded 4,300+ papers, which should give you a lot of direction for solving your problem. If you want exactly one category to be chosen out of all of them, this is a very difficult task. I saw a presentation on Mailchimp's NLP model for classifying client content into categories, and sometimes the correct category would literally be the 4th one in the ranking. The model they created was very well done, but it still couldn't detect some edge cases and it contained some classic biases toward more common categories over less common ones.
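Because the correct category is not always ranked first, one option is to return a short ranked list of candidates instead of a single label. The sketch below is one way to do that with plain keyword overlap; it swaps in a simple rarity weighting (keywords listed by many categories count less), which is my assumption and not something taken from the Mailchimp presentation. The keyword sets and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical keyword sets; the real ~100 categories are not shown in the question.
CATEGORY_KEYWORDS = {
    "Fabric": {"burlap", "linen", "runner"},
    "Fruit": {"orange", "apple"},
    "Paint": {"orange", "primer"},
    "Flooring": {"runner", "tile"},
}


def keyword_weights(category_keywords):
    """Weight each keyword by log(N / number of categories that list it) + 1."""
    n = len(category_keywords)
    counts = Counter(k for kws in category_keywords.values() for k in kws)
    return {k: math.log(n / c) + 1.0 for k, c in counts.items()}


def rank_categories(words, category_keywords, top_k=3):
    """Return up to top_k (category, score) pairs, best first, skipping zero scores."""
    weights = keyword_weights(category_keywords)
    scores = {
        name: sum(weights[w] for w in set(words) & kws)
        for name, kws in category_keywords.items()
    }
    # Sort by descending score, then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(name, score) for name, score in ranked[:top_k] if score > 0]


ranking = rank_categories(["orange", "burlap", "aisle", "runner"], CATEGORY_KEYWORDS)
print(ranking)
```

Here "Fabric" comes first because it matches two keywords, one of them unique to it, while the categories that match only a shared keyword follow with equal scores. A description with no matching keyword returns an empty list, which can be handed to a fallback such as an embedding-based comparison.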

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C11&q=NLP+AND+word2vec+AND+categorization+AND+mailchimp&btnG=

The recommendation engine paper is relevant to your task, because predicting the context of a small number of words in order to make a search suggestion is a similarly complex problem.


08-13 19:01