I have a problem that involves determining which category a text description falls under. The descriptions are entered by users and may contain keywords that can be matched to a specific category. Each category has a set of keywords and phrases to match against. There are about 100 categories. For example, a text description might look like "Burlap aisle runner w/borders", and the category "Fabric" contains the keyword "Burlap", so the description could fall under that category:
Burlap aisle runner w/borders/Fabric
However, there are a couple of exceptions that make this categorization process more difficult.
First, some text descriptions contain keywords that match multiple categories. For example, a description could fall under 20 different categories (out of 100) because the same keywords appear in several categories. This prevents the description from being categorized correctly.
For example, the text description "Orange Burlap aisle runner w/borders" has the keyword "Orange", which falls under the category "Fruit", while it also falls under "Fabric" because of the keyword "Burlap".
Orange Burlap aisle runner w/borders/Fabric, Fruit
Second, there are keywords in the text description that do not match directly to any of the categories. Again, this does not permit the correct categorization of the text description.
For example, a text description that contains the keyword "mouse" does not match directly with the category "Computer Accessory".
Can anyone suggest an algorithm or Python library that can classify text descriptions that have no direct keyword match, and that eliminates multi-classification?
I have broken down the keywords for both the text descriptions and categories, and then matched them.
This was the code I used to match the text description with the categories.
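# For each description (a list of keywords), look up each keyword's category; unmatched keywords give None.
entries['category'] = [[categories_list.get(keyword) for keyword in description] for description in entries['text_description']]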
However, this script produces either multiple categories or no category at all.
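A minimal sketch that reproduces both failure modes, assuming a small hypothetical keyword table. The names `CATEGORY_KEYWORDS` and `match_categories` are illustrative only and are not taken from the original script:

```python
# Hypothetical keyword table (assumption); the real one has about 100 categories.
CATEGORY_KEYWORDS = {
    "Fabric": {"burlap", "linen"},
    "Fruit": {"orange", "apple"},
    "Computer Accessory": {"keyboard", "monitor"},
}


def match_categories(description):
    """Return every category whose keyword set shares a word with the description."""
    # Lowercase and split on whitespace, treating "/" as a separator ("w/borders").
    words = set(description.lower().replace("/", " ").split())
    return sorted(cat for cat, kws in CATEGORY_KEYWORDS.items() if words & kws)


# Two categories match, so the result is ambiguous.
print(match_categories("Orange Burlap aisle runner w/borders"))
# No keyword matches, so nothing is assigned.
print(match_categories("wireless mouse"))
```

Exact keyword lookup has no notion of which match matters more, and no notion of related words, which is why it returns too many categories in the first case and none in the second.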
I suggest you look up word2vec (https://skymind.ai/wiki/word2vec). Representing words as vectors also lets you vectorize phrases and sentences, which applies more context to each word. Word2vec models create better word association models.
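To illustrate the idea, here is a rough sketch of picking the single closest category by cosine similarity between averaged word vectors. The vectors below are hand-written toy values standing in for a trained word2vec model, and `WORD_VECTORS`, `CATEGORIES`, `mean_vector`, and `best_category` are hypothetical names; in practice the vectors would come from a real model.

```python
import math

# Toy 3-dimensional vectors (assumption) standing in for a trained word2vec model.
# Axes are loosely: textile-ness, food-ness, electronics-ness.
WORD_VECTORS = {
    "burlap": (0.9, 0.1, 0.0),
    "runner": (0.7, 0.0, 0.1),
    "orange": (0.3, 0.7, 0.0),
    "mouse": (0.0, 0.1, 0.9),
    "fabric": (1.0, 0.0, 0.0),
    "fruit": (0.0, 1.0, 0.0),
    "computer": (0.0, 0.0, 1.0),
    "accessory": (0.2, 0.0, 0.8),
}

CATEGORIES = ["Fabric", "Fruit", "Computer Accessory"]


def mean_vector(text):
    """Average the vectors of the known words in text; None if no word is known."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    return tuple(sum(axis) / len(vectors) for axis in zip(*vectors))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_category(description, categories=CATEGORIES):
    """Return the one category whose name vector is closest to the description."""
    desc_vec = mean_vector(description)
    if desc_vec is None:
        return None
    scored = []
    for category in categories:
        cat_vec = mean_vector(category)
        if cat_vec is not None:
            scored.append((cosine(desc_vec, cat_vec), category))
    return max(scored)[1] if scored else None


print(best_category("Orange Burlap aisle runner w/borders"))
print(best_category("wireless mouse"))
```

With these toy values, "orange" no longer pulls the description toward "Fruit" because the other words outweigh it, and "mouse" lands near "Computer Accessory" even though it is not a listed keyword. Real embeddings will not separate this cleanly.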
I would also search Google Scholar for papers including NLP AND word2vec AND NIPS AND categorization. This search yielded 4,300+ papers that would give you a lot of direction in solving your problem. If you want only one category to be chosen overall, this is a very difficult task. I saw a presentation on Mailchimp's NLP model for classifying client content into categories, and sometimes the correct category would literally be the 4th one. The model they created was very well done, but it still couldn't detect some edge cases and contained some classic biases toward more common categories over less common ones.
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C11&q=NLP+AND+word2vec+AND+categorization+AND+mailchimp&btnG=
The recommendation engine paper is relevant to your task because predicting context from a small number of words in order to make a search suggestion is a similarly complex problem.
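Since the correct category may not be ranked first, as in the Mailchimp example, one option is to keep a short ranked list instead of forcing a single label. This is only a sketch: the scores and the "Party Supplies" category are made up for illustration, and `top_k_categories` is a hypothetical helper.

```python
import heapq


def top_k_categories(scores, k=5):
    """Return the k highest-scoring (category, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical similarity scores for one description (assumption).
scores = {
    "Fruit": 0.39,
    "Fabric": 0.92,
    "Computer Accessory": 0.15,
    "Party Supplies": 0.61,
}

print(top_k_categories(scores, k=3))
```

A ranked list like this can be shown to a user for confirmation, or used to measure how often the right category appears in the top few results.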