Bytes vs. characters vs. words: which granularity for n-grams?

Question

At least three types of n-grams can be considered for representing text documents:

  • byte-level n-grams
  • character-level n-grams
  • word-level n-grams
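To make the three granularities concrete, here is a minimal sketch in plain Python. The helper name `ngrams`, the sample sentence, and the choice of n are illustrative assumptions, and splitting on whitespace is only a stand-in for a real tokenizer.

```python
def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence as a list of tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


text = "Mary loves dogs"

# Byte level: iterate over the UTF-8 encoding; each item is an int in 0..255.
byte_grams = ngrams(text.encode("utf-8"), 3)

# Character level: iterate over the string; each item is a 1-character string.
char_grams = ngrams(text, 3)

# Word level: iterate over whitespace-separated tokens (a naive tokenizer).
word_grams = ngrams(text.split(), 2)

print(byte_grams[0])  # (77, 97, 114)
print(char_grams[0])  # ('M', 'a', 'r')
print(word_grams)     # [('Mary', 'loves'), ('loves', 'dogs')]
```

For pure ASCII text such as this sentence, the byte-level and character-level n-grams line up one to one; they only diverge once the text contains characters outside ASCII.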

It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs".
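The typo example can be checked with a small sketch. Jaccard similarity over n-gram sets is used here only as one simple, assumed similarity measure (the question does not name one), and the helper names `ngram_set` and `jaccard` are hypothetical.

```python
def ngram_set(seq, n):
    """Return the set of contiguous n-grams of a sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


clean = "Mary loves dogs"
typo = "Mary lpves dogs"

# Character trigrams: only the three trigrams covering the typo differ.
char_sim = jaccard(ngram_set(clean, 3), ngram_set(typo, 3))

# Word unigrams: the misspelled word no longer matches at all.
word_uni_sim = jaccard(ngram_set(clean.split(), 1), ngram_set(typo.split(), 1))

# Word bigrams: every bigram contains the misspelled word.
word_bi_sim = jaccard(ngram_set(clean.split(), 2), ngram_set(typo.split(), 2))

print(char_sim)      # 0.625
print(word_uni_sim)  # 0.5
print(word_bi_sim)   # 0.0
```

A single wrong character changes three of the thirteen character trigrams, whereas it replaces a whole token at the word level, which is why the word-bigram overlap drops to zero.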

Are there other criteria to consider for choosing the "right" representation?

Recommended answer

Evaluate. The criterion for choosing a representation is whatever works.

Indeed, the character level (which is not the same as bytes, unless you only care about English) is probably the most common representation, because it is robust to spelling differences. These need not be errors: if you look at history, spelling changes over time. So for spelling-correction purposes, this works well.
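The gap between characters and bytes shows up as soon as a character needs more than one byte. The sketch below assumes UTF-8 encoding; the sample word and the helper names are made up for illustration.

```python
def ngrams(seq, n):
    """Return all contiguous n-grams of a str or bytes object as slices."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def is_valid_utf8(chunk):
    """True if the byte string decodes as UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "M\u00fcller"  # "Müller": the u-umlaut takes two bytes in UTF-8

char_bigrams = ngrams(word, 2)                  # 6 characters -> 5 bigrams
byte_bigrams = ngrams(word.encode("utf-8"), 2)  # 7 bytes -> 6 bigrams

# Byte bigrams that cut through the two-byte character are not valid text.
broken = [g for g in byte_bigrams if not is_valid_utf8(g)]

print(len(char_bigrams), len(byte_bigrams), len(broken))  # 5 6 2
```

Two of the six byte bigrams split the two-byte character in half, so they carry no meaning as text on their own. With scripts where most characters need several bytes, a fixed byte n also covers far fewer characters.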

On the other hand, the Google Books n-gram viewer uses word-level n-grams on its books corpus, because the goal there is not to analyze spelling but term usage over time; e.g. "child care", where the individual words aren't as interesting as their combination. Word-level n-grams were also shown to be very useful in machine translation, where the approach is often referred to as the "refrigerator magnet model".
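As a rough illustration of why the combination matters more than the individual words, the sketch below counts word bigrams in a tiny made-up corpus. The sentences are invented for this example and the whitespace tokenizer is an assumption; this is not how the Google Books data is actually built.

```python
from collections import Counter


def word_bigrams(text):
    """Yield lowercase word bigrams using a naive whitespace tokenizer."""
    tokens = text.lower().split()
    return zip(tokens, tokens[1:])


# Toy corpus, invented purely for illustration.
docs = [
    "child care costs rose",
    "the child needs care",
    "affordable child care matters",
]

unigram_counts = Counter(tok for d in docs for tok in d.lower().split())
bigram_counts = Counter(bg for d in docs for bg in word_bigrams(d))

print(unigram_counts["child"], unigram_counts["care"])  # 3 3
print(bigram_counts[("child", "care")])                 # 2
```

Both words occur in all three sentences, but only the bigram count isolates the two sentences that actually use the term "child care".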

If you are not processing international (non-English) text, bytes may be meaningful, too.

