Problem description
When processing text, why would one need a tokenizer specialized for the language?
Wouldn't tokenizing by whitespace be enough? What are the cases where it is not a good idea to simply use whitespace tokenization?
Recommended answer
Tokenization is the identification of linguistically meaningful units (LMUs) from the surface text.
English: If you only have time for one club in Singapore, then it simply has to be Zouk.
Indonesian: Jika Anda hanya memiliki waktu untuk satu klub di Singapura, pergilah ke Zouk.
Japanese: シンガポールで一つしかクラブに行く時間がなかったとしたら、このズークに行くべきです。
Korean: 싱가포르에서 클럽 한 군데밖에 갈시간이 없다면, Zouk를 선택하세요.
Vietnamese: Nếu bạn chỉ có thời gian ghé thăm một câu lạc bộ ở Singapore thì hãy đến Zouk.
Text source: http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf
The tokenized version of the parallel text above should look like this:
For English this is simple, because each LMU is delimited/separated by whitespace. However, in other languages this might not be the case. Most romanized languages, such as Indonesian, use the same whitespace delimiter, which makes it easy to identify an LMU.
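Even so, a plain whitespace split is not a full tokenizer for English. A minimal sketch using the English example sentence above shows that punctuation stays glued to the neighboring words:

```python
# Naive whitespace tokenization of the English example sentence.
sentence = "If you only have time for one club in Singapore, then it simply has to be Zouk."
tokens = sentence.split()
print(tokens)
# Note that "Singapore," and "Zouk." each come out as one token with the
# punctuation attached; a proper tokenizer would split the comma and period off.
```

So even for whitespace-delimited languages, a real tokenizer does more than `str.split()`.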
However, sometimes an LMU is a combination of two "words" separated by spaces. E.g. in the Vietnamese sentence above, you have to read thời_gian (it means time in English) as one token and not two. Separating the two words into two tokens yields no LMU (e.g. http://vdict.com/th%E1%BB%9Di,2,0,0.html) or the wrong LMU(s) (e.g. http://vdict.com/gian,2,0,0.html). Hence a proper Vietnamese tokenizer would output thời_gian as one token rather than thời and gian.
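One common way to handle such multi-word LMUs is greedy longest-match against a lexicon of known multi-word units. A minimal sketch, using a tiny hypothetical lexicon (not a real Vietnamese dictionary):

```python
# Greedy longest-match tokenizer sketch: try the longest candidate span of
# whitespace-separated words first; fall back to a single word if no
# multi-word lexicon entry matches.
def tokenize(words, lexicon, max_len=3):
    tokens, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in lexicon:
                # Join multi-word LMUs with "_", as in thời_gian above.
                tokens.append(candidate.replace(" ", "_") if n > 1 else candidate)
                i += n
                break
    return tokens

# Hypothetical toy lexicon of multi-word LMUs (illustration only).
LEXICON = {"thời gian", "câu lạc bộ"}
words = "Nếu bạn chỉ có thời gian ghé thăm một câu lạc bộ".split()
print(tokenize(words, LEXICON))
# thời_gian and câu_lạc_bộ come out as single tokens.
```

Real Vietnamese tokenizers use much larger dictionaries and statistical disambiguation, but the core idea of matching multi-word units before single words is the same.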
For some other languages, the orthography may have no spaces to delimit "words" or "tokens", e.g. Chinese, Japanese and sometimes Korean. In that case, tokenization is necessary for a computer to identify LMUs. Often there are morphemes/inflections attached to an LMU, so in Natural Language Processing a morphological analyzer is sometimes more useful than a tokenizer.
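For unspaced scripts, a classic baseline is MaxMatch (greedy longest dictionary match over characters). A minimal sketch with a hypothetical toy dictionary; production systems use dedicated tools such as MeCab for Japanese or jieba for Chinese:

```python
# MaxMatch segmenter sketch for scripts without word delimiters:
# at each position, take the longest dictionary entry; fall back to
# a single character when nothing matches.
def maxmatch(text, dictionary, max_len=4):
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens

# Toy dictionary entries drawn from the Japanese example sentence above.
DICT = {"シンガポール", "クラブ", "時間"}
print(maxmatch("シンガポールでクラブ", DICT, max_len=6))
# → ['シンガポール', 'で', 'クラブ']
```

MaxMatch illustrates why a dictionary (and, in practice, a statistical model) is required: without one, there is no signal at all for where one token ends and the next begins.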