How to detect certain Unicode characters in a Ruby string

Question

Given a string in Ruby 1.8.7 (without the awesome Oniguruma regular expression engine that supports Unicode properties with \p{}), I would like to be able to determine if the string contains one or more Chinese, Japanese, or Korean characters; i.e.

class String
  def contains_cjk?
    ...
  end
end

>> '日本語'.contains_cjk?
=> true
>> '광고 프로그램'.contains_cjk?
=> true
>> '艾弗森将退出篮坛'.contains_cjk?
=> true
>> 'Watashi ha bakana gaijin desu.'.contains_cjk?
=> false

I suspect that this will boil down to checking whether any of the characters in the string fall in the Unihan CJKV Unicode blocks, but I figured it was worth asking if anyone knows of an existing solution in Ruby.
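
The block-based check suspected above could look roughly like the following. This is only a sketch: the helper name contains_cjk_by_block? and the CJK_RANGES table are made up for illustration, the ranges are a partial selection of CJK-related Unicode blocks rather than a complete list, and the input is assumed to be valid UTF-8.

```ruby
# encoding: UTF-8
# Illustrative subset of CJK-related Unicode blocks (not exhaustive).
CJK_RANGES = [
  0x1100..0x11FF,    # Hangul Jamo
  0x3040..0x309F,    # Hiragana
  0x30A0..0x30FF,    # Katakana
  0x3400..0x4DBF,    # CJK Unified Ideographs Extension A
  0x4E00..0x9FFF,    # CJK Unified Ideographs
  0xAC00..0xD7AF,    # Hangul Syllables
  0xF900..0xFAFF,    # CJK Compatibility Ideographs
  0x20000..0x2A6DF   # CJK Unified Ideographs Extension B
]

# Hypothetical helper, not part of the original question or answer.
def contains_cjk_by_block?(str)
  # unpack('U*') decodes the UTF-8 bytes into integer code points,
  # so no Unicode-aware regexp engine is needed.
  str.unpack('U*').any? do |code_point|
    CJK_RANGES.any? { |range| range.include?(code_point) }
  end
end

puts contains_cjk_by_block?('日本語')                          # true
puts contains_cjk_by_block?('Watashi ha bakana gaijin desu.')  # false
```

Because it only relies on String#unpack and integer ranges, this approach should not depend on \p{} support, though malformed UTF-8 input will raise an ArgumentError.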

Recommended answer

(ruby 1.9.2)

#encoding: UTF-8
class String
  def contains_cjk?
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end

strings= ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each{|s| puts s.contains_cjk?}

#true
#true
#true
#false

\p{} matches a character's Unicode script.
The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.

Wow. Source: the Ruby Regexp documentation.
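
Since each \p{} property names a single script, the same mechanism can also report which of the CJK scripts occur in a string. The sketch below assumes Ruby 1.9 or later; the CJK_SCRIPTS table and the cjk_scripts_in helper are hypothetical names introduced here, not part of the answer above.

```ruby
# encoding: UTF-8
# Hypothetical lookup table: one script property per entry.
CJK_SCRIPTS = {
  :han      => /\p{Han}/,
  :hiragana => /\p{Hiragana}/,
  :katakana => /\p{Katakana}/,
  :hangul   => /\p{Hangul}/
}

# Returns the names of the scripts that match at least one character,
# in the order they are listed in CJK_SCRIPTS.
def cjk_scripts_in(str)
  CJK_SCRIPTS.select { |_name, pattern| str =~ pattern }.keys
end

p cjk_scripts_in('日本語のカタカナ')  # [:han, :hiragana, :katakana]
p cjk_scripts_in('광고 프로그램')      # [:hangul]
p cjk_scripts_in('Watashi ha')        # []
```

An empty result corresponds to contains_cjk? returning false for the same four scripts.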
