This question has been asked for other programming languages, but how would you perform an accent-insensitive regex match in Ruby?
My current code is something like:
scope :by_registered_name, ->(regex) {
  where(:name => /#{Regexp.escape(regex)}/i)
}
I thought maybe I could replace every character that is neither alphanumeric nor whitespace with a dot and remove the escape, but is there a better way? I'm afraid I would match weird things if I did that...
I am targeting French right now, but if I could also fix it for other languages that would be cool.
I am using Ruby 2.3 if that can help.
I realize my requirements are actually a bit stronger: I also need to catch things like dashes, etc. I am basically importing a school database (URL here, the tag is <nom>), and I want people to be able to find their school by typing its name. Both the stored names and the search query may contain accents, so I believe the easiest way would be to make both insensitive.
- "Télécom" should be matched by "Telecom"
- "établissement" should be matched by "etablissement"
- "Institut supérieur national de l'artisanat - Chambre de métiers et de l'Artisanat en Moselle" should be matched by "artisanat chambre de métiers"
- "Ecole hôtelière d'Avignon (CCI du Vaucluse)" should be matched by "Ecole hoteliere d'avignon" (it's okay to skip the part in parentheses)
- "Ecole française d'hôtesses" should be matched by "ecole francaise d'hot"
I also found some crazy stuff in that DB, so I think I will consider sanitizing this input:
- "Académie internationale de management - Hotel & Tourism Management Academy" should be matched by "Hotel Tourism" (note that the & is actually written &amp; in the XML)
It looks like the solution for MongoDB is to use a text index, which is diacritic insensitive. French is supported.
It's been a long time since I last used MongoDB, but if you're using Mongoid I think you would create a text index in your model like this:
index(name: "text")
...and then search like this:
scope :by_registered_name, ->(str) {
  where(:$text => { :$search => str })
}
Consult the documentation for the $text query operator for more information.
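Original (wrong) answer
It turns out I was thinking about this problem backwards when I originally wrote this answer. I'm keeping it because it might still come in handy: if you're using a database that doesn't offer a feature like MongoDB's, a possible workaround is to use the technique below to store a sanitized name alongside the original name, and then sanitize queries the same way.
Since you're using Rails, you can use the handy ActiveSupport::Inflector.transliterate: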
regex = /aäoöuü/
transliterated = ActiveSupport::Inflector.transliterate(regex.source, '\?')
# => "aaoouu"
new_regex = Regexp.new(transliterated)
# => /aaoouu/
Or simply:
Regexp.new(ActiveSupport::Inflector.transliterate(regex.source, '\?'))
You'll note that I supplied '\?' as the second argument. This is the replacement string used for any character that has no ASCII approximation. I supplied it because the default replacement string is "?", which as you know has special meaning in a regular expression.
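To illustrate why the escaped replacement matters, here is a minimal sketch in plain Ruby. The helper name whole_match? is hypothetical and exists only for this example; it builds an anchored regex from a pattern source, the same way Regexp.new is used above.

```ruby
# Hypothetical helper for illustration: true when the whole text matches
# the regular expression built from pattern_source.
def whole_match?(pattern_source, text)
  !(Regexp.new("\\A#{pattern_source}\\z") =~ text).nil?
end

# A bare "?" is a quantifier: it makes the preceding "f" optional,
# so the pattern matches "ca" and "caf" but not the literal text "caf?".
bare_matches_ca      = whole_match?("caf?", "ca")
bare_matches_literal = whole_match?("caf?", "caf?")

# An escaped "\?" stands for a literal question mark.
escaped_matches_literal = whole_match?('caf\?', "caf?")
escaped_matches_ca      = whole_match?('caf\?', "ca")
```

With the default "?" replacement, a character that could not be transliterated would silently turn the preceding character into an optional one instead of matching a question mark.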
Also note that ActiveSupport::Inflector.transliterate does a little bit more than the similar I18n.transliterate. Here's its source:
def transliterate(string, replacement = "?")
  I18n.transliterate(ActiveSupport::Multibyte::Unicode.normalize(
    ActiveSupport::Multibyte::Unicode.tidy_bytes(string), :c),
    :replacement => replacement)
end
The innermost method call, ActiveSupport::Multibyte::Unicode.tidy_bytes, cleans up any invalid UTF-8 characters.
More importantly, ActiveSupport::Multibyte::Unicode.normalize "normalizes" the characters. For example, ê can look like one character but actually be two: LATIN SMALL LETTER E followed by COMBINING CIRCUMFLEX ACCENT. Calling I18n.transliterate on that two-character form would yield e?, which probably isn't what you want. So normalize is called first to turn it into ê as a single character, LATIN SMALL LETTER E WITH CIRCUMFLEX, which transliterates to a plain e. That is why the normalize step before transliterate is important. (If you're interested in how that works, read about Unicode equivalence and normalization.)
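The composed versus decomposed distinction can be checked in plain Ruby, without Rails, using String#unicode_normalize (in the standard library since Ruby 2.2). The sketch below also shows one way the same decomposition could be used to build an accent-insensitive key for names like the ones in the question. The helpers search_key and name_matches? are hypothetical names for this example, and the approach is an assumption of mine rather than what ActiveSupport does internally.

```ruby
decomposed = "e\u0302" # LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT
composed   = "\u00EA"  # LATIN SMALL LETTER E WITH CIRCUMFLEX

decomposed.length                              # => 2
composed.length                                # => 1
decomposed == composed                         # => false
decomposed.unicode_normalize(:nfc) == composed # => true

# Hypothetical helper: reduces a name to lowercase ASCII letters and digits
# separated by single spaces. Letters that have no canonical decomposition
# (for example "œ" or "ø") are not handled by this sketch.
def search_key(text)
  text.unicode_normalize(:nfd) # split letters from their combining marks
      .gsub(/\p{Mn}/, "")      # drop the combining marks (the accents)
      .downcase
      .gsub(/[^a-z0-9]+/, " ") # turn punctuation and whitespace runs into one space
      .strip
end

# Hypothetical helper: substring match on the sanitized forms of both sides.
def name_matches?(stored_name, query)
  search_key(stored_name).include?(search_key(query))
end

name_matches?("Ecole française d'hôtesses", "ecole francaise d'hot") # => true
```

In a database-backed setup, the result of search_key would be stored next to the original name and the same function applied to each incoming query. XML entities such as &amp; would need to be decoded before calling it.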