问题描述
在 Microsoft SQL Server 中,可以指定不区分重音"排序规则(用于数据库、表或列),这意味着可以进行类似的查询
SELECT * FROM users WHERE name LIKE 'João'
查找具有 Joao
名称的行.
我知道可以使用 unaccent_string contrib 函数从 PostgreSQL 中的字符串中去除重音符号,但我我想知道 PostgreSQL 是否支持这些不区分重音"排序规则,所以上面的 SELECT
可以工作.
使用 unaccent 模块 - 这与您链接的内容完全不同.
unaccent 是一个文本搜索字典,可以去除重音符号(变音符号符号)来自词位.
每个数据库安装一次:
CREATE EXTENSION unaccent;
如果您收到如下错误:
错误:无法打开扩展控制文件/usr/share/postgresql//extension/unaccent.control":没有这样的文件或目录
按照相关答案中的说明在您的数据库服务器上安装 contrib 包:
除其他外,它还提供了您可以在示例中使用的函数 unaccent()
(其中似乎不需要 LIKE
).
选择 *来自用户WHERE unaccent(name) = unaccent('João');
索引
要对此类查询使用索引,请在表达式上创建 索引.然而,Postgres 只接受索引的IMMUTABLE
函数.如果函数可以为相同的输入返回不同的结果,则索引可能会静默中断.
unaccent()
仅 STABLE
而非 IMMUTABLE
不幸的是,unaccent()
只是 STABLE
,而不是 IMMUTABLE
.根据 这个pgsql-bugs上的线程,这是由于三个原因:
- 这取决于字典的行为.
- 这本词典没有硬连线连接.
- 因此它还取决于当前的
search_path
,它可以轻松更改.
一些教程在网络上指示只是将函数可变性更改为 IMMUTABLE
.这种暴力破解方法在某些情况下可能会崩溃.
其他人建议 简单的 IMMUTABLE
包装函数(就像我过去自己做的那样).
是否制作 变体一直存在争议有两个参数 IMMUTABLE
明确声明使用的字典.阅读这里或这里.
另一种选择是这个带有 IMMUTABLE unaccent()
函数的模块由 Musicbrainz 提供,在 Github 上提供.自己没有测试过.我想我想出了一个更好的主意:
目前最佳
这种方法比其他浮动解决方案更有效,也更安全.
创建一个 IMMUTABLE
SQL 包装函数,执行带有硬连线模式限定函数和字典的双参数形式.
由于嵌套非不可变函数会禁用函数内联,因此它基于 C 函数的副本,(假)声明的 IMMUTABLE
也是如此.它的唯一目的是在SQL函数包装器中使用.不能单独使用.
需要复杂性,因为无法在 C 函数的声明中硬连线字典.(需要破解 C 代码本身.)SQL 包装函数执行此操作,并允许函数内联 和 表达式索引.
创建或替换函数 public.immutable_unaccent(regdictionary, text)返回文本 LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS'$libdir/unaccent', 'unaccent_dict';创建或替换函数 public.f_unaccent(text)返回文本语言 sql IMMUTABLE PARALLEL SAFE STRICT AS$功能$选择 public.immutable_unaccent(regdictionary 'public.unaccent', $1)$函数$;
从 Postgres 9.5 或更早版本的两个函数中删除 PARALLEL SAFE
.
public
是您安装扩展的架构(public
是默认值).
显式类型声明 (regdictionary
) 可防御恶意用户使用函数的重载变体进行的假设攻击.
如果您缺乏创建 C 函数所需的权限,那么您将回到次优实现:IMMUTABLE
函数包装器围绕 STABLE
unaccent()
模块提供的函数:
创建或替换函数 public.f_unaccent(text)返回文本为$功能$SELECT public.unaccent('public.unaccent', $1) -- 模式限定函数和字典$func$ 语言 sql 不可变并行安全严格;
最后,表达式索引使查询快速:
创建索引 users_unaccent_name_idx ON users(public.f_unaccent(name));
记住在对函数或字典进行任何更改后重新创建索引涉及此函数,例如不会重新创建索引的就地主要版本升级.最近的主要版本都有 unaccent
模块的更新.
调整查询以匹配索引(因此查询计划器将使用它):
SELECT * FROM usersWHERE f_unaccent(name) = f_unaccent('João');
您不需要正确表达式中的函数.在那里,您还可以直接提供像 'Joao'
这样的无重音字符串.
使用 表达式,更快的函数不会转化为更快的查询索引.这在预先计算的值上运行,并且已经非常快了.但是索引维护和查询不使用索引的好处.
Postgres 10.3/9.6.8 等版本加强了客户端程序的安全性.您需要对用于任何索引的函数和字典名称进行模式限定.见:
连字
在 Postgres 9.5 或更早的 中,像 'Œ' 或 'ß' 之类的连字必须手动扩展(如果需要),因为 unaccent()
总是替换为单个字母:
SELECT unaccent('Œ Æ œ æ ß');不重音----------E A E A S
你会喜欢 Postgres 9.6 中对 unaccent 的更新:
扩展 contrib/unaccent
的标准 unaccent.rules
文件以处理所有Unicode 已知的变音符号,并正确扩展连字(Thomas门罗,伦纳德·贝内代蒂)
我的大胆强调.现在我们得到:
SELECT unaccent('Œ Æ œ æ ß');不重音----------OE AE OE AE SS
模式匹配
对于 LIKE
或 ILIKE
具有任意模式,将其与模块 pg_trgm
.创建三元组 GIN(通常更可取)或 GIST 表达式索引.GIN 示例:
创建索引 users_unaccent_name_trgm_idx ON users使用杜松子酒(f_unaccent(名称)gin_trgm_ops);
可用于以下查询:
SELECT * FROM usersWHERE f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');
GIN 和 GIST 索引的维护成本比普通 btree 高:
对于左锚定模式有更简单的解决方案.有关模式匹配和性能的更多信息:
pg_trgm
还提供了有用的 运算符对于相似度"(%
)和距离"(<->
).
Trigram 索引还支持带有 ~
等的简单正则表达式.和 不区分大小写 模式匹配 ILIKE
:
In Microsoft SQL Server, it's possible to specify an "accent insensitive" collation (for a database, table or column), which means that it's possible for a query like
SELECT * FROM users WHERE name LIKE 'João'
to find a row with a Joao
name.
I know that it's possible to strip accents from strings in PostgreSQL using the unaccent_string contrib function, but I'm wondering if PostgreSQL supports these "accent insensitive" collations so the SELECT
above would work.
Use the unaccent module for that - which is completely different from what you are linking to.
Install once per database with:
CREATE EXTENSION unaccent;
If you get an error like:
Install the contrib package on your database server like instructed in this related answer:
Among other things, it provides the function unaccent()
you can use with your example (where LIKE
seems not needed).
SELECT *
FROM users
WHERE unaccent(name) = unaccent('João');
Index
To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE
functions for indexes. If a function can return a different result for the same input, the index could silently break.
unaccent()
only STABLE
not IMMUTABLE
Unfortunately, unaccent()
is only STABLE
, not IMMUTABLE
. According to this thread on pgsql-bugs, this is due to three reasons:
- It depends on the behavior of a dictionary.
- There is no hard-wired connection to this dictionary.
- It therefore also depends on the current
search_path
, which can change easily.
Some tutorials on the web instruct to just alter the function volatility to IMMUTABLE
. This brute-force method can break under certain conditions.
Others suggest a simple IMMUTABLE
wrapper function (like I did myself in the past).
There is an ongoing debate whether to make the variant with two parameters IMMUTABLE
which declares the used dictionary explicitly. Read here or here.
Another alternative would be this module with an IMMUTABLE unaccent()
function by Musicbrainz, provided on Github. Haven't tested it myself. I think I have come up with a better idea:
Best for now
This approach is more efficient as other solutions floating around, and safer.
Create an IMMUTABLE
SQL wrapper function executing the two-parameter form with hard-wired schema-qualified function and dictionary.
Since nesting a non-immutable function would disable function inlining, base it on a copy of the C-function, (fake) declared IMMUTABLE
as well. Its only purpose is to be used in the SQL function wrapper. Not meant to be used on its own.
The sophistication is needed as there is no way to hard-wire the dictionary in the declaration of the C function. (Would require to hack the C code itself.) The SQL wrapper function does that and allows both function inlining and expression indexes.
CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)
RETURNS text LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS
'$libdir/unaccent', 'unaccent_dict';
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
RETURNS text LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1)
$func$;
Drop PARALLEL SAFE
from both functions for Postgres 9.5 or older.
public
being the schema where you installed the extension (public
is the default).
The explicit type declaration (regdictionary
) defends against hypothetical attacks with overloaded variants of the function by malicious users.
If you lack the necessary privileges to create C functions, you are back to the second best implementation: An IMMUTABLE
function wrapper around the STABLE
unaccent()
function provided by the module:
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
RETURNS text AS
$func$
SELECT public.unaccent('public.unaccent', $1) -- schema-qualify function and dictionary
$func$ LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT;
Finally, the expression index to make queries fast:
CREATE INDEX users_unaccent_name_idx ON users(public.f_unaccent(name));
Remember to recreate indexes involving this function after any change to function or dictionary, like an in-place major release upgrade that would not recreate indexes. Recent major releases all had updates for the unaccent
module.
Adapt queries to match the index (so the query planner will use it):
SELECT * FROM users
WHERE f_unaccent(name) = f_unaccent('João');
You don't need the function in the right expression. There you can also supply unaccented strings like 'Joao'
directly.
The faster function does not translate to much faster queries using the expression index. That operates on pre-computed values and is very fast already. But index maintenance and queries not using the index benefit.
Security for client programs has been tightened with Postgres 10.3 / 9.6.8 etc. You need to schema-qualify function and dictionary name as demonstrated when used in any indexes. See:
Ligatures
In Postgres 9.5 or older ligatures like 'Œ' or 'ß' have to be expanded manually (if you need that), since unaccent()
always substitutes a single letter:
SELECT unaccent('Œ Æ œ æ ß');
unaccent
----------
E A e a S
You will love this update to unaccent in Postgres 9.6:
Bold emphasis mine. Now we get:
SELECT unaccent('Œ Æ œ æ ß');
unaccent
----------
OE AE oe ae ss
Pattern matching
For LIKE
or ILIKE
with arbitrary patterns, combine this with the module pg_trgm
in PostgreSQL 9.1 or later. Create a trigram GIN (typically preferable) or GIST expression index. Example for GIN:
CREATE INDEX users_unaccent_name_trgm_idx ON users
USING gin (f_unaccent(name) gin_trgm_ops);
Can be used for queries like:
SELECT * FROM users
WHERE f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');
GIN and GIST indexes are more expensive to maintain than plain btree:
There are simpler solutions for just left-anchored patterns. More about pattern matching and performance:
pg_trgm
also provides useful operators for "similarity" (%
) and "distance" (<->
).
Trigram indexes also support simple regular expressions with ~
et al. and case insensitive pattern matching with ILIKE
:
这篇关于PostgreSQL 是否支持“不区分重音"?排序规则?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!