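I am currently trying to implement an algorithm that finds anagrams that look like real names. I have a working solution, but some queries take a long time, and I would like to know how to improve it.
I am trying to find anagrams made of one forename and one surname, based on a database of 50k forenames and 50k surnames. The database schema is as follows: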
CREATE TABLE `forename` (
`id` int(11) NOT NULL,
`q` varchar(32) COLLATE utf8mb4_unicode_ci NOT NULL,
`label` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`labels` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`labels_length` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
CREATE TABLE `surname` (
`id` int(11) NOT NULL,
`q` varchar(32) COLLATE utf8mb4_unicode_ci NOT NULL,
`label` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`labels` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`labels_length` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
ALTER TABLE `forename`
ADD PRIMARY KEY (`id`),
ADD KEY `idx_length` (`labels_length`);
ALTER TABLE `forename` ADD FULLTEXT KEY `idx_labels` (`labels`);
ALTER TABLE `surname`
ADD PRIMARY KEY (`id`),
ADD KEY `idx_length` (`labels_length`),
ADD KEY `idx_labels` (`labels`);
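In each table, the columns have the following meaning:
label: the forename or the surname;
labels: a simplified version of the label: all letters in uppercase, sorted in alphabetical order;
labels_length: the number of characters in labels.
I currently query this database with queries generated in PHP. For Ada Lovelace, for example, the query looks like: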
select distinct A.label as forename, B.label as surname
from forename as A, surname as B
WHERE (A.labels not like '%B%' and B.labels not like '%B%') AND
(A.labels not like '%F%' and B.labels not like '%F%') AND
(A.labels not like '%G%' and B.labels not like '%G%') AND
(A.labels not like '%H%' and B.labels not like '%H%') AND
(A.labels not like '%I%' and B.labels not like '%I%') AND
(A.labels not like '%J%' and B.labels not like '%J%') AND
(A.labels not like '%K%' and B.labels not like '%K%') AND
(A.labels not like '%M%' and B.labels not like '%M%') AND
(A.labels not like '%N%' and B.labels not like '%N%') AND
(A.labels not like '%P%' and B.labels not like '%P%') AND
(A.labels not like '%Q%' and B.labels not like '%Q%') AND
(A.labels not like '%R%' and B.labels not like '%R%') AND
(A.labels not like '%S%' and B.labels not like '%S%') AND
(A.labels not like '%T%' and B.labels not like '%T%') AND
(A.labels not like '%U%' and B.labels not like '%U%') AND
(A.labels not like '%W%' and B.labels not like '%W%') AND
(A.labels not like '%X%' and B.labels not like '%X%') AND
(A.labels not like '%Y%' and B.labels not like '%Y%') AND
(A.labels not like '%Z%' and B.labels not like '%Z%') AND
(A.labels like '%A%' or B.labels like '%A%') AND
(A.labels like '%C%' or B.labels like '%C%') AND
(A.labels like '%D%' or B.labels like '%D%') AND
(A.labels like '%E%' or B.labels like '%E%') AND
(A.labels like '%L%' or B.labels like '%L%') AND
(A.labels like '%O%' or B.labels like '%O%') AND
(A.labels like '%V%' or B.labels like '%V%') AND
(A.labels_length + B.labels_length) = 11
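The explanation for this query is that the slug for Ada Lovelace is AAACDEELLOV, so I need to find forenames and surnames that contain these letters and no other letter of the alphabet. I add a filter on the number of characters to try to limit the number of rows returned. The rows returned by this query then have to be processed in PHP to check that each character is used the right number of times (for Ada Lovelace, for example, a result must contain exactly 3 As).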
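The slug rule and the per-letter post-check described above can be sketched as follows. This is a minimal illustration in Python rather than the PHP the question uses; the helper names `slug` and `is_exact_anagram` are hypothetical, and dropping non-letter characters such as spaces is an assumption.

```python
def slug(name: str) -> str:
    """Uppercase the letters of `name` and sort them alphabetically.

    Assumption: anything that is not a letter (spaces, hyphens) is dropped.
    """
    return "".join(sorted(ch for ch in name.upper() if ch.isalpha()))


def is_exact_anagram(forename: str, surname: str, target: str) -> bool:
    """True when forename + surname use each letter of `target` exactly as often.

    Two strings have identical letter counts exactly when their sorted
    slugs are equal, so a single string comparison is enough.
    """
    return slug(forename + surname) == slug(target)


if __name__ == "__main__":
    print(slug("Ada Lovelace"))
    print(is_exact_anagram("Ada", "Lovelace", "Lovelace Ada"))
```

A check like this would be applied to each candidate row returned by the SQL query, which only guarantees the right set of letters and the right total length.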
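My current database contains about 50k forenames and 50k surnames. When I search for Ada Lovelace, I get 458 SQL rows in about 0.30 seconds (in case you are wondering, these contain 11 exact anagrams).
If I change the search to Sylvain Lovelace, I get 1774 rows in 10 seconds. That is 30 times slower, and a duration that was acceptable for Ada Lovelace is now out of bounds. I tried removing the filter on the number of characters, and the duration dropped to 8 seconds, which is still too much.
I am pretty sure it should be possible to improve either the indexing of the database or the way the query is built. If anyone has any ideas, I would be very happy to try them!
If anyone wants to try with real data, a dump is available on a github repository.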
Best answer
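The main problem here is your data model. Storing forenames and surnames in two different tables makes your local slugs useless, since they have to be recombined into a global slug before they can be compared with the search slug.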
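A (slightly) less verbose approach is to check the number of occurrences of each character of the search slug. For 'AAACDEELLOV':
select a.label as forename, b.label as surname
from forename a
cross join surname b
where
char_length(a.label) + char_length(b.label) = char_length('AAACDEELLOV')
and char_length(a.label) + char_length(b.label)
- char_length(replace(upper(concat(a.label, b.label)), 'A', '')) = 3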
and char_length(a.label) + char_length(b.label)
- char_length(replace(upper(concat(a.label, b.label)), 'C', '')) = 1
and char_length(a.label) + char_length(b.label)
- char_length(replace(upper(concat(a.label, b.label)), 'D', '')) = 1
and char_length(a.label) + char_length(b.label)
- char_length(replace(upper(concat(a.label, b.label)), 'E', '')) = 2
and char_length(a.label) + char_length(b.label)
- char_length(replace(upper(concat(a.label, b.label)), 'L', '')) = 2
and char_length(a.label) + char_length(b.label)
- char_length(replace(upper(concat(a.label, b.label)), 'O', '')) = 1
and char_length(a.label) + char_length(b.label)
- char_length(replace(upper(concat(a.label, b.label)), 'V', '')) = 1
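Since the question generates its queries from code, the predicates above could be built from the search slug roughly as follows. This is only a sketch in Python (the question itself uses PHP); `build_where` is a hypothetical helper, and restricting the slug to ASCII letters is an assumption made so that letters can be safely inlined into the SQL text.

```python
from collections import Counter


def build_where(search_slug: str) -> str:
    """Build the occurrence-count WHERE clause for a search slug."""
    # Assumption: slugs contain only ASCII letters, so inlining is safe.
    if not (search_slug.isascii() and search_slug.isalpha()):
        raise ValueError("slug must contain ASCII letters only")

    total = "char_length(a.label) + char_length(b.label)"
    joined = "upper(concat(a.label, b.label))"

    # First predicate: the combined length must match the slug length.
    clauses = [f"{total} = {len(search_slug)}"]
    # One predicate per distinct letter: removing that letter must shorten
    # the combined name by exactly the number of times it occurs.
    for letter, count in sorted(Counter(search_slug).items()):
        clauses.append(
            f"{total} - char_length(replace({joined}, '{letter}', '')) = {count}"
        )
    return "where\n    " + "\n    and ".join(clauses)


if __name__ == "__main__":
    print(build_where("AAACDEELLOV"))
```

For 'AAACDEELLOV' this yields one length predicate followed by seven per-letter predicates, in the same shape as the clause shown above.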
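But most importantly, it would be better to fix the data model by generating a single table that stores each full name (forename and surname) together with its associated slug.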
create table fullnames (
id int auto_increment primary key,
name varchar(100),
slug varchar(100)
);
You can feed the new table from the old tables with a recursive CTE that generates the slug:
insert into fullnames(name, slug)
with recursive cte as (
select
concat(f.label, ' ', s.label) name,
upper(concat(f.label, s.label)) slug_name,
0 pos,
cast('' as char(1)) char_at_pos, -- cast so the recursive part can store one character
char_length(concat(f.label, s.label)) slug_length
from forename f
cross join surname s
union all
select
name,
slug_name,
pos + 1,
substring(slug_name, pos + 1, 1),
slug_length
from cte
where pos + 1 <= slug_length
)
select name, group_concat(char_at_pos order by char_at_pos separator '') slug
from cte
group by name
This table can then be queried directly:
select * from fullnames where slug = 'AAACDEELLOV';
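Of course, you could also use the result of the recursive CTE directly to search for the target slug, but I would expect the performance to be poor.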
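To illustrate the single-table lookup, here is a small self-contained sketch that uses Python's built-in sqlite3 module with an in-memory database instead of MySQL. The function names, the index on `slug`, and the tiny sample name lists are assumptions for illustration only. The slug is computed in application code here rather than with the recursive CTE.

```python
import sqlite3
from itertools import product


def slug(name: str) -> str:
    """Uppercase letters of `name`, sorted alphabetically (non-letters dropped)."""
    return "".join(sorted(ch for ch in name.upper() if ch.isalpha()))


def build_fullnames(forenames, surnames) -> sqlite3.Connection:
    """Create and fill a `fullnames` table with every forename/surname pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table fullnames ("
        " id integer primary key autoincrement,"
        " name varchar(100),"
        " slug varchar(100))"
    )
    rows = [(f"{f} {s}", slug(f + s)) for f, s in product(forenames, surnames)]
    conn.executemany("insert into fullnames(name, slug) values (?, ?)", rows)
    # Assumption: an index on slug, so the equality lookup avoids a full scan.
    conn.execute("create index idx_slug on fullnames(slug)")
    conn.commit()
    return conn


def find_anagrams(conn: sqlite3.Connection, search: str) -> list:
    """Return every stored full name whose slug equals the slug of `search`."""
    cur = conn.execute(
        "select name from fullnames where slug = ? order by name",
        (slug(search),),
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    conn = build_fullnames(["Ada", "Sylvain"], ["Lovelace"])
    print(find_anagrams(conn, "Lovelace Ada"))
```

With the slug precomputed per full name, the search becomes a single equality comparison, and no per-letter filtering or post-processing is needed.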