数据库中的汉明距离/相似性搜索

本文介绍了数据库中的汉明距离/相似性搜索的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个过程，类似于 tineye 生成感知散列，这些是 32 位整数.

我打算将来将这些存储在 sql 数据库(可能是 nosql db)中

然而，我对如何根据哈希的相似性检索记录感到困惑.

有什么想法吗?

解决方案

一个常见的方法(至少对我来说是常见的)是将你的哈希位串分成几个块，然后查询这些块以获得精确匹配.这是一个预过滤"步骤.然后，您可以对返回的结果执行按位汉明距离计算，这应该只是整个数据集的一个较小子集.这可以使用数据文件或 SQL 表来完成.

简单来说:假设您在数据库中有一堆 32 位散列，并且您想找到在查询"散列的 4 位汉明距离内的每个散列:

创建一个包含四列的表:每列将包含 32 位散列的 8 位(作为字符串或整数)切片，即切片 1 到 4.
在 qslice 1 到 4 中以相同的方式对查询哈希进行切片.
查询这个表，使得 qslice1=islice1 或 qslice2=islice2 或 qslice3=islice3 或 qslice4=islice4 中的任何一个.这为您提供了查询哈希的 3 位 (4 - 1) 内的每个数据库哈希.
对于每个返回的散列，与查询散列成对计算精确的汉明距离(从四个切片重建索引端散列)

步骤 4 中的操作次数应远少于对整个表的完整成对汉明计算.

这种方法首先由 Moses Charikar 在其simhash"开创性的paper 和相应的 Google 专利:

汉明空间中的近似最近邻搜索

[...]

给定每个由 d 位组成的位向量，我们选择N = O(n 1/(1+ ) ) 位的随机排列.对于每个随机排列 σ，我们保持排序的顺序 O σ位向量，按位排列的字典顺序由 σ.给定一个查询位向量 q，我们找到近似值最近邻居通过执行以下操作:

对于每个排列 σ，我们对 O σ 执行二分搜索来定位最接近 q 的两个位向量(按照由 σ 置换的位获得的字典顺序).我们现在在每个排序顺序 O σ 检查上方和下方的元素二分查找返回的位置与 q 匹配的最长前缀的长度.

Monika Henziger 在她的论文 "查找近似重复的网页:算法的大规模评估":

3.3 算法 C 的结果

我们将每一页的位串划分为 12 个非重叠 4 字节片段，创建 20B 片段，并计算所有页面的 C 相似度，这些页面至少有一个共同点.这种方法保证找到所有差异高达 11 的页面对，即 C 相似度 373，但可能会因较大的差异而错过一些.

这也在论文检测网络抓取的近似重复中进行了解释作者:Gurmeet Singh Manku、Arvind Jain 和 Anish Das Sarma:

汉明距离问题

定义:给定一组 f 位指纹和一个查询指纹 F、识别是否存在指纹与 F 的区别最多为 k 位.(在批处理模式版本中上面的问题，我们有一组查询指纹而不是单个查询指纹)

[...]

直觉:考虑一个 2 d f 位真正随机指纹的排序表.只关注最重要的 d 位在表中.这些 d 位数字的列表相当于几乎是一个计数器"，因为 (a) 相当多的 2 d 位-存在组合，并且 (b) 很少有 d 位组合是重复.另一方面，最不显着的 f − d位几乎是随机的".

现在选择 d 使得 |d − d|是一个小整数.自从表已排序，单个探针足以识别在 d 个最高有效位位置与 F 匹配的所有指纹.由于 |d − d|小，这种匹配的数量也是预计会很小.对于每个匹配的指纹，我们可以很容易弄清楚它是否在最多 k 位位置与 F 不同与否(这些差异自然会仅限于f - d 最低有效位位置).

上述过程帮助我们找到一个现有的在 k 位位置与 F 不同的指纹，所有这些被限制在最不重要的 f - d 位之间F. 这处理了相当多的案例.覆盖所有的情况下，只需要建立少量的额外排序表，在下一节中正式概述.

PS:FWIW，这些优秀的大脑中的大多数在某种程度上或某个时间都与谷歌有关.

I have a process, similar to tineye that generates perceptual hashes, these are 32bit ints.

I intend to store these in a sql database (maybe a nosql db) in the future

However, I'm stumped at how I would be able to retrieve records based on the similarity of hashes.

Any Ideas?

解决方案

A common approach (at least common to me) is to divide your hash bit string in several chunks and query on these chunks for an exact match. This is a "pre-filter" step. You then can perform a bitwise hamming distance computation on the returned results which should be only a smaller subset of your overall dataset. This can be done using data files or SQL tables.

So in simple terms: Say you have a bunch of 32 bits hashes in a DB and that you want to find every hash that are within a 4 bits hamming distance of your "query" hash:

create a table with four columns: each will contain an 8 bits (as a string or int) slice of the 32 bits hashes, islice 1 to 4.
slice your query hash the same way in qslice 1 to 4.
query this table such that any of qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. This gives you every DB hash that are within 3 bits (4 - 1) of the query hash.
for each returned hash, compute the exact hamming distance pair-wise with you query hash (reconstructing the index-side hash from the four slices)

The number of operations in step 4 should be much less than a full pair-wise hamming computation of your whole table.

This approach was first described afaik by Moses Charikar in its "simhash" seminal paper and the corresponding Google patent:

Monika Henziger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":

This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:

PS: Most of these fine brains are/were associated with Google at some level or some time for these, FWIW.

这篇关于数据库中的汉明距离/相似性搜索的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！