问题描述
在这种情况下,我在HDD上有文件,并且想要在数据库中缓存有关它们的信息.鉴于这些文件中的某些文件可能会占用GB的空间,因此这些信息原本需要很长时间才能解析.
I have this situation where I have files on the HDD and I want to cache information about them in a database. Information that would otherwise take a long time to parse given that some of these files can run into GBs.
我的第一个直觉是将文件路径用作文件的唯一标识符,然后将其用作键(TEXT/VARCHAR)并将信息作为值存储在数据库表中.
My first intuition is to use the file path as a unique identifier for a file and I use that as the key (TEXT/VARCHAR) and store the information as value in a database table.
考虑到在某些文件系统中(尤其是在* nix中),文件路径的长度可以是无限的.在数据库中使用文件名作为主键似乎是一个坏主意.仅在字符串字段上建立索引会慢很多,更不用说内存/空间限制了.
Given that under some file systems (especially in *nix), file paths can be of unlimited length. It seems like a bad idea to use file name as a primary key in a database. Just indexing on a string field is going to be much slower, not to mention memory/space constraints.
我想,也许是我从完整文件路径(/usr/xxx/1/2/../abc.xyz)生成SHA-256哈希,并将其用作数据库中的主键(固定宽度).另一个想法是从文件内容生成SHA-256哈希.但是,这也会变得很耗时.
I thought, maybe, I generate SHA-256 hash from the full file path (/usr/xxx/1/2/../abc.xyz) and use this as primary key (fixed width) in my database. Another thought, would be to generate the SHA-256 hash from file contents. However, this can also become quite time consuming.
我的问题是-在这种情况下,哈希冲突的可能性同样较小,因为此出色的.
My question is - in this scenario, are hash collisions as equally less likely, as the answer provided on this excellent thread.
推荐答案
要回答您的问题,仅当您要处理表中的2 ^ 128个文件时,才可能出现哈希冲突问题.假定所有输入的长度都在0 .. + INF之间,并且您使用的哈希算法是 perfect (在实践中SHA-256被认为是 perfect ,但未经证明从理论上来说),并且输出大小恰好是256位.
To answer your question, you would only be likely have issues with hash collisions if you were to approach 2^128 files in your table. This assumes that all inputs are between 0 .. +INF in length and that the hash algorithm you are using is perfect (SHA-256 is considered perfect in practice but not proven in theory) and that the output size is exactly 256 bits.
如果文件少于几十亿,就可以了.
If you have under a few billion files, you should be fine.
现在为我推荐.我要说的是,您需要告诉我们有关您的预期用途的更多信息.您的第一个想法比散列方法更接近正确.
Now for my recommendation. I would say that you need to tell us more information about your intended use. Your first thought is closer to correct than your hashing approach.
我将使用这样的表(SQL Server的T-SQL语法):
I would use a table like this (T-SQL Syntax for SQL Server):
CREATE TABLE [File]
(
[Id] BIGINT IDENTITY NOT NULL,
[Path] CHARACTER VARYING(MAX) NOT NULL
PRIMARY KEY([Id])
);
CREATE NONCLUSTERED INDEX [File_Path_IX] ON [File]([Path]);
然后,您应该让数据库负责建立索引并快速进行搜索.当且仅当您在以后遇到主要性能问题时(由性能分析证明),您才应考虑更改为哈希方法.散列会给您的预处理带来巨大的计算损失,并且会引入复杂的场景,例如散列冲突,并在发生这种情况时以及尝试解决它们时进行尝试.
Then, you should let your database take care of indexing and making the searches fast.If and only if you experience a major performance issue later down the road, demonstrated by profiling, should you consider changing to a hashing approach. The hashing will impose massive computational penalty on your preprocessing and will introduce complicated scenarios such as hash collisions and trying to resolve them if and when they do happen.
这篇关于当使用SHA-256哈希作为主键时,可以忽略冲突的可能性吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!