Problem description
I am designing storage cloud software on top of a LAMP stack.
Files could have an internal ID, but there would be many advantages to storing them in the servers' filesystems not under an incrementing ID as the filename, but under a hash.
Hashes as identifiers in the database would also have a lot of advantages if the currently centralized database ever needs to be sharded or decentralized, or if some sort of master-master high-availability environment is set up. But I am not sure about that yet.
Clients can store files under any string (usually some sort of path and filename).
This string is guaranteed to be unique, because the first level is something like "buckets" that users have to register, as in Amazon S3 and Google Storage.
My plan is to store each file under the hash of the client-defined path.
This way the storage server can serve a file directly without asking the database for its ID, because it can compute the hash, and thus the filename, on the fly.
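As a sketch of this scheme (Python is used here only for illustration; the bucket and path names are made up, not from the question), the server could derive the on-disk filename like this:

```python
import hashlib

def path_to_filename(bucket: str, client_path: str) -> str:
    """Derive the on-disk filename from the client-defined path."""
    # Prefix with the bucket so identical paths in different buckets
    # still map to distinct files.
    key = f"{bucket}/{client_path}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()

name = path_to_filename("my-bucket", "photos/2011/cat.jpg")
print(name)  # 40 hex characters, usable directly as a filename
```

Because the mapping is deterministic, any storage node can recompute the filename from the request alone, with no database lookup.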
But I am afraid of collisions. I am currently thinking about using SHA-1 hashes.
I have heard that Git also uses hashes as revision identifiers.
I know that the chance of a collision is really, really low, but it is possible.
I just cannot judge this. Would you, or would you not, rely on a hash for this purpose?
I could also use some normalization of the path encoding, maybe Base64 as the filename, but I really do not want that because it could get messy, paths could get too long, and there could be other complications.
Assuming you have a hash function with "perfect" properties, and assuming cryptographic hash functions approximate that ideal, the theory that applies is the same one that underlies birthday attacks. What it says is that, given a maximum number of files, you can make the collision probability as small as you want by using a larger hash digest size. SHA-1 has 160 bits, so for any practical number of files the probability of collision is going to be just about zero. If you look at the table in the link, you'll see that a 128-bit hash with 10^10 files has a collision probability of about 10^-18.
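The birthday bound is easy to check numerically. This Python sketch (illustrative only) uses the standard small-probability approximation p ≈ n(n−1)/2^(b+1); for 10^10 files and 128 bits it gives roughly 1.5 × 10^-19, the same ballpark as the 10^-18 figure from the table:

```python
def collision_probability(n_files: int, hash_bits: int) -> float:
    # Birthday-problem approximation, valid when the result is small:
    # p ≈ n * (n - 1) / 2^(bits + 1)
    return n_files * (n_files - 1) / 2.0 ** (hash_bits + 1)

print(collision_probability(10**10, 128))  # ~1.5e-19
print(collision_probability(10**10, 160))  # ~3.4e-29 (SHA-1 digest size)
```

Note how adding 32 bits of digest shrinks the probability by a factor of about 2^32, which is what makes "just use a bigger hash" such an effective knob.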
As long as the probability is low enough I think the solution is good. Compare with the probability of the planet being hit by an asteroid, undetectable errors in the disk drive, bits flipping in your memory etc. - as long as those probabilities are low enough we don't worry about them because they'll "never" happen. Just take enough margin and make sure this isn't the weakest link.
One thing to be concerned about is the choice of the hash function and its possible vulnerabilities. Is there any other authentication in place, or does the user simply present a path and retrieve a file?
If you think about an attacker trying to brute force the scenario above, they would need on the order of 2^128 / 10^10 ≈ 10^28 requests before hitting some other random file stored in the system (again assuming a 128-bit hash and 10^10 files; you'll have far fewer files and a longer hash, which only raises the bar). That is an enormous number, and the speed at which you can brute force this is limited by the network and the server. A simple lock-the-user-out-after-x-attempts policy can completely close this hole (which is why many systems implement this sort of policy). Building a secure system is complicated and there will be many points to consider, but this sort of scheme can be perfectly secure.
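To put a number on the brute-force effort: with n stored files and a b-bit hash, a random guess hits an existing file with probability n / 2^b, so the expected number of requests before the first hit is 2^b / n. A quick Python check (illustrative):

```python
def expected_guesses(n_files: int, hash_bits: int) -> float:
    # A random hash guess hits one of the stored files with
    # probability n / 2^bits, so on average 2^bits / n guesses
    # are needed before the first hit.
    return 2.0 ** hash_bits / n_files

print(f"{expected_guesses(10**10, 128):.1e}")  # ~3.4e28 requests
```

Even at a million requests per second, 3.4 × 10^28 guesses would take on the order of 10^15 years, which is why a rate-limit or lockout policy on top makes the attack a non-issue.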
Hope this is useful...
EDIT: another way to think about this is that practically every encryption or authentication system relies, for its security, on certain events having very low probability. E.g., I could get lucky and guess the prime factors of a 512-bit RSA key, but it is so unlikely that the system is considered very secure.