Hashing the contents of MS Office documents without their metadata (C#)



Question

I am trying to identify files with duplicate contents, and decided to compare them using a hash (MD5, SHA1 or any other). This works fine with ".txt" files. However, with MS Office files (.doc, .docx, .xls, etc.) it fails.

The MD5/SHA1 hash is not constant for MS Office files, even if they have the same "text" content. I assume MS Office stores some kind of metadata in the file that changes each time the file is saved, so the hash is different.

For example, I have a file ABC.doc and compute its hash (Hash1). I open it, change one word and save the file. Then I undo the change, save again and compute the hash (Hash2). In this case Hash1 != Hash2. If you try the same thing on a ".txt" file, the two hashes are the same.
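The experiment above can be reproduced with a whole-file hash. The following is a minimal sketch, not the asker's actual code: the class and method names are hypothetical, and SHA-256 is used here in place of the MD5/SHA1 mentioned in the question. Because every byte of the file is hashed, any embedded metadata that changes on save also changes the digest.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, for illustration only.
public static class RawFileHash
{
    // Hashes every byte of the file, so embedded metadata is part of the digest.
    public static string Compute(string path)
    {
        using var sha = SHA256.Create();
        using var stream = File.OpenRead(path);
        byte[] digest = sha.ComputeHash(stream);
        return BitConverter.ToString(digest).Replace("-", "");
    }

    // True only when both files produce the same digest,
    // i.e. they are byte-for-byte identical barring a hash collision.
    public static bool SameBytes(string pathA, string pathB) =>
        Compute(pathA) == Compute(pathB);
}
```

Calling RawFileHash.SameBytes on the copies of ABC.doc saved before and after the edit-and-undo would be expected to return false if the saved bytes differ, which is the behaviour described in the question.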

Is there a way to de-dupe MS Office documents based on hashing their contents? Can we hash only the contents of a file and not its metadata?

Recommended answer

I don't think this can be done without first extracting the text of the document with a tool and then hashing that text. I can recommend Stellent Outside In, now owned by Oracle, though it may be overkill for your needs. It provides a tool to extract text from many types of files, including all Office file types and versions.
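As a rough illustration of the "extract the text, then hash it" idea, here is a sketch that does not use Outside In. It swaps in a much narrower technique: it handles only .docx files, reading them as ZIP archives with the standard .NET libraries. Several things are assumptions rather than part of the answer: the class and method names are hypothetical, the main document part is assumed to sit at the conventional path word/document.xml, and only the text runs of that part are read (headers, footers, footnotes and paragraph breaks are ignored). Legacy binary formats such as .doc and .xls need a real extraction tool.

```csharp
using System;
using System.IO;
using System.IO.Compression; // on .NET Framework, reference System.IO.Compression.FileSystem
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, for illustration only; .docx files only.
public static class DocxTextHash
{
    // WordprocessingML main namespace; <w:t> elements hold the visible text runs.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Concatenates the text runs of the main document part.
    // Assumption: the part lives at the conventional path word/document.xml.
    public static string ExtractText(string docxPath)
    {
        using ZipArchive archive = ZipFile.OpenRead(docxPath);
        ZipArchiveEntry entry = archive.GetEntry("word/document.xml")
            ?? throw new InvalidDataException("word/document.xml not found");
        using Stream stream = entry.Open();
        XDocument xml = XDocument.Load(stream);
        return string.Concat(xml.Descendants(W + "t").Select(t => t.Value));
    }

    // Hashes only the extracted text, so metadata parts elsewhere in the
    // archive do not affect the result.
    public static string HashText(string docxPath)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(ExtractText(docxPath));
        using var sha = SHA256.Create();
        return BitConverter.ToString(sha.ComputeHash(bytes)).Replace("-", "");
    }
}
```

Two .docx files whose text runs are identical would then produce the same DocxTextHash.HashText value even if their document properties differ. The trade-off is that documents differing only in formatting, images or paragraph structure would also be treated as duplicates, so whether this is acceptable depends on what "same contents" should mean for the de-duplication.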


08-31 09:17