Problem description
I have a Windows .NET application that manages many PDF files. Some of the files are corrupt.
I have two questions (sorry for my imperfect English):
1.)
How can I detect whether a PDF file is correct?
I want to read the header of the PDF and detect whether it is correct.
var okPDF = PDFCorrect(@"C:\temp\pdfile1.pdf");
2.)
How can I tell whether the byte[] (byte array) of a file is a PDF file or not?
For example, for ZIP files, you can examine the first four bytes and see whether they match the local file header signature, which in hex is
50 4b 03 04
if (buffer[0] == 0x50 && buffer[1] == 0x4b && buffer[2] == 0x03 && buffer[3] == 0x04)
If you load these four bytes into a 32-bit integer (read as little-endian), the value is 0x04034b50 (credit: David Pierson).
I want the same for PDF files.
byte[] dataPDF = ...
var okPDF = PDFCorrect(dataPDF);
Is there any sample source code in .NET?
Recommended answer
a. Unfortunately, there is no easy way to determine whether a PDF file is corrupt. Problem files usually have a correct header, so the real cause of the corruption lies elsewhere. A PDF file is effectively a dump of PDF objects. The file contains a cross-reference table that gives the exact byte offset of each object from the start of the file. So a corrupted file most probably has broken offsets, or some objects are missing.
The best way to detect a corrupted file is to use a specialized PDF library. There are many free and commercial PDF libraries for .NET. You can simply try to load the PDF file with one of them. iTextSharp would be a good choice.
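As a cheap pre-check before handing a file to a full PDF library, the cross-reference structure described above can be sanity-checked by hand. The following is only a rough sketch, not a validator: it assumes the whole file is already in memory, that the file ends with the usual `startxref`, byte offset, `%%EOF` trailer, and that the offset points either at the `xref` keyword or at a cross-reference stream object. The class and method names (`PdfTrailerCheck`, `HasPlausibleXref`) are hypothetical. A file that passes can still be corrupt, and some viewers will repair a file that fails.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class PdfTrailerCheck
{
    // Heuristic: locate the last "startxref" near the end of the file, read the
    // byte offset that follows it, and check that something plausible sits there.
    public static bool HasPlausibleXref(byte[] data)
    {
        if (data == null || data.Length == 0) return false;

        // ISO-8859-1 maps every byte to exactly one char,
        // so string indexes in the decoded text equal byte offsets.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // Only the tail of the file is searched (2048 bytes is an arbitrary choice).
        int tailStart = Math.Max(0, data.Length - 2048);
        string tail = latin1.GetString(data, tailStart, data.Length - tailStart);

        int keyword = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0) return false;
        int eof = tail.IndexOf("%%EOF", keyword, StringComparison.Ordinal);
        if (eof < 0) return false;

        // The text between "startxref" and "%%EOF" should be a decimal byte offset.
        int numberStart = keyword + "startxref".Length;
        string number = tail.Substring(numberStart, eof - numberStart).Trim();
        long offset;
        if (!long.TryParse(number, NumberStyles.None, CultureInfo.InvariantCulture, out offset))
            return false;
        if (offset >= data.Length) return false;

        // A classic table starts with "xref"; newer files may instead point at a
        // cross-reference stream, which starts like an object ("12 0 obj").
        int length = (int)Math.Min(32, data.Length - offset);
        string target = latin1.GetString(data, (int)offset, length);
        if (target.StartsWith("xref", StringComparison.Ordinal)) return true;
        return char.IsDigit(target[0]) && target.IndexOf(" obj", StringComparison.Ordinal) > 0;
    }
}
```

A caller could use it as `bool plausible = PdfTrailerCheck.HasPlausibleXref(File.ReadAllBytes(path));` and still fall back to a real PDF library for the final verdict, since this check looks at one offset only and never verifies the individual object entries.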
b. According to the PDF Reference, the header of a PDF file usually looks like %PDF-1.X (where X is a digit, at the time of writing from 0 to 7), and 99% of PDF files have such a header. However, Acrobat Viewer accepts some other kinds of headers, and even a missing header is not a real problem for PDF viewers. So you should not treat a file as corrupt just because it lacks the header. For example, the header may appear anywhere within the first 1024 bytes of the file, or take the form %!PS-Adobe-N.n PDF-M.m
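For the byte-array check asked about in the question, a minimal sketch along the lines of the ZIP example could look like this. It searches the first 1024 bytes for the ASCII signature `%PDF-` (hex 25 50 44 46 2D) instead of insisting on offset 0. The names `PdfHeaderCheck` and `LooksLikePdf` are hypothetical, and the alternative `%!PS-Adobe-N.n PDF-M.m` form is deliberately not handled. As explained above, a match only says the data looks like a PDF; it says nothing about whether the rest of the file is intact.

```csharp
using System;
using System.IO;

public static class PdfHeaderCheck
{
    // "%PDF-" as ASCII bytes.
    private static readonly byte[] Signature = { 0x25, 0x50, 0x44, 0x46, 0x2D };

    // Returns true if "%PDF-" occurs entirely within the first 1024 bytes of the data.
    public static bool LooksLikePdf(byte[] data)
    {
        if (data == null) return false;
        int limit = Math.Min(data.Length, 1024);
        for (int i = 0; i + Signature.Length <= limit; i++)
        {
            int j = 0;
            while (j < Signature.Length && data[i + j] == Signature[j]) j++;
            if (j == Signature.Length) return true;
        }
        return false;
    }

    // Same check for a file on disk; reads at most the first 1024 bytes.
    public static bool LooksLikePdf(string path)
    {
        var buffer = new byte[1024];
        int total = 0;
        using (var stream = File.OpenRead(path))
        {
            int read;
            // Stream.Read may return fewer bytes than requested, so loop until full or EOF.
            while (total < buffer.Length &&
                   (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += read;
            }
        }
        Array.Resize(ref buffer, total);
        return LooksLikePdf(buffer);
    }
}
```

With this in place, `PdfHeaderCheck.LooksLikePdf(dataPDF)` covers the byte[] case and `PdfHeaderCheck.LooksLikePdf(path)` covers the file case. A `false` result is a reasonable signal that the data is not a PDF, while a `true` result should still be followed by loading the file with a PDF library if corruption matters.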
For your information, I am a developer of the Docotic PDF library.