Problem description
Currently I am using a custom LocationTextExtractionStrategy to extract text from a PDF; it returns a TextRenderInfo[]. I would like to determine whether a TextRenderInfo object (or PDFString, child of TextRenderInfo) appears in a specific layer. I am not sure if this is possible. To get the layers in a PDF, I am using:
Dictionary<string, PdfLayer> layers;
using (var pdfReader = new PdfReader(src))
{
    var newSrc = Path.Combine(["new file location"]);
    using (var stream = new FileStream(newSrc, FileMode.Create))
    {
        PdfStamper stamper = new PdfStamper(pdfReader, stream);
        layers = stamper.GetPdfLayers();
        stamper.Close();
    }
    pdfReader.Close();
    src = newSrc;
}
To extract the text, I am using:
var textExtractor = new TextExtractionStrategy();
PdfTextExtractor.GetTextFromPage(pdfReader, pdfPageNum, textExtractor);
List<TextRenderInfo> results = textExtractor.Results;
Is there any way to check whether the individual TextRenderInfo results exist within the layers obtained in the first code snippet? Any help would be much appreciated.
Recommended answer
It is possible to get the contents from a single layer, but you'll have to jump through a few hoops to work it out. Specifically, you will have to recreate some of the logic that is provided by PdfTextExtractor and PdfReaderContentParser.
public static String GetText(PdfReader reader, int pageNumber, int streamNumber) {
    var strategy = new LocationTextExtractionStrategy();
    var processor = new PdfContentStreamProcessor(strategy);
    // the resources (fonts, XObjects, ...) come from the page dictionary
    var pageDic = reader.GetPageN(pageNumber);
    var resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);
    // assuming you still need to extract the page bytes
    byte[] contents = GetContentBytesForPageStream(reader, pageNumber, streamNumber);
    processor.ProcessContent(contents, resourcesDic);
    return strategy.GetResultantText();
}

public static byte[] GetContentBytesForPageStream(PdfReader reader, int pageNumber, int streamNumber) {
    PdfDictionary pageDictionary = reader.GetPageN(pageNumber);
    PdfObject contentObject = pageDictionary.Get(PdfName.CONTENTS);
    if (contentObject == null)
        return new byte[0];
    byte[] contentBytes = GetContentBytesFromContentObject(contentObject, streamNumber);
    return contentBytes;
}

public static byte[] GetContentBytesFromContentObject(PdfObject contentObject, int streamNumber) {
    // copy-paste logic from
    // ContentByteUtils.GetContentBytesFromContentObject(contentObject);
    // but in case PdfObject.ARRAY: only select the streamNumber you require
    // placeholder so the method compiles until that logic is filled in (needs "using System;")
    throw new NotImplementedException();
}
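The GetContentBytesFromContentObject stub above could be filled in roughly as follows. This is only a sketch and not the library's own code. It assumes iTextSharp 5.x, treats streamNumber as a zero-based index into the page's /Contents array (an assumption, since the answer does not define it), and returns an empty array when the index is out of range. The class name SingleStreamContentBytes is made up for illustration.

```csharp
using iTextSharp.text.pdf;

public static class SingleStreamContentBytes
{
    // Returns the decoded bytes of ONE content stream of a page's /Contents entry.
    // streamNumber: zero-based index into the /Contents array (assumption).
    public static byte[] GetContentBytesFromContentObject(PdfObject contentObject, int streamNumber)
    {
        switch (contentObject.Type)
        {
            case PdfObject.INDIRECT:
                // Resolve the indirect reference, then retry with the direct object.
                PdfObject direct = PdfReader.GetPdfObject(contentObject);
                return GetContentBytesFromContentObject(direct, streamNumber);

            case PdfObject.STREAM:
                // A single stream only exists at index 0.
                if (streamNumber != 0)
                    return new byte[0];
                return PdfReader.GetStreamBytes((PRStream)contentObject);

            case PdfObject.ARRAY:
                // Unlike ContentByteUtils, do not stitch all streams together:
                // pick only the requested element.
                PdfArray contentArray = (PdfArray)contentObject;
                if (streamNumber < 0 || streamNumber >= contentArray.Size)
                    return new byte[0];
                PdfObject element = contentArray.GetDirectObject(streamNumber);
                // The element is a lone stream, so it is processed as index 0.
                return GetContentBytesFromContentObject(element, 0);

            default:
                throw new System.InvalidOperationException(
                    "Unable to handle Content of type " + contentObject.GetType());
        }
    }
}
```

Note that the streams of a /Contents array are meant to be read as one concatenated stream, so a single stream processed in isolation may rely on state (for example a font selected with Tf) that was set in an earlier stream. Whether each layer really lives in its own content stream also depends on how the PDF was produced.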
If you're specifically looking to just use PdfTextExtractor or PdfReaderContentParser and ask the returned TextRenderInfo for the layer it's on, then I'm not sure it will be easily possible. There are a number of problems with that:
- TextRenderInfo doesn't store that information, so you'd have to subclass it (which is possible)
- you'd have to rewrite the logic that creates the TextRenderInfo objects. This is possible by registering custom IContentOperator objects for all text operators (Tj, TJ, ' and ") with the PdfTextExtractor or PdfReaderContentParser
- the hardest part is that you have already lost layer information in ContentByteUtils.GetContentBytesFromContentObject, so you'd need to retain that somehow, which creates its own set of problems.
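To illustrate the second point, here is a rough sketch of wrapping the four text-showing operators with a custom IContentOperator. It assumes an iTextSharp 5.x version in which PdfContentStreamProcessor.RegisterContentOperator replaces an existing registration and returns the previously registered operator (older releases may behave differently). In this sketch the registration is done on a PdfContentStreamProcessor that you create yourself. The names LabelTrackingOperator, OperatorRegistration and the label parameter are hypothetical, and the wrapper only records which label was active when text was shown; it does not add layer data to TextRenderInfo.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical wrapper: notes a caller-supplied label (e.g. a stream or layer name)
// each time a text operator runs, then delegates to the operator it replaced.
public class LabelTrackingOperator : IContentOperator
{
    private readonly string label;
    private readonly List<string> log;

    // The operator that was registered before this wrapper took its place.
    public IContentOperator Original { get; set; }

    public LabelTrackingOperator(string label, List<string> log)
    {
        this.label = label;
        this.log = log;
    }

    public void Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
    {
        log.Add(label + ": " + oper.ToString());
        // Let the default operator build the TextRenderInfo and notify the listener.
        if (Original != null)
            Original.Invoke(processor, oper, operands);
    }
}

public static class OperatorRegistration
{
    // Wraps Tj, TJ, ' and " on the given processor.
    public static void WrapTextOperators(PdfContentStreamProcessor processor, string label, List<string> log)
    {
        foreach (string op in new[] { "Tj", "TJ", "'", "\"" })
        {
            var wrapper = new LabelTrackingOperator(label, log);
            // Assumption: the previous operator is returned on replacement.
            wrapper.Original = processor.RegisterContentOperator(op, wrapper);
        }
    }
}
```

Calling OperatorRegistration.WrapTextOperators(processor, "stream 0", log) before processor.ProcessContent(...) would then tie each text-showing operation to the stream being processed, which is one way of retaining the information that is otherwise lost.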