Problem description
I'm quite at a loss on this subject. I've read pretty much every post about it here on SO, and I would very much appreciate it if somebody could nudge me in the right direction.
I have a PDF and would like to extract its text; I'm only interested in words and spaces. I have set up a CGPDFScanner and its callback methods. From what I have read, I only need to consider four operators as far as extracting text goes: TJ, Tj, single quote (') and double quote (").
I guess I also need to keep track of the text space to be able to determine whether the letters should be put together to form a word or should be separated by a space. But I have no idea how to do this.
In the PDF, all text is in the format
[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ
but I have not been able to figure out (using the PDF specification) what these numbers mean. Somebody on SO said that you should not be scared of the PDF specs, but frankly I do not find them very easy to read or understand.
I have studied the PDFKitten code, which was helpful.
Any help would be greatly appreciated.
Recommended answer
I cannot give you advice on how to extract words from a PDF, but the format of
[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ
is explained, for example, in the PDF 1.7 Specification, section "9.4.3 Text-Showing Operators", in its description of the TJ operator.
So the numbers are adjustments to the distance between the letters.
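To make that concrete: as far as I understand the spec, each number in a TJ array is expressed in thousandths of a unit of text space and is subtracted from the current horizontal position (in horizontal writing mode), so a negative value such as -24.2524 pushes the next glyph slightly to the right. The following is only a rough Python sketch of how one might use those numbers to guess word boundaries; it is not what PDFKitten or CGPDFScanner does. The function names and the `space_threshold` cutoff of 200 are my own assumptions, and it assumes simple literal strings whose bytes map directly to characters (no hex strings, nested parentheses, octal escapes, or font-specific encodings).

```python
import re

# Matches a literal string "(...)" (allowing backslash escapes) or a number.
_TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[-+]?(?:\d+\.?\d*|\.\d+)")


def tj_shift(adjust, font_size):
    """Horizontal shift caused by one TJ number, in text space units.

    The number is in thousandths and is subtracted from the position,
    so a negative adjustment yields a positive (rightward) shift.
    Horizontal scaling is assumed to be 100%.
    """
    return -adjust / 1000.0 * font_size


def tj_to_text(array_body, space_threshold=200.0):
    """Join the strings of a TJ array, adding a space at large gaps.

    array_body: the text between '[' and ']' of a TJ operand.
    space_threshold: hypothetical cutoff in thousandths of a text
    space unit; gaps wider than this are treated as word breaks.
    """
    parts = []
    for token in _TOKEN.findall(array_body):
        if token.startswith("("):
            # Drop the parentheses and undo the \( \) \\ escapes only.
            parts.append(re.sub(r"\\([()\\])", r"\1", token[1:-1]))
        elif -float(token) > space_threshold and parts and not parts[-1].endswith(" "):
            # A large negative adjustment widens the gap: treat it as a space.
            parts.append(" ")
    return "".join(parts)


if __name__ == "__main__":
    sample = "(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524"
    print(tj_to_text(sample))
    print(tj_shift(-24.2524, 12))
```

With the array from the question, every adjustment is far below the cutoff, so the letters are joined into a single run; at a font size of 12 each -24.2524 amounts to a shift of roughly 0.29 units, which looks like letter spacing rather than a word gap. A sensible threshold depends on the font (ideally on the width of its space glyph), and an explicit space character inside a string should of course still count as a space.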