Problem description
I have a PDF file in Arabic whose text uses a Type 3 font. When I extract the text using PDFBox, some characters are empty and their font is null. What is the problem?
Code:
protected void processTextPosition(TextPosition text) {
    String character = text.getCharacter();      // is empty
    String font = text.getFont().getBaseFont();  // equals null
}
Stream obtained using iText: (dJ v{dW cG )Tj
I am asking about these question marks: why do I get the characters in this format?
These question marks appear in my stream as "SOH-STX-ETX-EOT", not as a single character. The characters inside the PDF are shown as 'd' and 'J'!
Recommended answer
A Type 3 font is a user-defined font. For instance, a user can define that the character 'P' corresponds to the symbol for "The Artist Formerly Known As Prince" (TAFKAP), which is a glyph but not a letter from any known alphabet.
A glyph in a Type 3 font is a series of lines and shapes, and a program such as iText or PDFBox has no way to determine which character was meant. It is only normal that you get a question mark. For instance: which character would you use for this symbol?
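To make the point above concrete, here is a minimal sketch in plain Java. It is a simplified model, not the PDFBox or iText API: the class name Type3DecodeSketch, the decode method, and the lookup table are all hypothetical. It assumes that a text extractor turns each byte of a string operand into text by looking the byte up in a code-to-Unicode table, and that a Type 3 font may leave some codes out of that table. Codes with no entry cannot be turned into a character, so the sketch emits the Unicode replacement character for them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Type3DecodeSketch {

    // Placeholder emitted when a character code has no known Unicode value.
    static final String UNKNOWN = "\uFFFD";

    // Maps each byte of a string operand to text through a code-to-Unicode
    // table. Codes missing from the table become the placeholder.
    static String decode(byte[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // treat the byte as an unsigned character code
            out.append(toUnicode.getOrDefault(code, UNKNOWN));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical table: only the codes for 'd' and 'J' have a meaning.
        Map<Integer, String> toUnicode = new LinkedHashMap<>();
        toUnicode.put((int) 'd', "d");
        toUnicode.put((int) 'J', "J");

        // Codes 0x01 and 0x02 (the control codes SOH and STX) are not in the
        // table, so they cannot be mapped to a readable character.
        byte[] operand = {0x01, 'd', 0x02, 'J'};
        System.out.println(decode(operand, toUnicode));
    }
}
```

In this model the codes 0x01 and 0x02 are only indexes into the font's glyph procedures. Nothing in the byte value itself says which letter the drawn shape represents, which is why an extractor can only report a placeholder.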
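One of the following reasons applies to a PDF that contains Type 3 fonts:
- The font is used to introduce symbols that do not exist in any font.
- The font is used to obfuscate the content of the PDF so that its content cannot be extracted.
- The PDF was not created in an elegant way.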
If the Type 3 font was used for normal characters, you'll need to use OCR to convert the content to normal text.