How do I extract text with pdftotext.exe and identify which parts are bold or italic?
Question
Dear friends,
I have been using pdftotext.exe to extract text from PDF files, and the accuracy of the extracted text is good. The problem is that I cannot identify bold and italic text in the output.
How can I tell whether a piece of extracted text was bold or italic?
I have tried some other tools, such as CSWTestingReflow and PDF parser, but I went with pdftotext.exe because its text accuracy is better.
Any ideas would be appreciated.
Sample code:
' Run pdftotext.exe with -layout. No output name is given, so the text is
' written to the input path with its .pdf extension replaced by .txt.
objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " -layout " & """" & sReadPDF & "_Text.pdf" & """"
''objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " " & """" & sReadPDF & "_Text.pdf" & """"
If fso.FileExists(sReadPDF & "_Text.txt") = True Then
    'Read the text file
    Set adoStreamOut = New ADODB.Stream
    'adoStreamOut.Charset = "utf-8"
    adoStreamOut.Charset = "us-ascii"
    If adoStreamOut.State Then adoStreamOut.Close
    adoStreamOut.Open
    ' Load the same path that FileExists checked above
    adoStreamOut.LoadFromFile sReadPDF & "_Text.txt"
    sText = adoStreamOut.ReadText
    adoStreamOut.Close
End If
DoEvents
' Remove form feeds (page breaks) left by pdftotext
sText = Trim(sText)
sText = Trim(Replace(sText, Chr(12), ""))
' Mark line breaks that follow ".", "?", "--" or "-" so they survive the next step
sText = Trim(Replace(sText, "." & vbCrLf, ".|||"))
sText = Trim(Replace(sText, "?" & vbCrLf, "?|||"))
sText = Trim(Replace(sText, "--" & vbCrLf, "||||||"))
sText = Trim(Replace(sText, "-" & vbCrLf, "-|||"))
' Join all remaining wrapped lines with a space
sText = Trim(Replace(sText, vbCrLf, " "))
' Restore sentence-ending line breaks, rejoin hyphenated words, restore dashes
sText = Trim(Replace(sText, ".|||", "." & vbCrLf))
sText = Trim(Replace(sText, "?|||", "?" & vbCrLf))
sText = Trim(Replace(sText, "-|||", ""))
sText = Trim(Replace(sText, "||||||", "--"))
sText = Trim(Replace(sText, "--", "—"))
' Collapse runs of spaces into a single space
Do
    sText = Trim(Replace(sText, "  ", " "))
Loop Until InStr(sText, "  ") = 0
Thanks
jai
Recommended answer