Problem description
I am using tess4j (net.sourceforge.tess4j:tess4j:4.4.0) and trying to run OCR on PDF files. As I understand it, I first have to convert the PDF to TIFF or PNG (is either one recommended?), which I did like this:
tesseract.doOCR(PdfUtilities.convertPdf2Tiff(inputPdfFile));
and got the following warning:
Warning: Invalid resolution 0 dpi. Using 70 instead.
Questions
- Does it have any influence on my scan results? (If not, fine: I can switch off the warning.)
- Is there a way to set the DPI by hand, or should convertPdf handle this for me?
Recommended answer
If the image metadata contains no resolution information, Tesseract tries to estimate the resolution itself so that font size information can be calculated in the results.
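To see why the assumed resolution matters for font size, here is a minimal sketch of the standard pixel-to-point conversion (1 point = 1/72 inch). It is not Tesseract's estimation code; the class name DpiFontSize, the method pointSize, and the 50-pixel glyph height are made up for illustration.

```java
import java.util.Locale;

public class DpiFontSize {
    // Typographic points per inch (1 pt = 1/72 in).
    static final double POINTS_PER_INCH = 72.0;

    // Converts a glyph height in pixels to points for an assumed resolution.
    static double pointSize(int glyphHeightPx, int dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        return glyphHeightPx * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        int glyphHeightPx = 50; // hypothetical glyph height in the rendered image

        // The same glyph is reported at very different point sizes
        // depending on which resolution is assumed.
        System.out.println(String.format(Locale.ROOT,
                "assumed 300 dpi: %.1f pt", pointSize(glyphHeightPx, 300)));
        System.out.println(String.format(Locale.ROOT,
                "assumed 70 dpi: %.1f pt", pointSize(glyphHeightPx, 70)));
    }
}
```

With these numbers the glyph comes out at 12 pt when 300 dpi is assumed and at more than four times that when the 70 dpi fallback is assumed, so any reported font sizes depend directly on the resolution Tesseract ends up using.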
You can try the following APIs to set the input image resolution:
instance.setTessVariable("user_defined_dpi", "300");
or
TessBaseAPISetSourceResolution(TessBaseAPI handle, int ppi);
You can suppress the console output with:
instance.setTessVariable("debug_file", "/dev/null");