r - Doing OCR with R

I have been trying to do OCR within R (reading PDF data whose pages are stored as scanned images). I have been reading about this at <http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/>

It is a very good post.

Effectively there are 3 steps:

Convert the pdf to ppm (an image format)
Convert the ppm to tif, ready for tesseract (using ImageMagick's convert)
Convert the tif to a text file

The code for the above 3 steps, as posted in the link:

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), just pages 1-10 of the PDF
  # but you can change that easily, just remove or edit the
  # -f 1 -l 10 bit in the line below
  shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
  })

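When I run this, the first two steps work fine (although they take a good amount of time for a 4-page PDF; I will look into scalability later, and first want to see whether this works at all).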

While running the 3rd step, i.e.

shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))

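I get this error:

Error: evaluation nested too deeply: infinite recursion / options(expressions=)?

Or Tesseract crashes.

Any workaround or root cause analysis would be appreciated.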
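
One possible way to narrow this down is sketched here; it is not a confirmed diagnosis. In the failing line, shQuote() is applied to the whole command string rather than to the individual paths, and whether that is the cause is not established. The sketch passes the executable and each argument separately through system2() so that the exit status can be inspected. The helper names `tesseract_args` and `run_tesseract` are hypothetical, and the executable path in the usage comment is the one from the code above.

```r
# Hypothetical helper: build the argument vector for the tesseract CLI,
# which is called as: tesseract <image> <output base> -l <language>
tesseract_args <- function(tif, out_base, lang = "eng") {
  # quote only the paths, so spaces in file names survive
  c(shQuote(tif), shQuote(out_base), "-l", lang)
}

# Hypothetical helper: run tesseract and report a non-zero exit status
run_tesseract <- function(exe, tif, out_base, lang = "eng") {
  status <- system2(exe, args = tesseract_args(tif, out_base, lang))
  if (status != 0) warning("tesseract exited with status ", status)
  invisible(status)
}

# Usage (assumes the tif from step 2 exists):
# run_tesseract("F:/Tesseract-OCR/tesseract.exe",
#               paste0(i, ".tif"), i)
```

If the status is non-zero for a single hand-picked file, the problem likely lies with Tesseract or the input image rather than with the lapply loop.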

Best answer

The newly released tesseract package might be worth checking out. It allows you to perform the whole process inside R, without the shell calls.
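
As a minimal sketch of what a single OCR call looks like with that package, assuming the tesseract package and its English language data are installed (`ocr_image` is a hypothetical helper name, not part of the package):

```r
# Hypothetical helper (not part of the tesseract package): OCR one image file
ocr_image <- function(path, lang = "eng") {
  # fail early with a clear message if the image is missing
  if (!file.exists(path)) stop("image not found: ", path)
  # tesseract::tesseract() creates an OCR engine for the given language
  engine <- tesseract::tesseract(lang)
  # tesseract::ocr() returns the recognized text as a character string
  tesseract::ocr(path, engine = engine)
}

# Usage (assumes a file named page.png exists):
# cat(ocr_image("page.png"))
```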

Following the procedure used in the help documentation of the tesseract package, your function would look something like this:

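library(pdftools)
library(tesseract)

lapply(myfiles, function(i){
  # render the PDF to a tiff image and perform tesseract OCR on the image

  # render the first page of the PDF to a bitmap
  # (numeric = TRUE gives 0-1 real values, which tiff::writeTIFF expects)
  bitmap <- pdf_render_page(i, page = 1, dpi = 300, numeric = TRUE)
  # write the bitmap to a .tiff file
  tiff::writeTIFF(bitmap, paste0(i, ".tiff"))
  # perform OCR on the .tiff file
  out <- ocr(paste0(i, ".tiff"))
  # delete tiff file
  file.remove(paste0(i, ".tiff"))
  # return the recognized text
  out
})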
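
The loop above handles one page per PDF, since pdf_render_page() defaults to page 1. For multi-page scans, one possible extension is sketched below, assuming the pdftools and tesseract packages are installed. The function name `ocr_scanned_pdf` and the temporary file layout are hypothetical choices for illustration, not something taken from the package documentation.

```r
# Hypothetical helper: OCR every page of a scanned PDF,
# returning one character string per page
ocr_scanned_pdf <- function(pdf_path, dpi = 300, lang = "eng") {
  if (!file.exists(pdf_path)) stop("PDF not found: ", pdf_path)

  # number of pages in the PDF
  n_pages <- pdftools::pdf_info(pdf_path)$pages

  # render each page to a PNG inside a temporary directory
  out_dir <- tempfile("ocr_pages_")
  dir.create(out_dir)
  on.exit(unlink(out_dir, recursive = TRUE), add = TRUE)
  png_files <- file.path(out_dir, sprintf("page_%03d.png", seq_len(n_pages)))
  pdftools::pdf_convert(pdf_path, format = "png", dpi = dpi,
                        filenames = png_files)

  # create the engine once and reuse it for every page
  engine <- tesseract::tesseract(lang)
  vapply(png_files,
         function(f) tesseract::ocr(f, engine = engine),
         character(1), USE.NAMES = FALSE)
}

# Usage (assumes myfiles is a character vector of PDF paths):
# texts <- lapply(myfiles, ocr_scanned_pdf)
```

Rendering at 300 dpi follows the answer's setting; a lower value would run faster at the cost of recognition quality.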