Problem description
I am using Apache Tika on Windows 10 with JRE 1.8.0_181. I imported Tika through Maven with the following dependencies:
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.21</version>
    </dependency>
</dependencies>
I have the code below for performing OCR with Tesseract (I have tested Tesseract independently and know it works):
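// Imports needed by this method
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.ocr.TesseractOCRParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// OCR_IMAGE is a String constant (path to the image) defined elsewhere in the class.
public static void OCRTest() {
    try {
        BufferedImage im = ImageIO.read(new File(OCR_IMAGE));

        // Point Tika at the local Tesseract installation.
        TesseractOCRConfig config = new TesseractOCRConfig();
        config.setTessdataPath("C:\\Program Files\\Tesseract-OCR\\tessdata");
        config.setTesseractPath("C:\\Program Files\\Tesseract-OCR");
        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);

        TesseractOCRParser parser = new TesseractOCRParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        try {
            // The exception below is thrown from inside this call.
            parser.parse(im, handler, metadata, parseContext);
            System.out.println(handler.toString());
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (TikaException e) {
            e.printStackTrace();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}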
I get the following exception:
org.apache.tika.exception.TikaException: Failed to close temporary resources
    at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:174)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:251)
    at test.test.App.OCRTest(App.java:46)
    at test.test.App.main(App.java:30)
Caused by: java.nio.file.FileSystemException: C:\Users\m\AppData\Local\Temp\apache-tika-2643805894084124300.tmp: The process cannot access the file because it is being used by another process.
The tmp file is still present in the Temp folder, and the exception appears to come from Tika being unable to delete it. There is a post on the Apache Tika forums where someone hit the same exception, although with the AutoDetectParser rather than Tesseract. Their issue seemed to be a conflict between imported jars, but I run into this even with only the Apache Tika libraries installed.
I don't run into this issue when using Tika's AutoDetectParser, only when using the TesseractOCRParser directly. Any insights on how to fix the exception would be appreciated!
Recommended answer
I posted this on the Apache Tika issue tracker (https://issues.apache.org/jira/browse/TIKA-2908). The problem came from the order in which TesseractOCRParser closed its open streams. You can see the changes that were made here: https://github.com/apache/tika/commit/8d386f827eb31e7f1cb189ce942c67a84a0c6bdc?diff=unified#diff-592f390e7558bb6a1fe1c5bc810fe4c8
For now, anyone who runs into this issue can subclass TesseractOCRParser locally to include the above changes, which should be pushed in the next snapshot release.
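To illustrate why the closing order matters, here is a minimal sketch that uses only the JDK. It is not the Tika patch itself; the class TempFileCloseOrder and the helper readThenDelete are hypothetical names made up for this example. It assumes the failure mode described in the exception message above: Windows generally refuses to delete a file while a stream on it is still open, so the stream has to be closed before the temporary file is deleted.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempFileCloseOrder {

    // Hypothetical helper: writes data to a temp file, reads it back,
    // then deletes the file. Returns the number of bytes read.
    public static int readThenDelete(byte[] data) {
        try {
            Path tmp = Files.createTempFile("close-order-demo-", ".tmp");
            int count = 0;
            try {
                Files.write(tmp, data);
                // try-with-resources closes the stream at the end of this block,
                // that is, BEFORE the finally clause tries to delete the file.
                try (InputStream in = new FileInputStream(tmp.toFile())) {
                    while (in.read() != -1) {
                        count++;
                    }
                }
            } finally {
                // Safe at this point: no stream on tmp is open any more.
                // If the delete ran while the stream was still open, Windows
                // could fail with "being used by another process".
                Files.delete(tmp);
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readThenDelete(new byte[] {1, 2, 3}));
    }
}
```

A local subclass of TesseractOCRParser would need to follow the same ordering as the linked commit: close every stream that reads from the temporary file first, and only then dispose of the temporary resources.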
Thanks to Tim @ Apache Tika!