Problem description
I have a bunch of PDF files. Some are regular, searchable PDFs, and some are scanned versions of documents, which are not searchable. I would like to extract the content of each PDF. To extract the content of regular PDFs I use Apache Tika, and to extract content from the non-searchable ones I use tesseract-ocr. However, I need to distinguish which PDFs are normal and which are not. Is there any way to do that?
Recommended answer
This should help:
// Imports assume Apache PDFBox 2.x; both methods belong inside a class.
import java.io.File;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Samples pages from the start, the middle, and the end of the document.
// The PDF counts as searchable only if every sampled page yields text.
public static boolean isSearchablePdf(String filePath) throws Exception {
    String parsedText;
    boolean isSearchable = true;
    File file = new File(filePath);
    PDFParser parser = new PDFParser(new RandomAccessFile(file, "r"));
    parser.parse();
    COSDocument cosDoc = parser.getDocument();
    PDFTextStripper pdfStripper = new PDFTextStripper();
    PDDocument document = new PDDocument(cosDoc);
    try {
        int noOfPages = document.getNumberOfPages();
        // Check the first five pages (page numbers are 1-based).
        for (int page = 1; page <= noOfPages && page <= 5; page++) {
            pdfStripper.setStartPage(page);
            pdfStripper.setEndPage(page);
            parsedText = pdfStripper.getText(document);
            isSearchable = isSearchablePDFContent(parsedText, page);
            if (!isSearchable) {
                break;
            }
        }
        // For longer documents, also check four random pages from page 5 onward.
        if (isSearchable && noOfPages > 10) {
            int min = 5;
            int max = noOfPages;
            for (int i = 0; i < 4; i++) {
                int randomNo = min + (int) (Math.random() * ((max - min) + 1));
                pdfStripper.setStartPage(randomNo);
                pdfStripper.setEndPage(randomNo);
                parsedText = pdfStripper.getText(document);
                isSearchable = isSearchablePDFContent(parsedText, randomNo);
                if (!isSearchable) {
                    break;
                }
            }
        }
        // Finally check the last five pages, including the final page.
        if (isSearchable && noOfPages >= 10) {
            for (int page = noOfPages - 4; page <= noOfPages; page++) {
                pdfStripper.setStartPage(page);
                pdfStripper.setEndPage(page);
                parsedText = pdfStripper.getText(document);
                isSearchable = isSearchablePDFContent(parsedText, page);
                if (!isSearchable) {
                    break;
                }
            }
        }
    } finally {
        // Always close the document, even if text extraction fails.
        document.close();
    }
    return isSearchable;
}
// A page counts as searchable once it yields at least four
// whitespace-separated tokens; pageNo is unused but kept for the callers above.
public static boolean isSearchablePDFContent(String contentOfPdf, int pageNo) throws IOException {
    return new StringTokenizer(contentOfPdf).countTokens() >= 4;
}
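To connect the check back to the question, here is a minimal sketch of how its result might route a file to one extractor or the other. It uses only the standard library so it can be run on its own. The class name PdfRoutingSketch, the methods looksSearchable and extract, and the two stand-in extractor functions are all hypothetical; in a real pipeline the stand-ins would be replaced by calls into Apache Tika and tesseract-ocr, and the page text would come from PDFTextStripper.

```java
import java.util.StringTokenizer;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of PDFBox, Tika, or Tesseract.
class PdfRoutingSketch {

    // Same threshold as the answer: the text counts as searchable once it
    // contains at least four whitespace-separated tokens.
    static boolean looksSearchable(String pageText) {
        return pageText != null
                && new StringTokenizer(pageText).countTokens() >= 4;
    }

    // Sends the file to the text-layer extractor when it is searchable,
    // and to the OCR extractor otherwise.
    static String extract(boolean searchable,
                          String path,
                          Function<String, String> textExtractor,
                          Function<String, String> ocrExtractor) {
        return searchable ? textExtractor.apply(path) : ocrExtractor.apply(path);
    }

    public static void main(String[] args) {
        // Stand-ins (assumptions) for the real Tika and Tesseract calls.
        Function<String, String> textLayer = path -> "text layer of " + path;
        Function<String, String> ocr = path -> "OCR result for " + path;

        // Assumed sample inputs: an image-only page usually yields little
        // or no text, while a regular page yields ordinary words.
        String scannedPageText = "";
        String regularPageText = "Invoice number 42 issued on Monday";

        System.out.println(extract(looksSearchable(regularPageText), "report.pdf", textLayer, ocr));
        System.out.println(extract(looksSearchable(scannedPageText), "scan.pdf", textLayer, ocr));
    }
}
```

The token threshold is a heuristic, so a scanned file that already carries an OCR text layer would be classified as searchable. Whether that is the desired outcome depends on how much that existing layer can be trusted.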