Problem description
I need to be able to parse the text contained in a file that is hosted online at a given URL, e.g. http://website.com/document.pdf.
I am building a search engine that tells me whether a searched word appears in some file online and returns that file's URL, so I don't need to download the file, only to read it.
I looked for a way to do this and found something involving InputStream and openConnection, but I didn't manage to get it working.
I am using jsoup to crawl a website and collect the URLs, and I tried to parse the file with a Jsoup method as well, but that does not work.
So what is the best way to do this?
I want to be able to do something like this:
File in = new File("http://website.com/document.pdf");
Document doc = Jsoup.parse(in, "UTF-8");
System.out.println(doc.toString());
Recommended answer
You can use a URL instead of a File to access the remote document. With Apache Tika, you should then be able to grab the content as a string this way:
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://website.com/document.pdf");
        ContentHandler contenthandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        PDFParser pdfparser = new PDFParser();
        // Open a stream to the remote file; the bytes are read as they
        // arrive and nothing is written to disk.
        try (InputStream is = url.openStream()) {
            pdfparser.parse(is, contenthandler, metadata, new ParseContext());
        }
        System.out.println(contenthandler.toString());
    }
}
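The Tika example above targets PDF. For the plain-text case, and for the "is the searched word in this file?" check the question describes, the JDK alone may be enough. The following is only a minimal sketch under a few assumptions: the class name UrlWordSearch and the helper containsWord are hypothetical names introduced here, the file is assumed to be UTF-8 text, the URL is a placeholder, and the match is a case-insensitive substring test rather than a true whole-word match.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, not part of the original answer.
class UrlWordSearch {

    // Returns true if any line read from the stream contains the given
    // text (case-insensitive substring match). Assumes UTF-8 content.
    static boolean containsWord(InputStream in, String word) throws IOException {
        String needle = word.toLowerCase(Locale.ROOT);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                    return true; // stop reading as soon as a match is found
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; replace with a real text file.
        URL url = new URL("http://website.com/document.txt");
        // The stream is consumed line by line and never saved to a file.
        try (InputStream in = url.openStream()) {
            System.out.println(containsWord(in, "example"));
        }
    }
}
```

Because the helper takes an InputStream rather than a URL, the same search can be run on any source of text, and reading stops at the first match, so large files are not necessarily fetched in full.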