Parsing the text of a PDF, txt, or docx file from a URL in Java 8 without downloading the file



I need to be able to parse the text contained in an online file at a given URL, e.g. http://website.com/document.pdf.


I am making a search engine that tells me whether the searched word appears in some online file and returns that file's URL, so I don't need to download the file, just read it.


I was looking for a way and found something using InputStream and openConnection, but I didn't manage to get it working.


I am using jsoup to crawl a website and retrieve the URLs, and I tried to parse the file with a Jsoup method, but it does not work.


So what is the best way to do this?


I want to be able to do something like this:

File in = new File("http://website.com/document.pdf");
Document doc = Jsoup.parse(in, "UTF-8");


You can use a URL instead of a File to access the document. Using Apache Tika, you should be able to get the content as a string this way:

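import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://website.com/document.pdf");
        ContentHandler contenthandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        PDFParser pdfparser = new PDFParser();
        // Stream the document straight from the URL; nothing is saved to disk.
        try (InputStream is = url.openStream()) {
            pdfparser.parse(is, contenthandler, metadata, new ParseContext());
        }
        // BodyContentHandler.toString() returns the extracted text.
        System.out.println(contenthandler.toString());
    }
}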
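
For the plain-text (.txt) case from the question, the JDK alone may be enough. The following is a minimal sketch, not a definitive implementation: the class name UrlTextSearch and the helpers containsWord and urlContainsWord are hypothetical names introduced here, the file is assumed to be UTF-8, and matching is a simple case-insensitive substring check done line by line. It reads the stream from openConnection() without writing anything to disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class UrlTextSearch {

    // Hypothetical helper: returns true if any line from the reader
    // contains the word (case-insensitive substring match).
    public static boolean containsWord(Reader source, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        try {
            BufferedReader reader = new BufferedReader(source);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                    return true; // stop reading as soon as a match is found
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: streams the resource behind the URL and searches it.
    // Assumption: the resource is plain text encoded as UTF-8.
    public static boolean urlContainsWord(URL url, String word) throws IOException {
        try (InputStream in = url.openConnection().getInputStream()) {
            return containsWord(new InputStreamReader(in, StandardCharsets.UTF_8), word);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: UrlTextSearch <url> <word>");
            return;
        }
        System.out.println(urlContainsWord(new URL(args[0]), args[1]));
    }
}
```

Because the search runs line by line, a word split across a line break is not found. PDF and docx files are binary formats, so they still need a parser such as Tika before any text search.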


