Using JSOUP to retrieve useful information from a web page

Problem description

How can I retrieve the "Contact us" link from any web page on the World Wide Web, taken from the footer part of the page, in Java? For example, by finding a footer element, an element with id="footer", or an element with a footer class.

I tried retrieving all the links from a web page using JSOUP and then running the regex .*contact.* on them. But I cannot be 100% sure that the link fetched this way is the website's "Contact us" page.

Q2: Is there any other robust approach, or could I combine the footer links with the approach I have already completed to conclude that a page is certainly a "Contact us" page?

Solution

"But I cannot be 100% sure on that the fetched link..."

SHORT ANSWER

You will NEVER be sure.

LONG ANSWER

For a given random HTML page, you want to find the "Contact Us" link. This kind of work is trivial for a human.
It represents a big challenge for a computer.

I can see some options in your case:

Option 1: Crowdsourcing

1. Fetch all the website URLs for which you want the "Contact Us" information.
2. Send them to a crowd service platform, asking real people to find the information for you (Rapidworkers.com, Crowdsource.com, Clickworker.com, Amazon Mechanical Turk, microworkers.com).
3. Check whether the platform offers an API.

+ Work done by humans
+ Dynamically adapts to unknown patterns
- Costs money
- People are bad at repetitive tasks

Option 2: AI (pattern searching)

1. Train an AI to extract the information.
2. Then throw your websites at it.

Have a look at Weka, for instance, or Java-ML.

+ Automated task
+ Can perform a repetitive task for a long time
- May take time to build a robust solution
- Risk of false positives or complete misses

Option 3: Use Jsoup

1. Carefully study the patterns of the websites you target.
2. Tell Jsoup to find the patterns you have detected.

This option is a never-ending task. You will always have to feed Jsoup with new patterns. I suggest having a monitoring system that tells you when a website escapes every known pattern.

+ Automated task
+ Can perform a repetitive task for a long time
- Takes time to study, discover, and add new patterns
- Risk of false positives or complete misses

Option 4: A mix of the three options above

You can have the three options working on the websites you target.

+ Reduces the chances of false positives or complete misses
+ More confident final result
- Takes time to study, discover, and add new patterns
- Costs money
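
As a rough illustration of Q2 and Option 3, the two signals from the question (the .*contact.* regex and the footer location) can be combined into a score instead of a yes/no answer. The sketch below is plain Java and makes several assumptions: the class `ContactLinkScorer`, the `Link` holder, and the weights are hypothetical, and the links are assumed to have been extracted beforehand, for example with Jsoup's `select` using a selector such as `footer a[href], #footer a[href], .footer a[href]`. It ranks candidates; it does not make the result certain.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

final class ContactLinkScorer {

    /** A link already extracted from a page (hypothetical holder, not a Jsoup type). */
    static final class Link {
        final String href;
        final String text;
        final boolean inFooter;

        Link(String href, String text, boolean inFooter) {
            this.href = href;
            this.text = text;
            this.inFooter = inFooter;
        }
    }

    // Same idea as the question's regex .*contact.*, made case-insensitive.
    private static final Pattern CONTACT =
            Pattern.compile("contact", Pattern.CASE_INSENSITIVE);

    /** Scores a link; the weights are arbitrary illustrative values, not tuned. */
    static int score(Link link) {
        int score = 0;
        if (CONTACT.matcher(link.text).find()) {
            score += 2; // visible link text mentions "contact"
        }
        if (CONTACT.matcher(link.href).find()) {
            score += 2; // URL mentions "contact"
        }
        if (score > 0 && link.inFooter) {
            score += 1; // footer position only strengthens an existing match
        }
        return score;
    }

    /** Returns the highest-scoring link, or empty when no known pattern matched. */
    static Optional<Link> bestCandidate(List<Link> links) {
        return links.stream()
                .filter(link -> score(link) > 0)
                .max(Comparator.comparingInt(ContactLinkScorer::score));
    }

    public static void main(String[] args) {
        List<Link> links = Arrays.asList(
                new Link("/about", "About", true),
                new Link("/contact-us", "Contact Us", true),
                new Link("/blog/contact-lenses", "Our new lenses", false));

        Optional<Link> best = bestCandidate(links);
        if (best.isPresent()) {
            System.out.println("Best candidate: " + best.get().href
                    + " (score " + score(best.get()) + ")");
        } else {
            // An empty result is the cue for the monitoring idea in Option 3.
            System.out.println("No known pattern matched; review this site manually.");
        }
    }
}
```

An empty result from `bestCandidate` is a natural hook for the monitoring suggested in Option 3: log the site and either add a new pattern or hand it to a human, as in Option 1.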