
Problem description


I am developing a project for which I want to scrape the contents of a website in the background and extract some limited content from it. For example, my page has "userid" and "password" fields; using those credentials I will access my mail account, scrape my inbox contents, and display them in my page.


I have done the above using JavaScript alone. But when I click the sign-in button, the URL of my page (http://localhost/web/Login.html) changes to the URL of the site I am scraping (http://mail.in.com/mails/inbox.php?nomail=....). I want to scrape the details without changing my URL.

Recommended answer


Definitely go with PHP Simple HTML DOM Parser. It's fast, easy, and super flexible. It basically loads an entire HTML page into an object, and then you can access any element from that object.


Like the example on the official site, to get all the images and links on the main Google page:

// Load the library (simple_html_dom.php is a separate download,
// not part of PHP itself)
require 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';
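To address the original question's login problem: because the scraping request is made by your server, not by the visitor's browser, the browser never navigates away from your own page. Below is a minimal sketch combining cURL (for the login POST) with the parser's str_get_html(). The login URL, form field names, and the td.subject selector are hypothetical placeholders; you would need to inspect the real mail site's login form and inbox markup to find the actual values.

```php
<?php
// Hypothetical login-and-scrape sketch -- URL, field names, and
// selector below are placeholders, not the mail site's real API.
require 'simple_html_dom.php';

// Submit the credentials server-side so the browser URL never changes
$ch = curl_init('http://mail.in.com/login.php');     // hypothetical login URL
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'userid'   => $_POST['userid'],                  // forwarded from your own form
    'password' => $_POST['password'],
]));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the page as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow the redirect to the inbox
curl_setopt($ch, CURLOPT_COOKIEJAR,  'cookies.txt'); // persist the session cookie
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
$page = curl_exec($ch);
curl_close($ch);

// Parse the returned HTML string in memory instead of fetching a URL
$html = str_get_html($page);
foreach ($html->find('td.subject') as $cell) {       // hypothetical selector
    echo $cell->plaintext . '<br>';
}
```

Since the POST happens inside your PHP script, the visitor stays on http://localhost/web/Login.html (for example, by submitting the form to this script via AJAX); only your server talks to the mail site.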

