I wrote a script in Python to fetch the different links leading to different articles from a web page. When I run the script, I get the links flawlessly. The problem I'm facing is that the article links span multiple pages, as there are too many of them to fit on a single page. When I click the next-page button, the information I can see in the developer tools shows that an ajax call is made via a POST request. Since no link is attached to that next-page button, I can't find any way to go to the next page and parse the links from there. I've tried a POST request with that formdata, but it doesn't seem to work. Where am I going wrong?

Link to the landing page containing the articles: https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D

This is the information I got from the Chrome dev tools when I clicked the next-page button:

GENERAL
=======================================================
Request URL: https://www.ncbi.nlm.nih.gov/pubmed/
Request Method: POST
Status Code: 200 OK
Remote Address: 130.14.29.110:443
Referrer Policy: origin-when-cross-origin

RESPONSE HEADERS
=======================================================
Cache-Control: private
Connection: Keep-Alive
Content-Encoding: gzip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: text/html; charset=UTF-8
Date: Fri, 29 Jun 2018 10:27:42 GMT
Keep-Alive: timeout=1, max=9
NCBI-PHID: 396E3400B36089610000000000C6005E.m_12.03.m_8
NCBI-SID: CE8C479DB3510951_0083SID
Referrer-Policy: origin-when-cross-origin
Server: Apache
Set-Cookie: ncbi_sid=CE8C479DB3510951_0083SID; domain=.nih.gov; path=/; expires=Sat, 29 Jun 2019 10:27:42 GMT
Set-Cookie: WebEnv=1Jqk9ZOlyZSMGjHikFxNDsJ_ObuK0OxHkidgMrx8vWy2g9zqu8wopb8_D9qXGsLJQ9mdylAaDMA_T-tvHJ40Sq_FODOo33__T-tAH%40CE8C479DB3510951_0083SID; domain=.nlm.nih.gov; path=/; expires=Fri, 29 Jun 2018 18:27:42 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-UA-Compatible: IE=Edge
X-XSS-Protection: 1; mode=block

REQUEST HEADERS
========================================================
Accept: text/html, */*; q=0.01
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Content-Length: 395
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: ncbi_sid=CE8C479DB3510951_0083SID; _ga=GA1.2.1222765292.1530204312; _gid=GA1.2.739858891.1530204312; _gat=1; WebEnv=18Kcapkr72VVldfGaODQIbB2bzuU50uUwU7wrUi-x-bNDgwH73vW0M9dVXA_JOyukBSscTE8Qmd1BmLAi2nDUz7DRBZpKj1wuA_QB%40CE8C479DB3510951_0083SID; starnext=MYGwlsDWB2CmAeAXAXAbgA4CdYDcDOsAhpsABZoCu0IA9oQCZxLJA===
Host: www.ncbi.nlm.nih.gov
NCBI-PHID: 396E3400B36089610000000000C6005E.m_12.03
Origin: https://www.ncbi.nlm.nih.gov
Referer: https://www.ncbi.nlm.nih.gov/pubmed
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
X-Requested-With: XMLHttpRequest

FORM DATA
========================================================
p$l: AjaxServer
portlets: id=relevancesortad:sort=;id=timelinead:blobid=NCID_1_120519284_130.14.22.215_9001_1530267709_1070655576_0MetA0_S_MegaStore_F_1:yr=:term=%222015%22%5BDate%20-%20Publication%5D%20%3A%20%223000%22%5BDate%20-%20Publication%5D;id=reldata:db=pubmed:querykey=1;id=searchdetails;id=recentactivity
load: yes

This is my script so far (the commented-out GET request works flawlessly, but only for the first page):
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"

# This GET request works, but only fetches the first page of results:
# res = requests.get(geturl, headers={"User-Agent": "Mozilla/5.0"})
# soup = BeautifulSoup(res.text, "lxml")
# for items in soup.select("div.rslt p.title a"):
#     print(items.get("href"))

# Attempt to replicate the ajax call captured in the dev tools:
FormData = {
    'p$l': 'AjaxServer',
    'portlets': 'id=relevancesortad:sort=;id=timelinead:blobid=NCID_1_120519284_130.14.22.215_9001_1530267709_1070655576_0MetA0_S_MegaStore_F_1:yr=:term=%222015%22%5BDate%20-%20Publication%5D%20%3A%20%223000%22%5BDate%20-%20Publication%5D;id=reldata:db=pubmed:querykey=1;id=searchdetails;id=recentactivity',
    'load': 'yes'
}

req = requests.post(posturl, data=FormData, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(req.text, "lxml")
for items in soup.select("div.rslt p.title a"):
    print(items.get("href"))

By the way, the URL in the browser changes to "https://www.ncbi.nlm.nih.gov/pubmed" when I click the next-page link.

I don't want any solution involving a browser simulator. Thanks in advance.

Best answer

The content is highly dynamic, so it would be best to use selenium or a similar client, but I realize that with such a large number of results that would be impractical. So, we have to analyze the HTTP requests submitted by the browser and simulate them with requests.

The content of the next page is loaded with a POST request to /pubmed, and the POST data consists of the input fields of the EntrezForm form. The form submission is controlled by js (triggered when the next-page button is clicked) and is performed with the .submit() method.
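Collecting those input fields is a one-liner with BeautifulSoup. A minimal sketch (assuming the form still carries the id EntrezForm, as it did at the time; the dict comprehension is the same one used in the full script further down):

    import requests
    from bs4 import BeautifulSoup

    geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"

    s = requests.session()
    s.headers["User-Agent"] = "Mozilla/5.0"
    soup = BeautifulSoup(s.get(geturl).text, "lxml")

    # Every named <input> of the form becomes a key/value pair of the POST
    # payload, mirroring what the js .submit() call sends.
    payload = {i['name']: i.get('value', '') for i in soup.select('form#EntrezForm input[name]')}
    print(len(payload), "fields collected")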

After some inspection, I found some interesting fields:

  • EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.CurrPage and EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.cPage indicate the current and the next page.
  • EntrezSystem2.PEntrez.DbConnector.Cmd seems to perform a database query. If we don't submit this field, the results won't change.
  • EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PageSize and EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PrevPageSize indicate the number of results per page.

With that information I was able to get multiple pages with the following script:
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
    posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"

    s = requests.session()
    s.headers["User-Agent"] = "Mozilla/5.0"

    # Load the first page and collect all named input fields of the EntrezForm.
    soup = BeautifulSoup(s.get(geturl).text, "lxml")
    inputs = {i['name']: i.get('value', '') for i in soup.select('form#EntrezForm input[name]')}

    # Total number of results, taken from the hidden ResultCount field,
    # then a ceiling division to get the number of pages.
    results = int(inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_ResultsController.ResultCount'])
    items_per_page = 100
    pages = results // items_per_page + int(bool(results % items_per_page))

    inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PageSize'] = items_per_page
    inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PrevPageSize'] = items_per_page
    inputs['EntrezSystem2.PEntrez.DbConnector.Cmd'] = 'PageChanged'

    links = []

    for page in range(pages):
        # CurrPage appears to be the page being requested, cPage the page we came from.
        inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.CurrPage'] = page + 1
        inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.cPage'] = page

        res = s.post(posturl, inputs)
        soup = BeautifulSoup(res.text, "lxml")

        items = [i['href'] for i in soup.select("div.rslt p.title a[href]")]
        links += items

        for i in items:
            print(i)
    

    I requested 100 items per page because higher numbers seem to "break" the server, but you should be able to adjust that number with some error checking, as in the sketch below.
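    A hypothetical sketch of that error checking, continuing the script above (the fallback sizes and the 'rslt' marker test are my assumptions, not part of the original answer):

    for size in (100, 50, 20):
        inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PageSize'] = size
        inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PrevPageSize'] = size
        res = s.post(posturl, inputs)
        # Fall back to a smaller page size until the server returns a result list.
        if res.ok and 'rslt' in res.text:
            items_per_page = size
            break
    # pages would then have to be recomputed with the working size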

    Finally, the links appear in descending order (/29960282, /29960281, ...), so I thought we could calculate the links without performing any POST requests:
    import requests
    from bs4 import BeautifulSoup

    geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
    posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"

    s = requests.session()
    s.headers["User-Agent"] = "Mozilla/5.0"
    soup = BeautifulSoup(s.get(geturl).text, "lxml")

    # Total number of results, taken from the hidden ResultCount input.
    results = int(soup.select_one('[name$=ResultCount]')['value'])
    # The id of the newest article is the last path segment of the first link.
    first_link = int(soup.select_one("div.rslt p.title a[href]")['href'].split('/')[-1])
    last_link = first_link - results

    # Build the urls by counting down from the first id.
    links = [posturl + str(i) for i in range(first_link, last_link, -1)]

    Unfortunately, however, the results are not accurate.
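    One way to gauge the inaccuracy, continuing the script above (the comparison itself is my addition, not part of the original answer), is to check how many of the links actually scraped from the first page appear among the computed ones:

    from urllib.parse import urljoin

    # Make the scraped hrefs absolute so they are comparable with the computed urls.
    scraped = {urljoin(posturl, i['href']) for i in soup.select("div.rslt p.title a[href]")}
    matched = scraped & set(links)
    print(len(matched), "of", len(scraped), "first-page links were predicted correctly")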

    Regarding "python - Unable to go to the next page using a post request", see the original question on Stack Overflow: https://stackoverflow.com/questions/51100224/
