This article explains how to handle being unable to scrape similar links at different depths of a webpage; the solution below should be a useful reference for anyone hitting the same problem.

Problem Description


I've created a script in Python to parse different links from a webpage. There are two sections on the landing page: one is Top Experiences and the other is More Experiences. My current attempt can fetch the links from both categories.

The type of links I want to collect (a few of them, at this moment) are under the Top Experiences section. However, when I traverse the links under the More Experiences section, I can see that they all lead to a page containing a section named Experiences, under which there are links similar to the ones under Top Experiences on the landing page. I want to grab them all.

One such desirable link I'm after looks like: https://www.airbnb.com/experiences/20712?source=seo.

website link

My current attempt fetches the links from both categories:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # Collect the anchors from both sections and resolve them to absolute URLs
    items = [urljoin(link, item.get("href")) for item in soup.select("div[style='margin-top:16px'] a._1f0v6pq")]
    return items

if __name__ == '__main__':
    for item in get_links(URL):
        print(item)

Please check the image if anything is unclear. I used a pen available in Paint, so the writing may be a little hard to read.

Solution

Process:

  1. Get all Top Experiences links

  2. Get all More Experiences links

  3. Send a request to each More Experiences link, one by one, and collect the links under Experiences on each page.

The div that contains the links is the same on every page: they all share the class _12kw8n71.

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from time import sleep
from random import randint

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

res = requests.get(URL)
soup = BeautifulSoup(res.text, "lxml")

# The first _12kw8n71 div holds Top Experiences; the second holds More Experiences
top_experiences = [urljoin(URL, item.get("href")) for item in soup.find_all("div", class_="_12kw8n71")[0].find_all('a')]
more_experiences = [urljoin(URL, item.get("href")) for item in soup.find_all("div", class_="_12kw8n71")[1].find_all('a')]

generated_experiences = []
# Visit each More Experiences link and scrape its Experiences section
for url in more_experiences:
    sleep(randint(1, 10))  # random delay to avoid getting blocked
    page_soup = BeautifulSoup(requests.get(url).text, "lxml")  # fetch each linked page before extracting its links
    generated_experiences.extend([urljoin(url, item.get("href")) for item in page_soup.find_all("div", class_="_12kw8n71")[0].find_all('a')])
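
Because this loop fires off roughly 300 requests, a transient network error or a missing section on any one page would crash it. Below is a minimal hardening sketch; the fetch_section_links helper is hypothetical (not part of the original answer) and assumes the same _12kw8n71 class marks the link container on every generated page.

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def fetch_section_links(url, class_name="_12kw8n71", section_index=0):
    # Hypothetical helper: fetch one page and return the absolute links in
    # the requested section, or an empty list on failure
    try:
        res = requests.get(url, timeout=10)
        res.raise_for_status()
    except requests.RequestException:
        return []
    soup = BeautifulSoup(res.text, "lxml")
    sections = soup.find_all("div", class_=class_name)
    if len(sections) <= section_index:
        return []
    return [urljoin(url, a.get("href")) for a in sections[section_index].find_all('a')]

With this helper, the loop body reduces to generated_experiences.extend(fetch_section_links(url)).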

Notes:

  1. Your required links will be present in three lists: top_experiences, more_experiences, and generated_experiences (see the sketch after these notes for combining them).

  2. I have added a random delay to avoid getting blocked.

  3. Not printing the lists here, as they would be too long:

    top_experiences - 50 links

    more_experiences - 299 links

    generated_experiences - 14950 links
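
If you only need the experience-detail links themselves (the ones shaped like https://www.airbnb.com/experiences/20712?source=seo from the question), a minimal sketch for merging and filtering the three lists follows; the regular expression is an assumption based on that one sample URL.

import re

# Merge the three result lists, drop duplicates, and keep only links that
# look like /experiences/<numeric id>
all_links = set(top_experiences + more_experiences + generated_experiences)
experience_links = sorted(link for link in all_links if re.search(r"/experiences/\d+", link))
print(len(experience_links))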

That concludes this article on scraping similar links at different depths from a webpage. Hopefully the answer above proves helpful.
