I've created a script in Python to parse different links from a webpage. There are two sections on the landing page: one is Top Experiences and the other is More Experiences. My current attempt can fetch the links from both categories.
At the moment, the type of links I want to collect (a few of them) are under the Top Experiences section. However, when I traverse the links under the More Experiences section, I can see that they all lead to a page containing a section named Experiences, under which there are links similar to the ones under Top Experiences on the landing page. I want to grab them all.
One such desirable link I'm after looks like: https://www.airbnb.com/experiences/20712?source=seo.
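If it helps to pin down which of the collected links are "desirable", here is a minimal sketch of a filter that keeps only URLs whose path looks like /experiences/<numeric id>. The helper name is_experience_link and the regex are my own illustration, not part of the original script:

```python
import re
from urllib.parse import urlparse

# Matches paths like /experiences/20712 (illustrative pattern, not from the question)
EXPERIENCE_PATH = re.compile(r"^/experiences/\d+$")

def is_experience_link(url):
    """Return True if the URL points at an individual experience page."""
    return bool(EXPERIENCE_PATH.match(urlparse(url).path))

print(is_experience_link("https://www.airbnb.com/experiences/20712?source=seo"))   # True
print(is_experience_link("https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"))  # False
```

Query strings like ?source=seo are ignored because only the path component is matched.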
My current attempt fetches the links from both categories:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

def get_links(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    items = [urljoin(link, item.get("href")) for item in soup.select("div[style='margin-top:16px'] a._1f0v6pq")]
    return items

if __name__ == '__main__':
    for item in get_links(URL):
        print(item)
Please check out the image if anything is unclear. I used the pen tool available in Paint, so the writing may be a little hard to read.
Process:
1. Get all Top Experiences links.
2. Get all More Experiences links.
3. Send a request to each More Experiences link, one by one, and get the links under Experiences on each page.
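The link-extraction part of the steps above can be sketched as a small helper that pulls the links out of one section of a page. The function name section_links and the inline HTML snippet are my own illustration; the real pages use the class _12kw8n71, and html.parser is used here so the sketch runs without lxml:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def section_links(html, base_url, section_index, class_name="_12kw8n71"):
    """Return absolute links under the nth div carrying the given class."""
    soup = BeautifulSoup(html, "html.parser")
    sections = soup.find_all("div", class_=class_name)
    if section_index >= len(sections):
        return []  # the page does not have that many sections
    return [urljoin(base_url, a.get("href")) for a in sections[section_index].find_all("a")]

# Tiny hand-written snippet standing in for the real landing page.
sample = """
<div class="_12kw8n71"><a href="/experiences/1">Top</a></div>
<div class="_12kw8n71"><a href="/s/foo">More</a></div>
"""
print(section_links(sample, "https://www.airbnb.com", 0))  # ['https://www.airbnb.com/experiences/1']
```

Index 0 would then correspond to Top Experiences (or Experiences on the inner pages) and index 1 to More Experiences.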
The div under which the links are present is the same on all the pages: it has the class _12kw8n71.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from time import sleep
from random import randint

URL = "https://www.airbnb.com/sitemaps/v2/experiences_pdp-L0-0"

res = requests.get(URL)
soup = BeautifulSoup(res.text, "lxml")
top_experiences = [urljoin(URL, item.get("href")) for item in soup.find_all("div", class_="_12kw8n71")[0].find_all('a')]
more_experiences = [urljoin(URL, item.get("href")) for item in soup.find_all("div", class_="_12kw8n71")[1].find_all('a')]

generated_experiences = []
# visit each link in more_experiences
for url in more_experiences:
    sleep(randint(1, 10))  # avoid getting blocked by putting in some delay
    page = BeautifulSoup(requests.get(url).text, "lxml")  # parse the linked page, not the landing page
    generated_experiences.extend([urljoin(url, item.get("href")) for item in page.find_all("div", class_="_12kw8n71")[0].find_all('a')])
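Since the same experience can be linked from several More Experiences pages, the combined result may contain repeats. A minimal sketch of an order-preserving deduplication pass (the dedupe helper is my own addition, not part of the answer's code):

```python
def dedupe(links):
    """Drop duplicate links while preserving first-seen order."""
    seen = set()
    unique = []
    for link in links:
        if link not in seen:
            seen.add(link)
            unique.append(link)
    return unique

print(dedupe(["a", "b", "a", "c", "b"]))  # ['a', 'b', 'c']
```

This could be applied to generated_experiences (or to all three lists concatenated) before any further processing.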
Notes:
Your required links will be present in three lists: top_experiences, more_experiences and generated_experiences.
I have added a random delay to avoid getting blocked.
Not printing the lists as they would be too long.
top_experiences - 50 links
more_experiences - 299 links
generated_experiences - 14950 links