Problem Description
I am trying to scrape data (instaid, average likes, average comments) from a url inside this webpage: https://starngage.com/app/global/influencer/ranking/india
The element id of the url is: @priyankachopra
Similarly, I want to scrape data from all 1000 profiles in the same table. Can anyone tell me how to do this? Here is what I have so far:
import requests
from bs4 import BeautifulSoup
from prettytable import PrettyTable

tb = PrettyTable(['Name', 'Insta_ID', 'Followers'])
url = 'https://starngage.com/app/global/influencer/ranking/india'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
table = soup.find('table', class_='table-responsive-sm')
rows = table.find_all('tr')
for row in rows[1:]:  # skip the header row
    temp = row.select_one("td:nth-of-type(3)").text   # cell looks like "Name @insta_id"
    name, insta_id = temp.split('@')
    followers = row.select_one("td:nth-of-type(6)").text
    tb.add_row([name.strip(), insta_id.strip(), followers.strip()])
print(tb)
Recommended Answer
You can do this. I haven't tested the complete code end to end, since it takes quite a while to run (possibly up to 10 minutes), but I have tested it piece by piece and it works fine for me. If it doesn't work, ask me in a comment. Here's the code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

ids = []
avgl = []
avgc = []

# The ranking table is spread over 100 pages; collect every insta id first.
for i in range(1, 101):
    url = f'https://starngage.com/app/global/influencer/ranking/india?page={i}'
    print(url)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', class_='table-responsive-sm')
    trs = table.find_all('tr')
    for tr in trs[1:]:  # skip the header row
        temp = tr.select_one("td:nth-of-type(3)").text   # cell looks like "Name @insta_id"
        _, insta_id = temp.split('@')
        ids.append(insta_id.strip())

# Visit each profile page and pull the averages out of the summary blockquote.
for insta_id in ids:
    page = requests.get("https://starngage.com/app/global/influencers/" + insta_id)
    soup = BeautifulSoup(page.content, 'lxml')
    x = soup.find("blockquote").find("p").text.strip()
    # You can change this re code. I am not very familiar with re,
    # so if you find a better approach, please comment.
    x = re.findall(r"is \d+", x)
    avl, avc = list(map(lambda y: y.replace("is ", ""), x))
    avgl.append(avl)
    avgc.append(avc)

df = pd.DataFrame({"Insta Id": ids, "Average Likes": avgl, "Average Comments": avgc})
print(df)
df.to_csv("test.csv")
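As the comment in the code admits, the regex is fairly brittle: it grabs every number preceded by the word "is" and assumes the first is the average likes and the second the average comments. A slightly more defensive sketch is below; it assumes the blockquote contains phrases like "average likes is 1,234" (a hypothetical wording inferred from the original regex, not verified against the live page), keys each number to its label, and tolerates thousands separators:

import re

def parse_engagement(text):
    """Pull average likes/comments out of a profile blurb.

    Assumes phrases like "average likes is 12,345" appear in the text
    (an assumed format, not verified against the live page); returns
    None for any value that cannot be found.
    """
    def grab(label):
        m = re.search(rf"average {label}\s+is\s+([\d,]+)", text, re.IGNORECASE)
        return m.group(1).replace(",", "") if m else None

    return grab("likes"), grab("comments")

# Example with a made-up blurb in the assumed format:
likes, comments = parse_engagement(
    "Her average likes is 1,234,567 and average comments is 8,910 per post."
)
print(likes, comments)  # 1234567 8910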