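I'm trying to learn HTML scraping for a project, using Python and lxml. So far I've managed to get the data I need, but I've run into another problem. The site I'm scraping (op.gg) adds new tables with more information as you scroll down. When I run my script (below), it only gets the first 50 entries and nothing more. My question is: how can I get at least the first 200 names on the page, if that is possible at all?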
from lxml import html
import requests
page = requests.get('https://na.op.gg/ranking/ladder/')
tree = html.fromstring(page.content)
names = tree.xpath('//td[@class="SummonerName Cell"]/a/text()')
print (names)
Best answer
Borrowing Pedro's idea, https://na.op.gg/ranking/ajax2/ladders/start=number gives you 50 records starting from number. For example:
https://na.op.gg/ranking/ajax2/ladders/start=0 returns records 1-50,
https://na.op.gg/ranking/ajax2/ladders/start=50 returns records 51-100,
https://na.op.gg/ranking/ajax2/ladders/start=100 returns records 101-150,
https://na.op.gg/ranking/ajax2/ladders/start=150 returns records 151-200,
and so on.
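The offset pattern above can be sketched as a small helper that builds the list of URLs for any number of records. The names `BASE_URL`, `PAGE_SIZE`, and `ladder_urls` are made up for this illustration, and the page size of 50 is taken from the behaviour described above, so treat it as an assumption that may no longer hold for the site.

```python
# Endpoint prefix quoted in the answer; the site may have changed since.
BASE_URL = 'https://na.op.gg/ranking/ajax2/ladders/start='
# Assumed from the answer: each request returns 50 records.
PAGE_SIZE = 50


def ladder_urls(total):
    """Return the URLs needed to cover the first `total` records."""
    # Round up so that, for example, total=120 still requests three pages.
    pages = (total + PAGE_SIZE - 1) // PAGE_SIZE
    return [BASE_URL + str(page * PAGE_SIZE) for page in range(pages)]


# Print the four URLs that cover records 1-200, one per line.
for url in ladder_urls(200):
    print(url)
```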
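After that, you need to change the scraping code, because the markup of this page differs from the original page. Assuming you want the first 200 names, here is the modified code: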
from lxml import html
import requests

start_url = 'https://na.op.gg/ranking/ajax2/ladders/start='
names_200 = []
# Each request returns 50 records, so four offsets cover the first 200.
for i in [0, 50, 100, 150]:
    dest_url = start_url + str(i)
    page = requests.get(dest_url)
    tree = html.fromstring(page.content)
    # The AJAX fragment has different markup from the full page, hence the new XPath.
    names_50 = tree.xpath('//a[not(@target) and not(@onclick)]/text()')
    names_200.extend(names_50)
print(names_200)
print(len(names_200))
Output:
[u'am\xc3\xa9liorer', 'pireaNn', 'C9 Ray', 'P1 Pirean', 'Pobelter', 'mulgokizary', 'consensual clown', 'Jue VioIe Grace', 'Deep Learning', 'Keegun', 'Free Papa Chau', 'C9 Gun', 'Dhokla', 'Arrowlol', 'FOX Brandini', 'Jurassiq', 'Win or Learn', 'Acoldblazeolive', u'R\xc3\xa9venge', u'M\xc3\xa9ru', 'Imaqtpie', 'Rohammers', 'blaberfish2', 'qldurtms', u'd\xc3\xa0wolfsclaw', 'TheOddOrange', 'PandaTv 656826', 'stuntopolis', 'Butler Delta', 'P1 Shady', 'Entranced', u'Linsan\xc3\xadty', 'Ablazeolive', 'BukZacH', 'Anivia Kid', 'Contractz', 'Eitori', 'MistyStumpey', 'Prodedgy', 'Splitting', u'S\xc4\x99b B\xc4\x99rnal', 'N For New York', 'Naeun', '5tunt', 'C9 Winter', 'Doubtfull', 'MikeYeung', 'Rikara', u'RAH\xc3\x9cLK', ' Sudzzi', 'joong ki song', 'xWeixin VinLeous', 'rhubarbs', u'Ch\xc3\xa0se', 'XueGao', 'Erry', 'C9 EonYoung', 'Yeonbee', 'M ckg', u'Ari\xc3\xa1na Lovato', 'OmarGod', 'Wiggily', 'lmpactful', 'Str1fe', 'LL Stylish', '2017', 'FlREFLY', 'God Fist Monk', 'rWeiXin VinLeous', 'Grigne', 'fantastic ad', 'bobqinX', 'grigne 1v10', 'Sora1', 'Juuichi san ', 'duoking2', 'SandPaperX', 'Xinthus', 'TwichTv CoMMa', 'xFSN Rin', 'UBC CJ', 'PotIuck', 'DarkWingsForSale', 'Get After lt', 'old chicken', u'\xc4\x86ris', 'VK Deemo', 'Pekin Woof', 'YIlIlIlIlI', 'RiceLegend', 'Chimonaa1', 'DJNDREE5', u'CloudNguy\xc3\xa9n', 'Diamond 1 Khazix', 'dawolfsfang', 'clg imaqtpie69', 'Pyrites', 'Lava', 'Rathma', 'PieCakeLord', 'feed l0rd', 'Eygon', 'Autolycus1', 'FateFalls 20xx', 'nIsHIlEzHIlA', 'C9 Sword', 'TET Fear', 'a very bad time', u'Jur\xc3\xa1ssiq', 'Ginormous Noob', 'Saskioo', 'S D 2 NA', 'C9 Smoothie', 'dufTlalgkqtlek', 'Pants are Dragon', u'H\xc3\xb3llywood', 'Serenitty', 'Waggily ', 'never lucky help', u'insan\xc3\xadty', 'Joyul', 'TheeBrandini', 'FoTheWin', 'RyuShoryu', 'avi is me', 'iKingVex', 'PrismaI', 'An Obese Panda', 'TdollasAKATmoney', 'feud999', 'Soligo', 'Steel I', 'SNH48 Ruri', 'BillyBoss1', 'Annie Bot', 'Descraton', 'Cris', 'GrayHoves', 'RegisZZ', 'lron Pyrite', 'Zaion', 
'Allorim', 't d', u'Alex \xc3\xafch', 'godrjsdnd', 'DOUBLELIFTSUCKS', 'John Mcrae', u'Lobo Solitari\xc3\xb3', 'MikeYeunglol', 'i xo u', 'NoahMost', 'Vsionz', 'GladeGleamBright', 'Tuesdayy', 'RealDarkness', 'CC Dean', 'na mid xd LFT', 'Piggy Kitten', 'Abou222', 'TG Strompest', 'MooseHater', 'Day after Day', 'bat8man', 'AxAxAxAxA', 'Boyfriend', 'EvanRL', '63FYWJMbam', 'Fiftygbl', u'Br\xc4\xb1an', 'MlST', u'S\xc3\xb8ren Bjerg', 'FOX Akaadian', '5word', 'tchikou', 'Hakuho', 'Noobkiller291', 'woxiangwanAD', 'Doublelift', 'Jlaol', u'z\xc3\xa3ts', 'Cow Goes Mooooo', u'Be Like \xc3\x91e\xc3\xb8\xc3\xb8', 'Liquid Painless', 'Zergy', 'Huge Rooster', 'Shiphtur', 'Nikkone', 'wiggily1', 'Dylaran', u'C\xc3\xa0m', 'byulbit', 'dirtybirdy82', 'FreeXpHere', u'V\xc2\xb5lcan', 'KaNKl', 'LCS Actor 4', 'bie sha wo', 'Mookiez', 'BKSMOOTH', 'FatMiku']
200
By the way, you can extend this to fit your needs.
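One possible way to extend it, as a rough sketch rather than a tested scraper: keep requesting pages until enough names are collected or a page comes back empty. The function names `collect_names` and `fetch_opgg_page` are invented for this example, and the endpoint, the page size of 50, and the XPath are all assumptions carried over from the answer's code.

```python
def collect_names(total, fetch_page, page_size=50):
    """Collect up to `total` names by calling fetch_page(offset) repeatedly."""
    names = []
    offset = 0
    while len(names) < total:
        page = fetch_page(offset)
        if not page:  # an empty page is taken to mean there are no more rows
            break
        names.extend(page)
        offset += page_size
    return names[:total]


def fetch_opgg_page(offset):
    """Fetch one chunk of names from the endpoint used in the answer."""
    # Third-party imports are kept local so collect_names() can be
    # exercised with a fake fetch function and no network access.
    import requests
    from lxml import html

    url = 'https://na.op.gg/ranking/ajax2/ladders/start=' + str(offset)
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # fail loudly on HTTP errors
    tree = html.fromstring(page.content)
    return tree.xpath('//a[not(@target) and not(@onclick)]/text()')
```

Calling `collect_names(300, fetch_opgg_page)` would then request offsets 0, 50, 100, and so on, stopping once 300 names have been gathered or the site returns no more rows.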
For more on "python - scraping HTML data from a page that adds new tables on scroll", see the similar question on Stack Overflow: https://stackoverflow.com/questions/43699975/