python - Scraping a page that is split into multiple parts with Python

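I want to scrape this site to get the complete list of teammates. I know how to do it for the first page with BeautifulSoup, but the results are split across many pages. Is there a way to scrape all of the parts?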

Thanks!

Best answer

https://www.transfermarkt.co.uk/yvon-mvogo/profil/spieler/147051

https://www.transfermarkt.co.uk/steve-von-bergen/profil/spieler/4793

https://www.transfermarkt.co.uk/scott-sutter/profil/spieler/34520

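The links above are examples of player profile links. You can load the page into BeautifulSoup and parse it to get every link it contains. Then write a regular expression that keeps only the links matching the pattern above, and write another function that extracts the information you need from each profile page.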

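import re
from bs4 import BeautifulSoup

# html_page must already hold the HTML of the teammates page
soup = BeautifulSoup(html_page, 'html.parser')
for a in soup.find_all('a', href=True):
    # keep only hrefs that look like /<player-name>/profil/spieler/<id>
    m = re.search(r'/[a-z\-]+/profil/spieler/[0-9]+', a['href'])
    if m:
        found = m.group(0)
        print(found)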

Output

  /michael-frey/profil/spieler/147043
  /yvon-mvogo/profil/spieler/147051
  /scott-sutter/profil/spieler/34520
  /leonardo-bertone/profil/spieler/194975
  /steve-von-bergen/profil/spieler/4793
  /alain-nef/profil/spieler/4945
  /raphael-nuzzolo/profil/spieler/32574
  /marco-wolfli/profil/spieler/4860
  /moreno-costanzo/profil/spieler/41207
  /jan-lecjaks/profil/spieler/62854
  /alain-rochat/profil/spieler/4843
  /christoph-spycher/profil/spieler/2871
  /gonzalo-zarate/profil/spieler/52731
  /christian-schneuwly/profil/spieler/52556
  /yuya-kubo/profil/spieler/186260
  /alexander-farnerud/profil/spieler/10255
  /salim-khelifi/profil/spieler/147049
  /alexander-gerndt/profil/spieler/45881
  /adrian-winter/profil/spieler/59681
  /victor-palsson/profil/spieler/97241
  /milan-gajic/profil/spieler/46928
  /dusan-veskovac/profil/spieler/28705
  /marco-burki/profil/spieler/172192
  /elsad-zverotic/profil/spieler/25542
  /pa-modou/profil/spieler/66449
  /yoric-ravet/profil/spieler/82461

You can then loop over all of the links and call a function that extracts the information you need from each profile page. Hope this helps.
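The loop over the links could look roughly like the sketch below. It is not part of the original answer: the names `to_profile_urls`, `scrape_profiles` and `extract_profile` are hypothetical, and `extract_profile` stands for whatever function you write to download and parse one profile page. The sketch only covers the part that is clear from the output above, namely that the matched paths are relative and need the site's base URL in front of them, and it also drops repeated paths in case a player is linked more than once.

```python
from urllib.parse import urljoin

# Base URL taken from the profile links quoted in the answer
BASE_URL = "https://www.transfermarkt.co.uk"


def to_profile_urls(paths):
    """Deduplicate relative profile paths (keeping order) and make them absolute."""
    unique_paths = list(dict.fromkeys(paths))
    return [urljoin(BASE_URL, path) for path in unique_paths]


def scrape_profiles(paths, extract_profile):
    """Call extract_profile(url) once for every unique profile URL.

    extract_profile is a hypothetical callable supplied by you; it takes a
    URL and returns whatever data you want to keep for that player.
    """
    return {url: extract_profile(url) for url in to_profile_urls(paths)}
```

To use it, collect the `found` values from the regex loop into a list instead of printing them, then pass that list together with your own extraction function to `scrape_profiles`.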

Use this link. I found it by inspecting the button.

https://www.transfermarkt.co.uk/michael-frey/gemeinsameSpiele/spieler/147043/ajax/yw2/page/1

You can change the number at the end to get each page.
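Stepping through the page numbers could be sketched as follows. Several things here are assumptions rather than facts from the answers: the helper names `page_url` and `collect_teammates` are made up, the `max_pages` cap is arbitrary, and the stop condition (quit when a page yields no profile link that has not been seen before) is a guess, since nothing above says how the site responds past the last page. For brevity the sketch runs the profile regex over the raw HTML text instead of going through BeautifulSoup.

```python
import re

# URL pattern from the answer, with the trailing page number made a placeholder
PAGE_URL = ("https://www.transfermarkt.co.uk/michael-frey/gemeinsameSpiele"
            "/spieler/147043/ajax/yw2/page/{page}")
PROFILE_RE = re.compile(r"/[a-z\-]+/profil/spieler/[0-9]+")


def page_url(page):
    """Build the URL of one result page."""
    return PAGE_URL.format(page=page)


def collect_teammates(fetch, max_pages=50):
    """Visit page 1, 2, 3, ... and gather unique profile paths.

    fetch is any callable that takes a URL and returns the page HTML as a
    string. The loop stops at the first page that adds no new path, or
    after max_pages pages (an arbitrary safety cap).
    """
    found = []
    for page in range(1, max_pages + 1):
        html = fetch(page_url(page))
        # unique matches on this page, in order, minus those already collected
        new = [p for p in dict.fromkeys(PROFILE_RE.findall(html)) if p not in found]
        if not new:
            break
        found.extend(new)
    return found
```

With the requests library, `fetch` could be something like `lambda url: requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text`. The browser-like User-Agent header is a precaution, on the assumption that the site may refuse requests that do not send one.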