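I'm new to web scraping and am having trouble figuring out how to handle a problem: the website I'm scraping works with half of my code but not with the other half.
I'm using the code below to scrape data from mmadecisions.com. I successfully pull the first page of links and then open the pages those links point to, but when I reach the third level I get an error. Is it JavaScript? That seems odd, because when I pass the href link directly to the "get_single_item_data" function, it works perfectly. Does that mean I should use Selenium? Is the website blocking me? If so, why does half of the scraping work (for http://mmadecisions.com/decisions-by-event/2013/ and http://mmadecisions.com/decision/4801/John-Maguire-vs-Phil-Mulpeter)? As you can see in the output below, I print the href link before reaching the third level: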

import requests
from bs4 import BeautifulSoup
import time

my_headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"}

def ufc_spider(max_pages):
    page = 2013
    while page <= max_pages:
        url = 'http://mmadecisions.com/decisions-by-event/'+str(page)+'/'
        print(url)
        source_code = requests.get(url, headers=my_headers)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        data = soup.findAll('table',{'width':'100%'})[2]
        for link in data.findAll('a', href=True):
            href = 'http://mmadecisions.com/' + str(link.get('href'))
            source_code = requests.get(href, "html.parser")
            plain_text = source_code.text
            soup2 = BeautifulSoup(plain_text, "html.parser")
            tmp = []
            other = soup2.findAll('table',{'width':'100%'})[1]
            for con in other.findAll('td', {'class':'list2'}):
                CON = con.a
                ahref = 'http://mmadecisions.com/' + str(CON.get('href'))
                print(ahref)
                time.sleep(5)
                get_single_item_data(ahref)
        page += 1

def get_single_item_data(item_url):
    tmp = []
    source_code = requests.get(item_url, headers=my_headers)
    time.sleep(10)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    print(soup)

ufc_spider(2017)


Here is the output. I can get the site's URLs, but it won't let me fetch the data from the second URL:

http://mmadecisions.com/decisions-by-event/2013/
http://mmadecisions.com/decision/4801/John-Maguire-vs-Phil-Mulpeter

<html><head><title>Apache Tomcat/7.0.68 (Ubuntu) - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 404 - /decision/4801/John-Maguire-vs-Phil-Mulpeter%0D%0A</h1><hr noshade="noshade" size="1"/><p><b>type</b> Status report</p><p><b>message</b> <u>/decision/4801/John-Maguire-vs-Phil-Mulpeter%0D%0A</u></p><p><b>description</b> <u>The requested resource is not available.</u></p><hr noshade="noshade" size="1"/><h3>Apache Tomcat/7.0.68 (Ubuntu)</h3></body></html>
http://mmadecisions.com/decision/4793/Amanda-English-vs-Slavka-Vitaly

<html><head><title>Apache Tomcat/7.0.68 (Ubuntu) - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 404 - /decision/4793/Amanda-English-vs-Slavka-Vitaly%0D%0A</h1><hr noshade="noshade" size="1"/><p><b>type</b> Status report</p><p><b>message</b> <u>/decision/4793/Amanda-English-vs-Slavka-Vitaly%0D%0A</u></p><p><b>description</b> <u>The requested resource is not available.</u></p><hr noshade="noshade" size="1"/><h3>Apache Tomcat/7.0.68 (Ubuntu)</h3></body></html>
http://mmadecisions.com/decision/4792/Chris-Boujard-vs-Peter-Queally
......


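I tried changing the User-Agent header, adding time delays, and running the code through a VPN. None of it worked; every attempt gave the same output.
Please help!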

Best answer

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://mmadecisions.com/"

# Step 1: collect the event-page links listed under each year.
links = []
for year in range(2013, 2020):
    print(f"{'-'*30}Extracting Year# {year}{'-'*30}")
    r = requests.get(f"{BASE_URL}decisions-by-event/{year}/")
    soup = BeautifulSoup(r.text, 'html.parser')
    for a in soup.findAll('a', {'href': True}):
        href = a.get('href')
        if href.startswith('event'):
            print(f"{BASE_URL}{href}")
            links.append(f"{BASE_URL}{href}")

print("\nNow Fetching all urls inside Years..\n")

# Step 2: visit each event page and print its decision links.
for link in links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    for a in soup.findAll('a', {'href': True}):
        href = a.get('href')
        if href.startswith('decision/'):
            # strip() drops trailing whitespace such as the "\r\n" that shows up
            # as %0D%0A in the 404 paths of the question's output.
            print(f"{BASE_URL}{href}".strip())
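
The 404 pages in the question's output request paths ending in %0D%0A, which is the percent-encoded form of a carriage return followed by a line feed. That suggests the href values on the event pages end with a line break, and it would explain why stripping the URL before requesting it avoids the error. The following is a minimal offline sketch of that reasoning; raw_href is a made-up value modelled on the question's output, and build_url is a hypothetical helper, not part of the original answer.

```python
from urllib.parse import quote, unquote

# Hypothetical href value, modelled on the 404 path in the question's output.
raw_href = "decision/4801/John-Maguire-vs-Phil-Mulpeter\r\n"


def build_url(href, base="http://mmadecisions.com/"):
    """Join a relative href onto the base URL after removing surrounding whitespace."""
    return base + href.strip()


# Percent-encoding the unstripped value reproduces the suffix seen in the error page.
encoded = quote(raw_href)
print(encoded)

# Decoding the suffix shows which characters it stands for.
print(repr(unquote("%0D%0A")))

# Stripping first yields a URL without the stray line break.
print(build_url(raw_href))
```

Printing repr(href) while debugging is a quick way to spot this kind of invisible trailing character, since a plain print hides it.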



Note that you can also use the following:

for item in soup.findAll('td', {'class': 'list'}):
    for an in item.findAll('a'):
        print(an.get('href'))




for item in soup.findAll('td', {'class': 'list2'}):
    for an in item.findAll('a'):
        print(an.get('href').strip())
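
The td lookup above can be combined with urllib.parse.urljoin so that each link is both cleaned and made absolute. What follows is a minimal sketch run against a made-up HTML fragment, so it needs no network access. The markup in SAMPLE_HTML and the extract_decision_links name are assumptions for illustration and do not reproduce the site's real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://mmadecisions.com/"

# Hypothetical fragment imitating table cells whose hrefs carry a trailing CRLF.
SAMPLE_HTML = (
    '<table><tr>'
    '<td class="list2"><a href="decision/4801/John-Maguire-vs-Phil-Mulpeter\r\n">A</a></td>'
    '<td class="list2"><a href="decision/4793/Amanda-English-vs-Slavka-Vitaly\r\n">B</a></td>'
    '<td class="other"><a href="event/1/ignored">C</a></td>'
    '</tr></table>'
)


def extract_decision_links(html, base_url=BASE_URL):
    """Return absolute, whitespace-free URLs for links inside td.list2 cells."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for cell in soup.find_all("td", {"class": "list2"}):
        for anchor in cell.find_all("a", href=True):
            # Strip before joining so no line break ends up in the request path.
            urls.append(urljoin(base_url, anchor["href"].strip()))
    return urls


decision_urls = extract_decision_links(SAMPLE_HTML)
for url in decision_urls:
    print(url)
```

Keeping the parsing in a function that takes an HTML string makes it easy to check against a saved page before pointing it at live responses.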

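Regarding "python - Website works with half of my web-scraping code but gives an error message for the other half", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59331941/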

10-12 20:07