Web scraping with Python and BeautifulSoup: what does the BeautifulSoup function save?

Question

This question follows this previous question. I want to scrape data from a betting site using Python. I first tried to follow this tutorial, but the site it uses, tipico, is not available from Switzerland, so I chose another betting site: Winamax. In the tutorial, the tipico webpage is first inspected to find where the betting rates are located in the HTML. On the tipico page, they were stored in buttons of class "c_but_base c_but". With the following lines, the rates could therefore be retrieved and printed using the BeautifulSoup module:

from bs4 import BeautifulSoup
import urllib.request
import re

url = "https://www.tipico.de/de/live-wetten/"

try:
    page = urllib.request.urlopen(url)
except Exception as e:
    print(f"An error occurred: {e}")

soup = BeautifulSoup(page, 'html.parser')

regex = re.compile('c_but_base c_but')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)

I thus tried to do the same with the webpage Winamax. I inspected the page and found that the betting rates were stored in buttons of class "ui-touchlink-needsclick price odd-price". See the code below:

from bs4 import BeautifulSoup
import urllib.request
import re

url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"

try:
    page = urllib.request.urlopen(url)
except Exception as e:
    print(f"An error occurred: {e}")

soup = BeautifulSoup(page, 'html.parser')

regex = re.compile('ui-touchlink-needsclick price odd-price')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)

The problem is that it prints nothing: Python does not find any element of that class (right?). I therefore printed the soup object to see what exactly the BeautifulSoup function was doing, by adding this line:

print(soup)

When printing it (I do not show the output here because it is too long), I noticed that the text is not the same as what appears when I right-click and choose "Inspect" on the Winamax webpage. So what exactly is the BeautifulSoup function doing? How can I store the betting rates from the Winamax website using BeautifulSoup?

EDIT: I have never coded in HTML and I'm a beginner in Python, so some terminology might be wrong; that's why some parts are in italics.

Solution

That's because the website uses JavaScript to display these details, and BeautifulSoup does not execute JavaScript on its own: it only parses the HTML it is given.

First, check whether the element you want to scrape is present in the page source. If it is, you can scrape pretty much everything! In your case, the button/span tags were not in the page source (meaning they are either hidden or pulled in through a script).

There is no <button> tag in the page source.
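
The same check can be run from Python instead of the browser's source view. The sketch below is only an illustration: `STATIC_HTML` is a made-up stand-in for the raw source of a JavaScript-rendered page (it is not Winamax's real markup), and `count_tags` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the raw page source of a JavaScript-rendered site:
# the markup holds an empty container and a script, but no <button> elements.
STATIC_HTML = """
<html>
  <body>
    <div id="app"></div>
    <script>
      // In a browser, this script would create the odds button at runtime.
      var b = document.createElement("button");
      b.className = "odd-price";
      b.textContent = "1,85";
      document.getElementById("app").appendChild(b);
    </script>
  </body>
</html>
"""


def count_tags(html, tag_name):
    """Return how many <tag_name> elements exist in the parsed HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(tag_name))


# BeautifulSoup never runs the script, so the button it would create is absent.
print(count_tags(STATIC_HTML, "button"))  # prints 0
print(count_tags(STATIC_HTML, "div"))     # prints 1
```

For a real page, the string returned by `urllib.request.urlopen(url).read()` would take the place of `STATIC_HTML`; a count of zero for the tag you are after suggests that the element is generated by a script.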

So I suggest using Selenium as the solution, and I tried a basic scrape of the website.

Here is the code I used:

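from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Run Chrome without opening a window
option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.binary_location = r'Your chrome.exe file path'

# Selenium 4 style: the chromedriver path is passed through a Service object
service = Service(executable_path=r'Your chromedriver.exe file path')
browser = webdriver.Chrome(service=service, options=option)

# The browser executes the page's JavaScript before we read the elements
browser.get(r"https://www.winamax.fr/paris-sportifs/sports/1/7/4")

# Selenium 4 style: elements are located with find_elements(By.<strategy>, value)
span_tags = browser.find_elements(By.TAG_NAME, 'span')
for span_tag in span_tags:
    print(span_tag.text)

browser.quit()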

This is the output:

There is some junk data in this output, but it's up to you to figure out what you need and what you don't!
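
To come back to the original question, the HTML that Selenium has rendered can be handed to BeautifulSoup, and the junk can be filtered out with a pattern. The following is a minimal sketch, not a tested scraper for Winamax. It assumes that the rendered page really contains the `odd-price` buttons seen in the inspector, and that odds are displayed as decimals such as `1,85` or `2.10`. `SAMPLE_RENDERED_HTML`, `ODDS_PATTERN`, and `extract_odds` are hypothetical names, and the sample markup is invented for illustration.

```python
import re

from bs4 import BeautifulSoup

# Assumption: odds look like "1,85" or "2.10" (digits, a comma or dot, then
# one or two digits). Adjust this to whatever the site actually displays.
ODDS_PATTERN = re.compile(r"\d+[.,]\d{1,2}")

# Invented sample of rendered HTML; in practice this would be
# browser.page_source, read after browser.get(...) in the Selenium code.
SAMPLE_RENDERED_HTML = """
<div>
  <button class="ui-touchlink-needsclick price odd-price"><span>1,85</span></button>
  <button class="ui-touchlink-needsclick price odd-price"><span>3.40</span></button>
  <button class="menu-button"><span>Football</span></button>
  <span>Live</span>
</div>
"""


def extract_odds(rendered_html):
    """Return odds as floats from buttons carrying the 'odd-price' class."""
    soup = BeautifulSoup(rendered_html, "html.parser")
    odds = []
    # class_ matches any single class within a multi-class attribute.
    for button in soup.find_all("button", class_="odd-price"):
        text = button.get_text(strip=True)
        # Keep only text that is entirely a decimal number; drop the rest.
        if ODDS_PATTERN.fullmatch(text):
            odds.append(float(text.replace(",", ".")))
    return odds


print(extract_odds(SAMPLE_RENDERED_HTML))  # prints [1.85, 3.4]
```

With a live browser session, the call would be `extract_odds(browser.page_source)`, made before `browser.quit()`. The page may need a moment to finish rendering before `page_source` contains the buttons.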

