Problem description
I am trying to scrape a few elements and return the text displayed on the webpage. I believe I can find the elements fine through CSS selectors and XPaths, but I cannot return the desired text. Here is my program:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import time
import threading
import pandas as pd

threadLocal = threading.local()

def instantiate_chrome():
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        options = webdriver.ChromeOptions()
        options.add_argument('log-level=3')
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--ignore-ssl-errors')
        driver = webdriver.Chrome(executable_path = r'path/to/chrome', options = options)
        setattr(threadLocal, 'driver', driver)
    return driver

def search_stock(driver, stock):
    search_url = r'https://www.forbes.com/search/?q=' + stock
    driver.get(search_url)
    time.sleep(2)
    driver.find_element_by_xpath(r'/html/body/div[1]/main/div[1]/div[1]/div[4]/div/div[1]/div/div[1]/a[1]').click()

def get_q_score(stock, driver):
    df = pd.DataFrame(columns = ['stock','overall_score','quality', 'momentum','growth','technicals'])
    time.sleep(3)
    overall_score = driver.find_element_by_css_selector(r'.q-factor-total .q-score-bar__grade-label').text
    quality_score = driver.find_element_by_xpath(r'/html/body/div[1]/main/div/div[1]/div[4]/div[2]/div[2]/div[1]/div[2]/div[1]').text
    return print('overall score is '+ overall_score, ' quality score is ' + quality_score)

def main(stock):
    driver = instantiate_chrome()
    print('attempting to get q score for ' + stock)
    search_stock(driver, stock)
    print('found webpage for ' + stock)
    get_q_score(stock, driver)

main('AAPL')
I believe the issue is that I am attempting to scrape the text via Selenium's .text property, but there is no text to scrape. Any thoughts?
Recommended answer
You were on the right path, except that the text you mention isn't actually text. These labels are rendered through a CSS property called content, which is used with the pseudo-elements :before and :after. The CSS documentation for the content property explains how it works, if you are interested.
The text is rendered the way an icon would be; organizations sometimes do this to keep sensitive values from being scraped. However, there is a (somewhat harder) way around it. Using Selenium and JavaScript, you can individually target the CSS value of the content property, which holds the values you are after.
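As a minimal sketch of that idea for a single element: the helper below asks the browser for the computed content value of an element's pseudo-element. The function name read_pseudo_content and the constant PSEUDO_CONTENT_SCRIPT are hypothetical and only for illustration; driver is assumed to be a Selenium WebDriver and element a WebElement that was located beforehand.

```python
# Hypothetical helper: the names here are illustrative, not part of Selenium.
PSEUDO_CONTENT_SCRIPT = (
    "return window.getComputedStyle(arguments[0], arguments[1])"
    ".getPropertyValue('content');"
)


def read_pseudo_content(driver, element, pseudo="::before"):
    """Return the computed CSS content value of an element's pseudo-element.

    driver  -- assumed to be a Selenium WebDriver (anything with execute_script)
    element -- assumed to be a WebElement that was located earlier
    pseudo  -- which pseudo-element to read, "::before" or "::after"
    """
    # Selenium passes the extra arguments to the script as
    # arguments[0], arguments[1], and so on.
    return driver.execute_script(PSEUDO_CONTENT_SCRIPT, element, pseudo)
```

With a live browser session this would be called as read_pseudo_content(driver, element). The value comes back with its quotation marks included, for example '"B"' rather than 'B'.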
After looking into it for about an hour, this is the simplest Pythonic way I found to get the values you want:
overall_score = driver.execute_script("return [...document.querySelectorAll('.q-score-bar__grade-label')].map(div => window.getComputedStyle(div,':before').content)") #key line in the problem
The code runs a short piece of JavaScript that selects the elements by their class and then maps each div element to the value of its CSS content property on the :before pseudo-element. This returns a list:
['"TOP BUY"', '"B"', '"B"', '"B"', '"A"']
The values correspond, in the following order, to:
Q-Factor Score / Quality / Momentum / Growth / Technicals
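Since the order is fixed, the values can be paired with names. The sketch below uses the example list from this answer as hard-coded input and borrows the label names from the DataFrame columns in the question. The variable names are illustrative, and it assumes each value is wrapped in double quotes, as in the list above.

```python
# Example output copied from the answer; on a live page this would be the
# list returned by driver.execute_script(...).
raw_values = ['"TOP BUY"', '"B"', '"B"', '"B"', '"A"']

# Label names taken from the DataFrame columns in the question.
labels = ["overall_score", "quality", "momentum", "growth", "technicals"]

# Strip the surrounding double quotes and pair each value with its label.
scores = dict(zip(labels, (value.strip('"') for value in raw_values)))

print(scores)
```

A dictionary keeps the meaning of each grade next to the grade itself, so later code does not have to remember which position holds which score.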
To access the values of the list, you can use indexing to select a single value or a for loop to go through all of them.