Problem description
I'm writing a script to rip m4a files from a specific website. I'm currently using BS4 and Selenium for this.
I'm having some trouble getting the info. The file link is not located in the HTML source for the page. Instead, I can only find it in the console. The link I'm trying to get is here in this image (https://imgur.com/a/DLwcE0p) labeled "audio_url_m4a:".
Here's some sample code I'm using:
import time

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Ask ChromeDriver to collect every browser console message
d = DesiredCapabilities.CHROME
d['loggingPrefs'] = {'browser': 'ALL'}
driver = webdriver.Chrome(r'chromedriver path', desired_capabilities=d)

# ~~lots of code doing other things not relevant to the post~~

for URL in audm_URL:  # audm_URL is a list of URLs constructed earlier in the script
    driver.get(URL)
    time.sleep(3)  # give the page time to load
    for entry in driver.get_log('browser'):
        print(entry)
Here is the output I get:
{'level': 'SEVERE', 'message': 'https://audm.herokuapp.com/favicon.ico - Failed to load resource: the server responded with a status of 404 (Not Found)', 'source': 'network', 'timestamp': 1611291689357}
{'level': 'SEVERE', 'message': 'https://cdn.segment.com/analytics.js/v1/5DOhLj2nIgYtQeSfn9YF5gpAiPqRtWSc/analytics.min.js - Failed to load resource: net::ERR_NAME_NOT_RESOLVED', 'source': 'network', 'timestamp': 1611291689357}
Most questions about grabbing things from the console point me towards grabbing the logs, but none of them show how to grab the other values that appear there, such as audio_url_m4a. Any ideas?
Here's a link to a random audio page that I want to grab the file from: https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c
Thanks everyone!
Answer
driver.get(
    "https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c")
# Click the first button on the page (presumably the play button)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "button"))).click()
# Wait for the player's video element, then read the media URL from its src attribute
src = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".react-player video"))).get_attribute("src")
print(src)
If you just want the src, you can use the code above.
You need these imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
If you want to get it through the console log instead, use the following. It seems to work only in headless mode; I am investigating:
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
# Ask ChromeDriver to collect every browser console message
capabilities = webdriver.DesiredCapabilities.CHROME.copy()
capabilities['loggingPrefs'] = {'browser': 'ALL'}
driver = webdriver.Chrome(options=options, desired_capabilities=capabilities)
driver.maximize_window()
driver.get(
    "https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c")
time.sleep(3)  # wait after loading so the page has time to write to the console
for entry in driver.get_log('browser'):
    print(entry)
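Once the entries are captured, the link still has to be picked out of them. The following is only a rough sketch of that filtering step, not part of the original answer. It assumes the URL appears as plain text inside an entry's 'message' field, which may not hold if the page logs an object rather than a string. The function name find_m4a_urls, the regular expression, and the sample entries are all made up for illustration; only the dict shape matches the output shown in the question.

```python
import re

# Hypothetical pattern: an http(s) URL ending in .m4a, optionally followed
# by a query string. Quotes, backslashes and whitespace end the match.
M4A_URL = re.compile(r"https?://[^\s\"'\\]+?\.m4a(?:\?[^\s\"'\\]*)?")


def find_m4a_urls(entries):
    """Return the unique .m4a URLs found in log entry messages, in order."""
    found = []
    for entry in entries:
        for url in M4A_URL.findall(entry.get("message", "")):
            if url not in found:
                found.append(url)
    return found


# Made-up entries for illustration; real ones come from driver.get_log('browser').
sample_entries = [
    {"level": "SEVERE",
     "message": "https://audm.herokuapp.com/favicon.ico - Failed to load resource",
     "source": "network", "timestamp": 1611291689357},
    {"level": "INFO",
     "message": 'audio_url_m4a: "https://example.com/audio/episode.m4a"',
     "source": "console-api", "timestamp": 1611291689400},
]

print(find_m4a_urls(sample_entries))
```

In the real script, the list passed in would be driver.get_log('browser') instead of sample_entries.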
Update
In headless mode w3c is false, which is why the log capture works there.
For non-headless mode you have to set it explicitly:
options.add_experimental_option('w3c', False)
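As a side note that goes beyond the original answer: ChromeDriver in W3C mode expects the logging preference under the vendor-prefixed key 'goog:loggingPrefs', while the plain 'loggingPrefs' key used above belongs to the legacy (non-W3C) protocol. The small sketch below only builds the capability entry for either case. The helper name logging_capabilities is hypothetical, and it is worth checking the key against the ChromeDriver version in use.

```python
def logging_capabilities(w3c):
    """Return the capability entry that requests all browser console logs.

    Assumption: W3C-mode ChromeDriver reads 'goog:loggingPrefs', while
    legacy mode (w3c set to False) reads plain 'loggingPrefs'.
    """
    key = "goog:loggingPrefs" if w3c else "loggingPrefs"
    return {key: {"browser": "ALL"}}


# Legacy mode, matching options.add_experimental_option('w3c', False)
print(logging_capabilities(w3c=False))
# W3C mode, for setups that leave w3c at its default
print(logging_capabilities(w3c=True))
```

With the setup from the answer, the result could be merged into the capabilities dict, for example capabilities.update(logging_capabilities(w3c=False)), before the driver is created.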