Question
I am new to web scraping, and for one of the projects I am working on I need to retrieve Bitcoin transaction data over time from an interactive chart (https://bitinfocharts.com/comparison/bitcoin-transactions.html) using Python 2.7. All the data I want appears to be hidden in the 855x455 canvas rather than sitting directly in the HTML. However, I can find the data in the page source in the form [new Date("2018/02/18"),159333]]. Why is that, and how can I scrape this data? Thanks for the help!
Recommended answer
Looking at the HTML response, I found a script tag that contains all the entries added to the canvas.
<script>
var gIsLog = 0;
var gIsZoomed = "";
var d;
$(function() {
    $(".average").each(function() {
        $(this).html('Average ' + $(this).html());
    });
    $(".simple").each(function() {
        $(this).html('Simple ' + $(this).html());
    });
    $(".exponential").each(function() {
        $(this).html('Exponential ' + $(this).html());
    });
    $(".weighted").each(function() {
        $(this).html('Weighted ' + $(this).html());
    });
    $("#container").height(($(window).height() - 355 - $('#buttonsHDiv').height() > 200) ? $(window).height() - 355 - $('#buttonsHDiv').height() : 200);
    $(window).resize(function() {
        $("#container").height(($(window).height() - 355 - $('#buttonsHDiv').height() > 200) ? $(window).height() - 355 - $('#buttonsHDiv').height() : 200);
    });
    d = new Dygraph(document.getElementById("container"), [
        [new Date("2009/01/03"), null],
        [new Date("2009/01/04"), null],
        [new Date("2009/01/05"), null],
        [new Date("2009/01/06"), null],
        [new Date("2009/01/07"), null],
        [new Date("2009/01/08"), null],
Using this fact, I wrote the code below with a regular expression. It does what you want: I parse the response text, find the script tag that contains the required data, and apply the regex to it. Please have a look.
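import re
import requests
from bs4 import BeautifulSoup

url = 'https://bitinfocharts.com/comparison/bitcoin-transactions.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

# The answer found the chart data in the sixth <script> tag (index 5);
# this index depends on the page layout and may need adjusting.
script_tag = soup.find_all('script')[5]
script_text = script_tag.text

# Match records such as [new Date("2018/02/18"),159333]; the value may also be null.
pattern = re.compile(r'\[new Date\("\d{4}/\d{2}/\d{2}"\),\d*\w*\]')
records = pattern.findall(script_text)

def parse_record(record):
    # Characters 11-20 hold the date; the value runs from index 24
    # up to (not including) the closing bracket.
    date = record[11:21]
    value = record[24:-1]
    return [date, value]

transactions = [parse_record(record) for record in records]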
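As a variation on the parsing step above, here is a minimal sketch that uses regex capture groups instead of fixed string slices, and converts each record into a date and an integer (or None for null). It is only an illustration under a few assumptions: Python 3, values that are either whole numbers or null, and a regex run over the whole page text rather than over one script tag picked by index. The names RECORD_RE, extract_series, and sample are hypothetical, and sample is a made-up stand-in for the downloaded page source.

```python
import re
from datetime import datetime

# Matches one chart record, e.g. [new Date("2018/02/18"),159333]
# Group 1 is the date text, group 2 is the value (digits or the literal null).
# Optional whitespace after the comma is tolerated (an assumption about the page).
RECORD_RE = re.compile(r'\[new Date\("(\d{4}/\d{2}/\d{2})"\),\s*(\d+|null)\]')


def extract_series(page_text):
    """Return a list of (date, value) tuples; value is None for null entries."""
    series = []
    for date_text, value_text in RECORD_RE.findall(page_text):
        day = datetime.strptime(date_text, "%Y/%m/%d").date()
        value = None if value_text == "null" else int(value_text)
        series.append((day, value))
    return series


# Hypothetical sample standing in for response.text from the real page.
sample = (
    '<script>d = new Dygraph(document.getElementById("container"), ['
    '[new Date("2009/01/03"), null],'
    '[new Date("2018/02/18"),159333]]'
    ');</script>'
)

for day, value in extract_series(sample):
    print(day.isoformat(), value)
```

With the real page, extract_series(response.text) would take the place of the script-tag lookup, so a change in the number or order of script tags would not matter. The trade-off is that the regex would also pick up any other text on the page that happens to have the same shape.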