问题描述
我正在尝试抓取此链接的数据:
I am trying to scrape the data for this link: page.
If you click the up arrow you will notice the highlighted days in the month sections. Clicking on a highlighted day, a table with initiated tenders for that day will appear. All I need to do is get the data in each table for each highlighted day in the calendar. There might be one or more tenders (up to max of 7) per day.
I have done some web scraping with bs4, however I think that this is a job for selenium (please, correct me if I am wrong) with which I am not very familiar.
So far, I have managed to find the arrow element by XPATH to navigate around the calendar and show me more months. After that I try clicking on a random day (in below code I clicked on 30.03.2020) upon which an html object called: "tenders-table cloned" appears in the html on inspect. The object name stays the same no matter what day you click on.
I am pretty stuck now, have tried to select by iterate and/or print what is inside that object table, it either says that object is not iterable or is None.
from selenium import webdriver
chrome_path = r"C:\Users\<name>\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.ibex.bg/bg/данни-за-пазара/централизиран-пазар-за-двустранни-договори/търговски-календар/")
driver.find_element_by_xpath("""//*[@id="content"]/div[3]/div/div[1]/div/i""").click()
driver.find_element_by_xpath("""//*[@id="content"]/div[3]/div/div[2]/div[1]/div[3]/table/tbody/tr[6]/td[1]""").click()
Please advice how I can proceed to extract the data from the table pop-up.
Well, i see there's no reason to use selenium
for such case as it's will slow down your task.
The website is loaded with JavaScript
event which render it's data dynamically once the page loads.
requests
library will not be able to render JavaScript
on the fly. so you can use selenium
or requests_html
. and indeed there's a lot of modules which can do that.
Now, we do have another option on the table, to track from where the data is rendered. I were able to locate the XHR request which is used to retrieve the data from the back-end
API
and render it to the users side.
import requests
import json
data = {
'from': '2020-1-01',
'to': '2020-3-01'
}
def main(url):
r = requests.post(url, data=data).json()
print(json.dumps(r, indent=4)) # to see it in nice format.
print(r.keys())
main("http://www.ibex.bg/ajax/tenders_ajax.php")
Because am just a lazy coder: I will do it in this way:
import requests
import re
import pandas as pd
import ast
from datetime import datetime
data = {
'from': '2020-1-01',
'to': '2020-3-01'
}
def main(url):
r = requests.post(url, data=data).json()
matches = set(re.findall(r"tender_date': '([^']*)'", str(r)))
sort = (sorted(matches, key=lambda k: datetime.strptime(k, '%d.%m.%Y')))
print(f"Available Dates: {sort}")
opa = re.findall(r"({\'id.*?})", str(r))
convert = [ast.literal_eval(x) for x in opa]
df = pd.DataFrame(convert)
print(df)
df.to_csv("data.csv", index=False)
main("http://www.ibex.bg/ajax/tenders_ajax.php")
Output: view-online
这篇关于网页抓取"onclick"消息使用python的网站上的对象表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!