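I'm scraping a website with the code below and having trouble extracting one specific value from each element, alongside its attributes: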
from bs4 import BeautifulSoup
import pandas as pd
import requests

# Get mills and estates information from dashboard
url = 'http://nestetraceabilitydashboard.com/nestes-palm-oil-dashboard'
page = requests.get(url).text
soup = BeautifulSoup(page, "html.parser")
divList = soup.findAll('div', attrs={"class" : "map-item estate-map-item"})

# Collect every attribute except class into a dict of column lists
data = {}
for div in divList:
    for k,v in div.attrs.items():
        # ('class') is just the string 'class', not a tuple, so compare directly
        if k != 'class':
            data[k] = data.get(k, []) + [v]
df = pd.DataFrame(data)
An excerpt of divList looks like this:

[<div class="map-item estate-map-item" data-country="Indonesia" data-latitude="1.926944000" data-location="Riau" data-longitude="99.906390000" data-mills="Aek Nabara" id="map_item_5600">(Aek Nabara) - Aek Nabara</div>,
<div class="map-item estate-map-item" data-country="Indonesia" data-latitude="0.429444444" data-location="Riau" data-longitude="101.818611100" data-mills="Buatan I " id="map_item_5601">(Buatan I/II ) - Buatan</div>,
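
However, the resulting dict and dataframe drop everything that comes after map_item_XXXX in id. How can I take just the value outside the quotes, e.g. (Aek Nabara) - Aek Nabara for the first item of divList above, and put it into the dict and then into the id column of the dataframe?

Best answer
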
(Aek Nabara) - Aek Nabara is not an attribute, so it does not appear in .attrs; it is the element's text content. Use .text to get the value:

for div in divList:
    for k,v in div.attrs.items():
        if k != 'class':
            if k == 'id':
                # insert "(Aek Nabara) - Aek Nabara" instead of "map_item_5600"
                data[k] = data.get(k, []) + [div.text.strip()]
            else:
                data[k] = data.get(k, []) + [v]
df = pd.DataFrame(data)
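
As a variation on the loop above, here is a minimal sketch that keeps the original id and stores the element text in a separate column rather than overwriting id. It runs against the two div elements from the excerpt instead of the live page, so nothing is fetched. The column name "name" and the variable names div_list and rows are assumptions made for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample markup copied from the divList excerpt in the question,
# so the example runs without a network request.
html = """
<div class="map-item estate-map-item" data-country="Indonesia" data-latitude="1.926944000" data-location="Riau" data-longitude="99.906390000" data-mills="Aek Nabara" id="map_item_5600">(Aek Nabara) - Aek Nabara</div>
<div class="map-item estate-map-item" data-country="Indonesia" data-latitude="0.429444444" data-location="Riau" data-longitude="101.818611100" data-mills="Buatan I " id="map_item_5601">(Buatan I/II ) - Buatan</div>
"""

soup = BeautifulSoup(html, "html.parser")
div_list = soup.find_all("div", attrs={"class": "map-item estate-map-item"})

rows = []
for div in div_list:
    # Copy every attribute except class into one record per element.
    row = {k: v for k, v in div.attrs.items() if k != "class"}
    # "name" is a hypothetical column holding the element's text content.
    row["name"] = div.get_text(strip=True)
    rows.append(row)

# One row per div, one column per attribute plus the text column.
df = pd.DataFrame(rows)
print(df[["id", "name"]])
```

Building a list of per-element dicts, rather than a dict of lists, keeps each row's values together, and pandas fills in a missing value if some element happens to lack an attribute.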