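I'm having trouble extracting a specific value from elements scraped from a website. The code below collects each element's attributes: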

from bs4 import BeautifulSoup
import pandas as pd
import requests

# Get mills and estates information from dashboard
url = 'http://nestetraceabilitydashboard.com/nestes-palm-oil-dashboard'
page = requests.get(url).text
soup = BeautifulSoup(page, "html.parser")

divList = soup.findAll('div', attrs={"class" : "map-item estate-map-item"})
data = {}
for div in divList:
    for k,v in div.attrs.items():
        if k != 'class':
            data[k] = data.get(k, []) + [v]

df = pd.DataFrame(data)


An excerpt of divList looks like this:

[<div class="map-item estate-map-item" data-country="Indonesia" data-latitude="1.926944000" data-location="Riau" data-longitude="99.906390000" data-mills="Aek Nabara" id="map_item_5600">(Aek Nabara) - Aek Nabara</div>,
 <div class="map-item estate-map-item" data-country="Indonesia" data-latitude="0.429444444" data-location="Riau" data-longitude="101.818611100" data-mills="Buatan I " id="map_item_5601">(Buatan I/II ) - Buatan</div>,


However, the resulting dict and DataFrame drop everything that comes after map_item_XXXX in the id, i.e. the text that follows the opening tag.

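How can I get just the value outside the quotes into the dict, and then into the id column of the DataFrame? For the first item in divList above, that value would be (Aek Nabara) - Aek Nabara.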

Best answer

(Aek Nabara) - Aek Nabara is not an attribute (so it is not in .attrs); it is the element's text content. Use .text to get the value:

data = {}  # start from an empty dict so values from an earlier run are not duplicated
for div in divList:
    for k,v in div.attrs.items():
        if k != 'class':
            if k == 'id':
                # insert "(Aek Nabara) - Aek Nabara" instead of "map_item_5600"
                data[k] = data.get(k, []) + [div.text.strip()]
            else:
                data[k] = data.get(k, []) + [v]

df = pd.DataFrame(data)
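
To try the same idea without hitting the live site, here is a minimal sketch that parses the two divs from the excerpt in the question. It is a row-wise variant of the loop above, not the original code: the names html, div_list and rows are made up for this illustration, and get_text(strip=True) is used in place of .text.strip().

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample markup copied from the divList excerpt in the question.
html = """
<div class="map-item estate-map-item" data-country="Indonesia" data-latitude="1.926944000" data-location="Riau" data-longitude="99.906390000" data-mills="Aek Nabara" id="map_item_5600">(Aek Nabara) - Aek Nabara</div>
<div class="map-item estate-map-item" data-country="Indonesia" data-latitude="0.429444444" data-location="Riau" data-longitude="101.818611100" data-mills="Buatan I " id="map_item_5601">(Buatan I/II ) - Buatan</div>
"""

soup = BeautifulSoup(html, "html.parser")
div_list = soup.find_all("div", class_="map-item estate-map-item")

rows = []
for div in div_list:
    # Copy every attribute except class into one dict per div.
    row = {k: v for k, v in div.attrs.items() if k != "class"}
    # The visible label is the tag's text, not an attribute,
    # so it replaces "map_item_XXXX" in the id column here.
    row["id"] = div.get_text(strip=True)
    rows.append(row)

# One dict per row: pandas aligns the columns by key.
df = pd.DataFrame(rows)
print(df[["id", "data-mills"]])
```

Building one dict per div means a div that lacks some attribute simply gets a missing value in that column, whereas the dict-of-lists approach would end up with lists of unequal length, which pd.DataFrame rejects. If the original map_item_XXXX value is still needed, the text could be stored under a separate key instead of overwriting id.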

08-15 22:31