python - Collecting data when elements lack a class/id

I am trying to scrape data to build an object that looks like this:

{
    "artist": "Oasis",
    "albums": {
        "Definitely Maybe": [
            "Rock n Roll Star",
            "Shakermaker",
            ...
        ],
        "(What's The Story) Morning Glory": [
            "Hello",
            "Roll With It",
            ...
        ],
        ...
    }
}

Here is how the HTML on the page looks (reproduced in the edit further down):

I am currently scraping the data like this:

data = []
for div in soup.find_all("div", {"id": "listAlbum"}):
    for a in div.find_all("a"):
        text = a.text.strip()
        # Skip empty anchors such as <a id="1368"></a>.
        # (Comparing strings with `is ""` tests identity, not equality.)
        if text:
            data.append(text)

Likewise, getting the album names is simple.

for div in soup.find_all("div",{"class":"album"}):
    titles = div.findAll('b')
    for t in titles:
        ...

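My question is how to use the two loops above to build an object like the one at the top. How do I make sure the songs from album X end up in the correct album object? If every song had an album attribute, it would be clear to me. But with the HTML structured the way it is, I'm a bit lost.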

Edit: the HTML is below:

<div id="listAlbum">
  <a id="1368"></a>
  <div class="album">album: <b>"Definitely Maybe"</b> (1994)</div>
  <a href="../lyrics/oasis/rocknrollstar.html" target="_blank">Rock 'n' Roll Star</a><br>
  <a href="../lyrics/oasis/shakermaker.html" target="_blank">Shakermaker</a><br>
  <a href="../lyrics/oasis/liveforever.html" target="_blank">Live Forever</a><br>
  <a href="../lyrics/oasis/upinthesky.html" target="_blank">Up In The Sky</a><br>
  <a href="../lyrics/oasis/columbia.html" target="_blank">Columbia</a><br>
  <a href="../lyrics/oasis/supersonic.html" target="_blank">Supersonic</a><br>
  <a href="../lyrics/oasis/bringitondown.html" target="_blank">Bring It On Down</a><br>
  <a href="../lyrics/oasis/cigarettesalcohol.html" target="_blank">Cigarettes &amp; Alcohol</a><br>
  <a href="../lyrics/oasis/digsysdiner.html" target="_blank">Digsy's Diner</a><br>
  <a href="../lyrics/oasis/slideaway.html" target="_blank">Slide Away</a><br>
  <a href="../lyrics/oasis/marriedwithchildren.html" target="_blank">Married With Children</a><br>
  <a href="../lyrics/oasis/sadsong.html" target="_blank">Sad Song</a><br>

  <a id="1366"></a>
  <div class="album">album: <b>"(What's The Story) Morning Glory"</b> (1995)</div>
  <a href="../lyrics/oasis/hello.html" target="_blank">Hello</a><br>
  <a href="../lyrics/oasis/rollwithit.html" target="_blank">Roll With It</a><br>
  <a href="../lyrics/oasis/wonderwall.html" target="_blank">Wonderwall</a><br>
  <a href="../lyrics/oasis/dontlookbackinanger.html" target="_blank">Don't Look Back In Anger</a><br>
  <a href="../lyrics/oasis/heynow.html" target="_blank">Hey Now</a><br>
  <a href="../lyrics/oasis/somemightsay.html" target="_blank">Some Might Say</a><br>
  <a href="../lyrics/oasis/castnoshadow.html" target="_blank">Cast No Shadow</a><br>
  <a href="../lyrics/oasis/sheselectric.html" target="_blank">She's Electric</a><br>
  <a href="../lyrics/oasis/morningglory.html" target="_blank">Morning Glory</a><br>
  <a href="../lyrics/oasis/champagnesupernova.html" target="_blank">Champagne Supernova</a><br>
  <a href="../lyrics/oasis/boneheadsbankholiday.html" target="_blank">Bonehead's Bank Holiday</a><br>

Best answer

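You can use find_next_siblings() for this. It returns the tags that follow a given tag at the same level, in document order, so every song link can be assigned to the most recent album div seen before it.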
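As a quick illustration of what find_next_siblings() returns, here is a minimal sketch on a tiny made-up fragment (the album and song names are placeholders, and the built-in html.parser is assumed here instead of lxml). Only the <a> and <div> siblings that follow the first album div come back, in document order.

```python
from bs4 import BeautifulSoup

# Made-up fragment mirroring the page layout:
# album <div>s and song <a>s are siblings inside one container.
html = """
<div id="listAlbum">
  <div class="album">album: <b>"A"</b></div>
  <a href="a1.html">Song A1</a><br>
  <div class="album">album: <b>"B"</b></div>
  <a href="b1.html">Song B1</a><br>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
first_album = soup.find("div", class_="album")

# Only <a> and <div> siblings after first_album are returned;
# <br> tags and whitespace text nodes are filtered out.
siblings = [
    (tag.name, tag.get_text())
    for tag in first_album.find_next_siblings(["a", "div"])
]
print(siblings)
```

Because the result keeps document order, a <div> in the sequence marks the point where one album ends and the next begins.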

Code:

from bs4 import BeautifulSoup

oasis = {
    'artist': 'Oasis',
    'albums': {}
}

soup = BeautifulSoup(html, 'lxml')  # where html is the html you've provided
all_albums = soup.find('div', id='listAlbum')

first_album = all_albums.find('div', class_='album')
album_name = first_album.b.text
songs = []

for tag in first_album.find_next_siblings(['a', 'div']):
    # If tag is <div> add the previous album.
    if tag.name == 'div':
        oasis['albums'][album_name] = songs
        songs = []
        album_name = tag.b.text

    # If tag is <a> append song to the list.
    else:
        songs.append(tag.text)

# Add the last album
oasis['albums'][album_name] = songs

print(oasis)

Output:

{
    'artist': 'Oasis',
    'albums': {
        '"Definitely Maybe"': ["Rock 'n' Roll Star", 'Shakermaker', 'Live Forever', 'Up In The Sky', 'Columbia', 'Supersonic', 'Bring It On Down', 'Cigarettes & Alcohol', "Digsy's Diner", 'Slide Away', 'Married With Children', 'Sad Song', ''],
        '"(What\'s The Story) Morning Glory"': ['Hello', 'Roll With It', 'Wonderwall', "Don't Look Back In Anger", 'Hey Now', 'Some Might Say', 'Cast No Shadow', "She's Electric", 'Morning Glory', 'Champagne Supernova', "Bonehead's Bank Holiday"]
    }
}

Edit:

After checking the website, I made a few changes to the code.

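First, you need to skip the <a id="6910"></a> tag (one appears at the end of each album), because it adds a song with an empty name. Second, the text other songs: is not inside a <b> tag, so album_name = tag.b.text raises an error (tag.b is None for that div).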
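Both problems can be reproduced in isolation. The following is a minimal sketch on a made-up fragment (placeholder song names, built-in html.parser assumed instead of lxml) that shows what the id-only anchor and the <b>-less div look like from BeautifulSoup's side.

```python
from bs4 import BeautifulSoup

# Made-up fragment: an id-only anchor and an album <div> without <b>.
html = """
<div id="listAlbum">
  <div class="album">album: <b>"A"</b></div>
  <a href="a1.html">Song A1</a><br>
  <a id="6910"></a>
  <div class="album">other songs:</div>
  <a href="x.html">Song X</a><br>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
anchor = soup.find("a", id="6910")
song_link = soup.find("a", href="a1.html")
other = soup.find_all("div", class_="album")[1]

print(repr(anchor.text))     # empty text: the source of the '' song
print(anchor.get("id"))      # the id-only anchor has an id attribute
print(song_link.get("id"))   # real song links have none, so this is None
print(other.b)               # None: other.b.text would raise AttributeError
```

This is why checking tag.get('id') is enough to tell the marker anchors apart from song links on this page, and why the album name needs a fallback when there is no <b>.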

Making the following changes gives you exactly what you need.

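for tag in first_album.find_next_siblings(['a', 'div']):
    if tag.name == 'div':
        # A new album starts: store the previous one and reset the song list.
        oasis['albums'][album_name] = songs
        songs = []
        # The "other songs:" div has no <b>, so fall back to its full text.
        album_name = tag.text if tag.text == 'other songs:' else tag.b.text
        continue
    # Skip the id-only anchors such as <a id="6910"></a>.
    if tag.get('id'):
        continue
    songs.append(tag.text)

# As before, the last album is still added after the loop with
# oasis['albums'][album_name] = songs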

Final output:

{
    'artist': 'Oasis',
    'albums': {
        '"Definitely Maybe"': ["Rock 'n' Roll Star", 'Shakermaker', 'Live Forever', 'Up In The Sky', 'Columbia', 'Supersonic', 'Bring It On Down', 'Cigarettes & Alcohol', "Digsy's Diner", 'Slide Away', 'Married With Children', 'Sad Song'],
        '"(What\'s The Story) Morning Glory"': ['Hello', 'Roll With It', 'Wonderwall', "Don't Look Back In Anger", 'Hey Now', 'Some Might Say', 'Cast No Shadow', "She's Electric", 'Morning Glory', 'Champagne Supernova', "Bonehead's Bank Holiday"],
        '"Be Here Now"': ["D'You Know What I Mean?", 'My Big Mouth', 'Magic Pie', 'Stand By Me', 'I Hope, I Think, I Know', 'The Girl In The Dirty Shirt', 'Fade In-Out', "Don't Go Away", 'Be Here Now', 'All Around The World', "It's Getting Better (Man!!)"],
        '"The Masterplan"': ['Acquiesce', 'Underneath The Sky', 'Talk Tonight', 'Going Nowhere', 'Fade Away', 'I Am The Walrus (Live)', 'Listen Up', "Rockin' Chair", 'Half The World Away', "(It's Good) To Be Free", 'Stay Young', 'Headshrinker', 'The Masterplan'],
        '"Standing On The Shoulder Of Giants"': ["Fuckin' In The Bushes", 'Go Let It Out', 'Who Feels Love?', 'Put Yer Money Where Yer Mouth Is', 'Little James', 'Gas Panic!', 'Where Did It All Go Wrong?', 'Sunday Morning Call', 'I Can See A Liar', 'Roll It Over'],
        '"Heathen Chemistry"': ['The Hindu Times', 'Force Of Nature', 'Hung In A Bad Place', 'Stop Crying Your Heart Out', 'Song Bird', 'Little By Little', '(Probably) All In The Mind', 'She Is Love', 'Born On A Different Cloud', 'Better Man'],
        '"Don\'t Believe The Truth"': ['Turn Up The Sun', 'Mucky Fingers', 'Lyla', 'Love Like A Bomb', 'The Importance Of Being Idle', 'The Meaning Of Soul', "Guess God Thinks I'm Abel", 'Part Of The Queue', 'Keep The Dream Alive', 'A Bell Will Ring', 'Let There Be Love'],
        '"Dig Out Your Soul"': ['Bag It Up', 'The Turning', 'Waiting For The Rapture', 'The Shock Of The Lightning', "I'm Outta Time", '(Get Off Your) High Horse Lady', 'Falling Down', "To Be Where There's Life", "Ain't Got Nothin'", 'The Nature Of Reality', 'Soldier On', 'I Believe In All'],
        'other songs:': ["(As Long As They've Got) Cigarettes In Hell", '(I Got) The Fever', 'Alice', 'Alive', 'Angel Child', 'Boy With The Blues', 'Carry Us All', 'Cloudburst', 'Cum On Feel The Noize', "D'Yer Wanna Be A Spaceman", 'Eyeball Tickler', 'Flashbax', 'Full On', 'Helter Skelter', 'Heroes', 'I Will Believe', "Idler's Dream", 'If We Shadows', "It's Better People", 'Just Getting Older', "Let's All Make Believe", 'My Sister Lover', 'One Way Road', 'Round Are Way', 'Step Out', 'Street Fighting Man', 'Take Me', 'Take Me Away', 'The Fame', 'Whatever', "You've Got To Hide Your Love Away"]
    }
}
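The object requested at the top of the question has album names without the literal quotation marks that appear in the keys above. The following is a minimal sketch of one way to get there, not part of the original answer: collect_albums is a hypothetical helper name, the built-in html.parser is assumed instead of lxml, and the loop iterates over the container's direct children with find_all(..., recursive=False) rather than calling find_next_siblings(), which removes the need for a separate "add the last album" step. The HTML is a shortened stand-in for the real page.

```python
import json
from bs4 import BeautifulSoup


def collect_albums(html, artist):
    """Group song links under the album <div> that precedes them."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="listAlbum")
    albums = {}
    songs = None  # no album seen yet
    # Walk the direct children of the container in document order.
    for tag in container.find_all(["div", "a"], recursive=False):
        if tag.name == "div":
            # Use the <b> text when present, otherwise the whole div text.
            name = tag.b.get_text() if tag.b else tag.get_text()
            # Drop surrounding whitespace and the literal double quotes.
            songs = albums.setdefault(name.strip().strip('"'), [])
        elif songs is not None and not tag.get("id"):
            # Song links carry no id; id-only marker anchors are skipped.
            songs.append(tag.get_text(strip=True))
    return {"artist": artist, "albums": albums}


# Shortened stand-in for the page's HTML.
html = """
<div id="listAlbum">
  <a id="1368"></a>
  <div class="album">album: <b>"Definitely Maybe"</b> (1994)</div>
  <a href="../lyrics/oasis/rocknrollstar.html">Rock 'n' Roll Star</a><br>
  <a href="../lyrics/oasis/shakermaker.html">Shakermaker</a><br>
  <a id="1366"></a>
  <div class="album">album: <b>"(What's The Story) Morning Glory"</b> (1995)</div>
  <a href="../lyrics/oasis/hello.html">Hello</a><br>
  <a id="6910"></a>
  <div class="album">other songs:</div>
  <a href="../lyrics/oasis/alice.html">Alice</a><br>
</div>
"""

result = collect_albums(html, "Oasis")
print(json.dumps(result, indent=4))
```

Storing the song list in the dictionary as soon as an album div is seen means each song is appended directly to the right album, so nothing needs to be flushed after the loop.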

Regarding python - collecting data when elements lack a class/id, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48994782/