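I am trying to extract tabular data from the following website: https://msih.bgu.ac.il/md-program/residency-placements/

The page has no table tags, but I found that each section of the table sits inside a generic div with class="accord-con".

I built a dictionary whose keys are the graduation years (2019, 2018, and so on) and whose values are the HTML of each div with class="accord-con".

I am stuck on how to parse the HTML stored in the dictionary. My goal is to list the specialty, hospital, and location separately for each year, and I don't know how to proceed.

Here is my working code: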

import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

# Download and parse the page.
sauce = urllib.request.urlopen('https://msih.bgu.ac.il/md-program/residency-placements/').read()
soup = BeautifulSoup(sauce, 'lxml')

# Each accord-head div holds an <h2> whose last four characters are the graduation year.
headers = soup.find_all('div', class_='accord-head')
grad_yr_list = []
for header in headers:
    grad_yr_list.append(header.h2.text[-4:])

# Each accord-con div holds the <h4>/<ul> placement lists for one class year.
rez_classes = soup.find_all('div', class_='accord-con')

# Pair every year with its block of HTML.
data_dict = dict(zip(grad_yr_list, rez_classes))


Here is a sample of my dictionary:

{'2019': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Aventura Hospital, Aventura, Fl</li></ul><h4>Family Medicine</h4><ul><li>Louisiana State University School of Medicine, New Orleans, LA</li><li>UT St Thomas Hospitals, Murfreesboro, TN</li><li>Sea Mar Community Health Center, Seattle, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>St Joseph Hospital, Denver, CO </li></ul><h4>Obstetrics-Gynecology</h4><ul><li>Jersey City Medical Center, Jersey City, NJ</li><li>New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY</li></ul><h4>Pediatrics</h4><ul><li>St Louis Children’s Hospital, St Louis, MO</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>St Christopher’s Hospital, Philadelphia, PA</li></ul><h4>Surgery</h4><ul><li>Mountain Area Health Education Center, Asheville, NC</li></ul><p></p></div>,
 '2018': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>NYU School of Medicine, New York, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Kent Hospital, Warwick, Rhode Island</li><li>University of Connecticut School of Medicine, Farmington, CT</li><li>University of Texas Health Science Center at San Antonio, San Antonio, TX</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Family Medicine</h4><ul><li>University of Kansas Medical Center, Wichita, KS</li><li>Ellis Hospital, Schenectady, NY</li><li>Harrison Medical Center, Seattle, WA</li><li>St Francis Hospital, Wilmington, DE </li><li>University of Virginia, Charlottesville, VA</li><li>Valley Medical Center, Renton, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>Virginia Commonwealth University Health Systems, Richmond, VA</li><li>University of Chicago Medical Center, Chicago, IL</li></ul><h4>Obstetrics-Gynecology</h4><ul><li>St Francis Hospital, Hartford, CT</li></ul><h4>Pediatrics</h4><ul><li>Case Western University Hospitals Cleveland Medical Center, Cleveland, OH</li><li>Jersey Shore University Medical Center, Neptune City, NJ</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>University of Virginia, Charlottesville, VA</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Preliminary Medicine Neurology</h4><ul><li>Howard University Hospital, Washington, DC</li></ul><h4>Preliminary Medicine Radiology</h4><ul><li>Maimonides Medical Center, Bronx, NY</li></ul><h4>Preliminary Medicine Surgery</h4><ul><li>Providence Park Hospital, Southfield, MI</li></ul><h4>Psychiatry</h4><ul><li>University of Maryland Medical Center, Baltimore, MI</li></ul><p></p></div>,


My end goal is to get this data into a pandas DataFrame with the following columns: graduation year, specialty, hospital, location.

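Best answer

Your code is very close to the final result. Once the years are paired with the placement data, you only need to apply an extraction function to the latter: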

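from bs4 import BeautifulSoup as soup
import re
from selenium import webdriver

# Load the page in a real browser so the rendered placement lists are in the HTML.
_d = webdriver.Chrome('/path/to/chromedriver')
_d.get('https://msih.bgu.ac.il/md-program/residency-placements/')
d = soup(_d.page_source, 'html.parser')

def placement(block):
   # Collect the <h4> specialty headings and <ul> lists in document order.
   # The pairing below relies on them strictly alternating: h4, ul, h4, ul, ...
   r = block.find_all(re.compile('ul|h4'))
   return {r[i].text:[b.text for b in r[i+1].find_all('li')] for i in range(0, len(r)-1, 2)}

# Map each heading (e.g. 'Class of 2019') to its {specialty: [placements]} dict.
# Note: placement() must be given an element that contains the <h4>/<ul> pairs.
result = {i.h2.text:placement(i) for i in d.find_all('div', {'class':'accord-head'})}
print(result['Class of 2019'])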


Output:

{'Anesthesiology': ['University at Buffalo School of Medicine, Buffalo, NY'], 'Emergency Medicine': ['Aventura Hospital, Aventura, Fl'], 'Family Medicine': ['Louisiana State University School of Medicine, New Orleans, LA', 'UT St Thomas Hospitals, Murfreesboro, TN', 'Sea Mar Community Health Center, Seattle, WA'], 'Internal Medicine': ['Oregon Health and Science University, Portland, OR', 'St Joseph Hospital, Denver, CO\xa0'], 'Obstetrics-Gynecology': ['Jersey City Medical Center, Jersey City, NJ', 'New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY'], 'Pediatrics': ['St Louis Children’s Hospital, St Louis, MO', 'University of Maryland Medical Center, Baltimore, MD', 'St Christopher’s Hospital, Philadelphia, PA'], 'Surgery': ['Mountain Area Health Education Center, Asheville, NC']}
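The question asks for a table with graduation year, specialty, hospital, and location, so the nested dictionary above still needs flattening. The following is a minimal sketch of one way to do that, not part of the original answer. The function name `placements_to_rows` and the column names are hypothetical, and it assumes every entry has the shape "Hospital, City, State" seen in the sample output.

```python
def placements_to_rows(result):
    """Flatten {class heading: {specialty: [placement strings]}} into row dicts."""
    rows = []
    for heading, by_specialty in result.items():
        year = heading[-4:]  # 'Class of 2019' -> '2019'
        for specialty, entries in by_specialty.items():
            for entry in entries:
                # Normalise non-breaking spaces, then split from the right so
                # that commas inside a hospital name stay in the hospital field.
                parts = entry.replace('\xa0', ' ').strip().rsplit(',', 2)
                parts = [p.strip() for p in parts]
                if len(parts) == 3:
                    hospital, city, state = parts
                    location = f'{city}, {state}'
                else:
                    # Unexpected shape: keep the whole string rather than guess.
                    hospital, location = ', '.join(parts), ''
                rows.append({'grad_year': year, 'specialty': specialty,
                             'hospital': hospital, 'location': location})
    return rows


# Sample data copied from the output shown above (assumed representative).
sample = {'Class of 2019': {
    'Anesthesiology': ['University at Buffalo School of Medicine, Buffalo, NY'],
    'Internal Medicine': ['Oregon Health and Science University, Portland, OR',
                          'St Joseph Hospital, Denver, CO\xa0'],
}}
rows = placements_to_rows(sample)
print(rows[0])
```

Passing `rows` to `pd.DataFrame(rows)` should then give a DataFrame with those four columns. Splitting from the right is a design choice: it tolerates extra commas in the hospital name, but it would mislabel an entry that lacks a city or state, which is why such entries are left unsplit.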


Note: I ended up using selenium because, for me, the HTML response returned by requests.get did not include the rendered student placement data.
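If the dictionary from the question already holds the `accord-con` tags, as the sample in the question suggests, Selenium may not be needed and each value can be parsed directly. The sketch below is an alternative to the index-based pairing above, not the answerer's method. It walks from each `<h4>` to the `<ul>` that follows it. The name `parse_block` is hypothetical, and the inline HTML is a shortened copy of the 2019 sample from the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the 2019 block shown in the question (assumed representative).
html = ('<div class="accord-con">'
        '<h4>Anesthesiology</h4>'
        '<ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul>'
        '<h4>Family Medicine</h4><ul>'
        '<li>Louisiana State University School of Medicine, New Orleans, LA</li>'
        '<li>UT St Thomas Hospitals, Murfreesboro, TN</li></ul>'
        '<p></p></div>')
block = BeautifulSoup(html, 'html.parser').div


def parse_block(block):
    """Return {specialty: [placement strings]} for one accord-con div."""
    parsed = {}
    for heading in block.find_all('h4'):
        # The list for a specialty is the next <ul> sibling after its <h4>.
        ul = heading.find_next_sibling('ul')
        items = ul.find_all('li') if ul is not None else []
        parsed[heading.get_text(strip=True)] = [
            li.get_text(strip=True) for li in items
        ]
    return parsed


parsed = parse_block(block)
print(parsed)
```

Applied to the question's dictionary, this would look like `{year: parse_block(tag) for year, tag in data_dict.items()}`. One caveat: if a heading had no list of its own, `find_next_sibling` would pick up the next specialty's list, so the sketch assumes every `<h4>` is followed by its own `<ul>`.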

Regarding "python - Parsing through HTML in a dictionary", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59557831/
