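I am trying to parse the following URL, http://www.trimslabs.com/mic/300.htm, for IUPAC names, MIC values, and organism strains. I have gotten part of the way there, but I cannot find a way to keep the results grouped together. Here is what I have so far: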

import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
myurl = 'http://www.trimslabs.com/mic/300.htm'
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
#grab IUPACs
tables = page_soup.findAll("table")
table = tables[0]
IUPACS = []
for i in range(1, 454, 3):
    IUPACs = tables[i].find(text = "IUPAC").findNext('td').get_text(",", strip = True)
    print(IUPACs)
for i in range(455, 661, 3):
    IUPACs_two = tables[i].find(text = "IUPAC").findNext('td').get_text(",", strip = True)
    print(IUPACs_two)
#grab organism names
organism_list = page_soup.findAll("i")
org = organism_list[1]
for org in organism_list:
    organism = org.text
    print(organism)
#get the MIC numbers
for org in organism_list:
    numbers = org.findNext('td').get_text(",", strip = True)
    print(numbers)


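This prints most of what I want, but I completely lose track of which antibiotic (IUPAC name) each result belongs to. Realizing that each antibiotic has three tables, I also tried the following: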

chem_tables = []
name_tables = []
org_tables = []
results_tables = []
for i in range (0, 451, 3):
    # 1.  Establish three tables per document
    chem_tables.append(tables[i])
    name_tables.append(tables[i + 1].find(text = "IUPAC").findNext('td').get_text(",", strip = True))
    org_tables.append(tables[i + 2].findAll("i"))
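    # findAll returns a list of tags, so look up the next cell for each <i> tag individually
    results_tables.append([tag.findNext('td') for tag in tables[i + 2].findAll("i")])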


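This is nice, because now chem_tables[0], org_tables[0], and name_tables[0] all refer to the same drug, but I cannot for the life of me figure out how to pull the individual organism names out of org_tables without losing the information about which drug they are associated with.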

I have been banging my head against the wall on this for two days. Any help would be greatly appreciated.

Best answer

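I would approach it like this:

1) Find the IUPAC cell;

2) Get its value;

3) Find the nearest table following the IUPAC cell;

4) Find all the table rows, skipping the first two and the last one (useless data);

5) For each row, find all the font tags in the second cell to get the Organism values; then

6) Take the text of the third cell to get the MIC values;

7) Take each value from 5) and store it in a list;

8) Split 6) on commas and store the result in a list;

9) Combine everything into a dictionary.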
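To try steps 1) to 9) without hitting the network, here is a minimal sketch that runs them on a small made-up HTML fragment. The fragment, the function name parse_mic_tables, and the sample values are assumptions for illustration only; the real page's markup may differ. It uses string=, the newer name for the text= argument in Beautiful Soup 4.4 and later.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the assumed layout: a name table
# followed by a results table with two header rows and one footer row.
SAMPLE_HTML = """
<table><tr><td>IUPAC</td><td>compound-A</td></tr></table>
<table>
<tr><td>Activity data</td></tr>
<tr><td>No</td><td>Organism</td><td>MIC</td></tr>
<tr><td>1</td>
<td><font><i>E. coli</i> X1</font><br><font><i>S. aureus</i> X2</font></td>
<td>1-2<br>2-4</td></tr>
<tr><td>footer</td></tr>
</table>
"""


def parse_mic_tables(html):
    """Return one dictionary per IUPAC label found in the document."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for label in soup.find_all(string="IUPAC"):
        # Steps 1-2: the cell after the label holds the compound name
        name = label.find_next("td").get_text(",", strip=True)
        organisms, mic = [], []
        # Steps 3-4: next table in document order, minus headers and footer
        for tr in label.find_next("table").find_all("tr")[2:-1]:
            cells = tr.find_all("td")[1:]
            # Steps 5 and 7: one organism per <font> tag
            organisms.extend(f.get_text(" ", strip=True) for f in cells[0].find_all("font"))
            # Steps 6 and 8: MIC values, split on the separator we inserted
            mic.extend(cells[1].get_text(",", strip=True).split(","))
        # Step 9: keep everything for this compound together
        records.append({"IUPAC": name, "Organism": organisms, "MIC": mic})
    return records


records = parse_mic_tables(SAMPLE_HTML)
print(records)
```

Because every lookup starts from the IUPAC label itself, the grouping does not depend on table indices or on class attributes.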

Sample code:

from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.trimslabs.com/mic/300.htm')

soup = BeautifulSoup(response.content, "html.parser")

MicDatabase = []

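for IUPAC in soup.find_all(text="IUPAC"):
    # Steps 1-2: the cell after the "IUPAC" label holds the compound name
    Value = IUPAC.find_next('td').get_text(",", strip = True)

    # Accumulate over every data row so the result does not depend on the last row only
    Organism = []
    MIC = []

    # Steps 3-4: nearest following table, skipping the two header rows and the last row
    for tr in IUPAC.find_next('table').find_all("tr")[2:-1]:
        td = tr.find_all("td")[1:]

        # Steps 5 and 7: organism names sit in <font> tags in the second cell
        Organism.extend(o.get_text(" ", strip=True) for o in td[0].find_all("font"))
        # Steps 6 and 8: MIC values sit in the third cell; split on the comma separator
        MIC.extend(td[1].get_text(",", strip = True).split(','))

    # Step 9: one dictionary per compound keeps the grouping
    MicDatabase.append(
        {
            "IUPAC": Value,
            "ActivityData": {"Organism": Organism, "MIC": MIC}
        })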


Which outputs:

[{'ActivityData': {'MIC': [u'2-4', u'1-2', u'1-2', u'1-2', u'2-4', u'2-4', u'2-4', u'1-2', u'>16', u'2-4', u'1-2', u'0.25 - 0.5', u'0.25 - 0.5'], 'Organism': [u'B. pumilus ATCC 14348', u'S. epidermidis ATCC 155', u'E. faecalis ATCC 35550', u'S. aureus ATCC 25923', u'S. aureus ATCC 9144', u'S. aureus ATCC 14154', u'S. aureus ATCC 29213', u'S. aureus ATCC 700699', u'(methicillin-resistant)', u'S. aureus NRS 119', u'(linezolid-resistant)', u'E.faecalis ATCC 14506', u'E.faecalis ATCC 700802', u'(vancomycin-resistant)', u'S.pyogenes ATCC 14289', u'S.pneumoniae ATCC 700904', u'(penicillin-resistant)']}, 'IUPAC': u'2-[(S)-3-(3-Fluoro-4-morpholin-4-yl-phenyl)-2-oxo-oxazolidin-5-yl]-acetamide'}...
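Note that in this sample the Organism list has 17 entries while the MIC list has 13, because annotations such as (methicillin-resistant) come through as separate items. A minimal sketch of one way to line them up follows, assuming that a parenthesised item always belongs to the organism just before it and that the MIC values are listed in the same order as the organisms. The helper name pair_organisms_with_mic is hypothetical, and the input lists are copied from the output above.

```python
def pair_organisms_with_mic(organisms, mic_values):
    """Fold parenthesised annotations into the preceding name, then pair by position."""
    merged = []
    for name in organisms:
        if name.startswith("(") and merged:
            # Assumption: an item like "(linezolid-resistant)" annotates the previous strain
            merged[-1] = merged[-1] + " " + name
        else:
            merged.append(name)
    if len(merged) != len(mic_values):
        # Refuse to guess when the two columns do not line up
        raise ValueError(
            "got %d organisms but %d MIC values" % (len(merged), len(mic_values))
        )
    return list(zip(merged, mic_values))


organisms = [
    "B. pumilus ATCC 14348", "S. epidermidis ATCC 155", "E. faecalis ATCC 35550",
    "S. aureus ATCC 25923", "S. aureus ATCC 9144", "S. aureus ATCC 14154",
    "S. aureus ATCC 29213", "S. aureus ATCC 700699", "(methicillin-resistant)",
    "S. aureus NRS 119", "(linezolid-resistant)", "E.faecalis ATCC 14506",
    "E.faecalis ATCC 700802", "(vancomycin-resistant)", "S.pyogenes ATCC 14289",
    "S.pneumoniae ATCC 700904", "(penicillin-resistant)",
]
mic = ["2-4", "1-2", "1-2", "1-2", "2-4", "2-4", "2-4", "1-2", ">16",
       "2-4", "1-2", "0.25 - 0.5", "0.25 - 0.5"]

pairs = pair_organisms_with_mic(organisms, mic)
for organism, value in pairs:
    print(organism, value, sep="\t")
```

The length check is deliberate: if a compound's table does not follow the assumed pattern, the function raises an error instead of silently pairing the wrong values.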

Regarding "python - how to parse tables without classes and keep the grouping", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43131267/
