


For this piece of HTML code:

html3= """<a name="definition"> </a>
<h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
<div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
<p> </p>"""

I want to use BeautifulSoup to find the h2 whose text is "Content Logical Definition", together with its next siblings, but BeautifulSoup cannot find the h2. The following is my code:

soup = BeautifulSoup(html3, "lxml")
f= soup.find("h2", text = "Content Logical Definition").nextsibilings

This is the error:

AttributeError: 'NoneType' object has no attribute 'nextsibilings'

There are several h2 elements in the page, and the only thing that makes this one unique is the text "Content Logical Definition". After finding this h2, I want to extract the data from the table and list under it.


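The main problem is the way you are locating the h2 element to find siblings from. With a text filter, find() compares against the tag's .string, which is None here because the h2 contains a span and a link as well as the text itself. I'd use a function instead, checking that Content Logical Definition is inside the text: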

soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)

Also, to get the next siblings you should use .next_siblings, not nextsibilings.
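To make the difference between the two lookups concrete, here is a minimal sketch. It assumes a trimmed-down copy of the heading from the question (the `href` value is made up), uses the built-in `html.parser` so it runs without lxml, and uses `string=`, which is the current name of the older `text=` argument.

```python
from bs4 import BeautifulSoup

# Trimmed-down heading from the question; the href value is a placeholder.
html = (
    '<h2><span class="sectioncount">3.342.2323</span> '
    'Content Logical Definition <a class="self-link" href="#x"></a></h2>'
)
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# The h2 has three children (span, text, a), so .string is None,
# and a string filter on find() has nothing to match against.
print(h2.string)
print(soup.find("h2", string="Content Logical Definition"))

# .text joins every descendant string, so a substring test on it works.
found = soup.find(
    lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text
)
print(found is h2)
```

The first two prints show `None`; the lambda-based lookup returns the heading itself.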


>>> from bs4 import BeautifulSoup
>>> html3= """<a name="definition"> </a>
... <h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
... <hr/>
... <div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
... <p> </p>"""
>>> soup = BeautifulSoup(html3, "lxml")
>>> h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
>>> for sibling in h2.next_siblings:
...     print(sibling)
<hr/>
<div><p following="" from="" the=""></p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td></td></tr><tr><td>35453453453</td><td>History/symptoms</td><td></td></tr></table></li></ul></div>
<p> </p>
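One detail worth keeping in mind with the loop above: .next_siblings also yields the whitespace text nodes between tags, while find_next_siblings() returns tags only. A small sketch, assuming simplified markup and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified markup: a heading followed by three sibling tags.
html = "<h2>Content Logical Definition</h2>\n<hr/>\n<div>body</div>\n<p> </p>"
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# .next_siblings is a generator over every following node,
# including the "\n" strings between the tags.
all_nodes = list(h2.next_siblings)
text_nodes = [node for node in all_nodes if not isinstance(node, Tag)]

# find_next_siblings() with no arguments returns only Tag objects.
tags_only = h2.find_next_siblings()

print([tag.name for tag in tags_only])  # ['hr', 'div', 'p']
print(len(all_nodes), len(text_nodes))
```

If only elements matter, find_next_siblings() saves filtering out the strings by hand.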

That said, now that I know the real HTML you are dealing with and how messy it can be, I think you should iterate over the siblings and stop at the next h2, or earlier if you find a table first. Actual implementation:

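import requests
from bs4 import BeautifulSoup

urls = [
    # (the URL list is missing from the source)
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    # Locate the h2 whose text contains the section title.
    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    table = None
    for sibling in h2.find_next_siblings():
        if sibling.name == "table":
            table = sibling
            break  # found the table for this section
        if sibling.name == "h2":
            break  # reached the next section without finding a table
    print(table)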
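The same idea can be tried offline. The sketch below is a variation, not the code from the answer: it assumes sample markup modelled on the snippet in the question, where the table sits inside a div rather than directly next to the h2, so it also looks inside each sibling. The helper name `table_after_heading` and the surrounding "Other section" and "Next section" headings are made up for illustration, and `html.parser` is used so lxml is not required.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the extra sections are invented.
html = """<h2>Other section</h2>
<table><tr><td>ignore</td></tr></table>
<h2><span class="sectioncount">3.342.2323</span> Content Logical Definition</h2>
<hr/>
<div><ul><li>Include these codes<table>
<tr><td><b>Code</b></td><td><b>Display</b></td></tr>
<tr><td>34353553</td><td>Examination / signs</td></tr>
<tr><td>35453453453</td><td>History/symptoms</td></tr>
</table></li></ul></div>
<h2>Next section</h2>
<table><tr><td>also ignore</td></tr></table>"""


def table_after_heading(soup, heading_text):
    """Return the first table between the matching h2 and the next h2, or None."""
    h2 = soup.find(lambda elm: elm.name == "h2" and heading_text in elm.text)
    if h2 is None:
        return None
    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            return None      # reached the next section
        if sibling.name == "table":
            return sibling   # the table is a direct sibling
        nested = sibling.find("table")
        if nested is not None:
            return nested    # the table is nested inside a sibling
    return None


soup = BeautifulSoup(html, "html.parser")
table = table_after_heading(soup, "Content Logical Definition")

# Turn each row into a list of cell texts.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
]
print(rows)
```

Stopping at the next h2 keeps tables from later sections out of the result, and the None check avoids the AttributeError from the question when the heading is missing.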


10-28 12:58