Problem description
For this piece of HTML:

    html3 = """<a name="definition"> </a>
    <h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
    <hr/>
    <div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
    <p> </p>"""

I want to use BeautifulSoup to find the h2 whose text is "Content Logical Definition", together with its next siblings. But BeautifulSoup cannot find the h2. This is my code:

    soup = BeautifulSoup(html3, "lxml")
    f = soup.find("h2", text = "Content Logical Definition").nextsibilings

This is the error:

    AttributeError: 'NoneType' object has no attribute 'nextsibilings'

There are several h2 elements in the document, and the only thing that makes this one unique is the text "Content Logical Definition". After finding this h2, I want to extract the data from the table and the list under it.
Solution

The main problem is the way you are locating the h2 element to find siblings from. I'd use a function instead, checking that "Content Logical Definition" is inside the text:

    soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)

Also, to get the next siblings you should use .next_siblings and not nextsibilings.

Demo:

    >>> from bs4 import BeautifulSoup
    >>> html3= """<a name="definition"> </a>
    ... <h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
    ... <hr/>
    ... <div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
    ... <p> </p>"""
    >>> soup = BeautifulSoup(html3, "lxml")
    >>> h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    >>> for sibling in h2.next_siblings:
    ...     print(sibling)
    ...
    <hr/>
    <div><p following="" from="" the=""></p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td></td></tr><tr><td>35453453453</td><td>History/symptoms</td><td></td></tr></table></li></ul></div>
    <p> </p>
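
To make the difference between the two lookups concrete, here is a minimal sketch. It assumes a shortened version of the question's markup and uses the built-in "html.parser" instead of lxml, so it runs without extra dependencies; the helper name is_target is made up for illustration. An exact-string filter is compared against the tag's .string, which is None when the tag has more than one child, so the lookup returns None. The function-based filter checks the concatenated text of the tag instead.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (an assumption for this sketch).
html = (
    '<h2><span class="sectioncount">3.342.2323</span> '
    'Content Logical Definition <a class="self-link" href="#x"></a></h2>'
    '<hr/><p>after</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Exact-string filter: matched against h2.string, which is None here
# because the h2 has several children (span, text node, a).
exact = soup.find("h2", string="Content Logical Definition")


def is_target(elm):
    """Hypothetical helper: True for an h2 whose full text contains the phrase."""
    return elm.name == "h2" and "Content Logical Definition" in elm.get_text()


# Function filter: substring test on the tag's concatenated text.
h2 = soup.find(is_target)

# find_next_siblings() yields only tags, skipping whitespace text nodes.
sibling_names = [s.name for s in h2.find_next_siblings()]
```

Here exact is None while h2 is the heading, which is why the original call raised AttributeError before the misspelled attribute name was even looked up on a real tag.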

Though, now that I know the real HTML you are dealing with and how messed up it can be, I think you should iterate over the siblings and stop at the next h2, or earlier if you find a table before that. Actual implementation:

    import requests
    from bs4 import BeautifulSoup

    urls = [
        'https://www.hl7.org/fhir/valueset-activity-reason.html',
        'https://www.hl7.org/fhir/valueset-age-units.html'
    ]

    for url in urls:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')

        # locate the heading by a substring of its text
        h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)

        table = None
        for sibling in h2.find_next_siblings():
            # a table that directly follows the heading is the one we want
            if sibling.name == "table":
                table = sibling
                break
            # reaching the next h2 means this section has no table
            if sibling.name == "h2":
                break

        print(table)
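
The question also asks for the table data itself. The following is one possible sketch of that last step, not part of the original answer. It assumes markup shaped like the question's sample, where the table is nested inside a sibling div rather than being a direct sibling of the h2, so each sibling is also searched for a nested table. The function names table_after_heading and table_rows are made up for illustration, and "html.parser" is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question (an assumption for this sketch).
html = """<h2><span>3.1</span> Content Logical Definition</h2>
<hr/>
<div><ul><li>Include these codes
<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr>
<tr><td>34353553</td><td>Examination / signs</td></tr>
<tr><td>35453453453</td><td>History/symptoms</td></tr></table>
</li></ul></div>
<h2>Another section</h2>
<table><tr><td>unrelated</td></tr></table>"""


def table_after_heading(soup, heading_text):
    """Return the first table between the matching h2 and the next h2, or None."""
    h2 = soup.find(lambda elm: elm.name == "h2" and heading_text in elm.get_text())
    if h2 is None:
        return None
    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            return None  # next section reached without finding a table
        if sibling.name == "table":
            return sibling
        nested = sibling.find("table")  # table wrapped in a div/ul/li
        if nested is not None:
            return nested
    return None


def table_rows(table):
    """Turn a table tag into a list of rows, each a list of cell strings."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in table.find_all("tr")
    ]


soup = BeautifulSoup(html, "html.parser")
table = table_after_heading(soup, "Content Logical Definition")
rows = table_rows(table)
```

Returning None when the heading is missing avoids the AttributeError from the question, so callers only need a single None check before calling table_rows.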