Replacing line breaks with a period and a space in BeautifulSoup

Problem description

I am scraping a few links with BeautifulSoup.

Here is the relevant portion of the source code of the URL I am scraping:

<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>

Here is my BeautifulSoup code (relevant part only) to get the text within the description div:

quote_page = sys.argv[1]
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')

description_box = soup.find('div', {'class':'description'})
description = description_box.get_text(separator=" ").strip()
print description

Running the script with python script.py https://example.com/page/2000 gives the following output:

Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.

How can I replace the line break with a period followed by a space, so that it looks like the following:

Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.

Any ideas how I can do that?

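Recommended answer

Split the HTML on newlines, then rejoin it with a period and a space in place of the nth newline: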

from bs4 import BeautifulSoup

html = '''<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>'''
n = 2                                # occurrence, i.e. the 2nd newline in this case
sep = '\n'                           # separator, i.e. newline
cells = html.split(sep)

# Rejoin the pieces, putting ". " where the nth newline used to be.
html = sep.join(cells[:n]) + ". " + sep.join(cells[n:])
soup = BeautifulSoup(html, 'html.parser')
title_box = soup.find('div', attrs={'class': 'description'})
title = title_box.get_text().strip()
print(title)

Output:

Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.

Edit:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://blablabla.com")
soup = BeautifulSoup(page.content, 'html.parser')
description_box = soup.find('div', attrs={'class': 'description'})
description = description_box.get_text().strip()

# strip() already removed the leading newline, so for the HTML in the
# question the line break to replace is the 1st newline, not the 2nd.
n = 1                                # occurrence, i.e. the 1st newline in this case
sep = '\n'                           # separator, i.e. newline
cells = description.split(sep)
desired = sep.join(cells[:n]) + ". " + sep.join(cells[n:])

print(desired)
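
The snippets above replace a single newline at a fixed position. If the description may contain any number of lines, a more general option is to join every non-empty line with a period and a space. The following is only a minimal sketch, not part of the original answer: the helper name join_lines_with_period is hypothetical, and the text variable is a stand-in for what description_box.get_text() would return for the HTML in the question (shortened here).

```python
def join_lines_with_period(text):
    """Join the non-empty lines of text with '. ' (hypothetical helper)."""
    # splitlines() handles \n and \r\n; blank lines are dropped after stripping.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return ". ".join(lines)


# Stand-in for description_box.get_text() (assumed value, shortened).
text = (
    "\n"
    "Planet Nine was initially proposed to explain the clustering of orbits\n"
    "Of Planet Nine's other effects, one was unexpected.\n"
)

result = join_lines_with_period(text)
print(result)
```

Note that this does not check whether a line already ends with a period, so such a line would end up with a doubled period before the next one. If that can happen in the scraped text, strip trailing punctuation from each line before joining.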

07-24 18:49