Problem description
I'm scraping news articles from this site: https://nypost.com/search/China+COVID-19/page/2/?orderby=relevance. I use a for loop to get the content of each news article, but I can't combine the paragraphs of each article. My goal is to store each article as a single string, and all of the strings should be stored in the myarticle list.
When I print(myarticle[0]), it gives me all the articles; I expect it to give me one single article.
Any help would be greatly appreciated!
for pagelink in pagelinks:
    # get page text
    page = requests.get(pagelink)
    # parse with BeautifulSoup
    soup = bs(page.text, 'lxml')
    containerr = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
    articletext = containerr.find_all('p')
    for paragraph in articletext:
        # get the text only
        text = paragraph.get_text()
        paragraphtext.append(text)
    # combine all paragraphs into an article
    thearticle.append(paragraphtext)

# join paragraphs to re-create the article
myarticle = [''.join(article) for article in thearticle]
print(myarticle[0])
For clarification purposes, the full code is attached below:
# imports implied by the calls below
import requests
from bs4 import BeautifulSoup as bs
from time import time, sleep
from random import randint
from warnings import warn
from IPython.display import clear_output

def scrape(url):
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    request = 0
    urls = [f"{url}{x}" for x in range(1, 2)]
    params = {
        "orderby": "relevance",
    }
    pagelinks = []
    title = []
    thearticle = []
    paragraphtext = []
    for page in urls:
        response = requests.get(url=page,
                                headers=user_agent,
                                params=params)
        # controlling the crawl-rate
        start_time = time()
        # pause the loop
        sleep(randint(8, 15))
        # monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request / elapsed_time))
        clear_output(wait=True)
        # throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(request, response.status_code))
        # break the loop if the number of requests is greater than expected
        if request > 72:
            warn('Number of request was greater than expected.')
            break
        # parse the content
        soup_page = bs(response.text, 'lxml')
        # select all the articles for a single page
        containers = soup_page.findAll("li", {'class': 'article'})
        # scrape the links of the articles
        for i in containers:
            url = i.find('a')
            pagelinks.append(url.get('href'))
        # scrape the titles of the articles
        for i in containers:
            atitle = i.find(class_='entry-heading').find('a')
            thetitle = atitle.get_text()
            title.append(thetitle)
    for pagelink in pagelinks:
        # get page text
        page = requests.get(pagelink)
        # parse with BeautifulSoup
        soup = bs(page.text, 'lxml')
        containerr = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
        articletext = containerr.find_all('p')
        for paragraph in articletext:
            # get the text only
            text = paragraph.get_text()
            paragraphtext.append(text)
        # combine all paragraphs into an article
        thearticle.append(paragraphtext)
    # join paragraphs to re-create the article
    myarticle = [''.join(article) for article in thearticle]
    print(myarticle[0])

print(scrape('https://nypost.com/search/China+COVID-19/page/'))
Recommended answer
You keep appending to the same existing list; it is never cleared, so it just keeps growing across iterations. You need to reset it in every loop.
articletext = containerr.find_all('p')
for paragraph in articletext:
    # get the text only
    text = paragraph.get_text()
    paragraphtext.append(text)
# combine all paragraphs into an article
thearticle.append(paragraphtext)
# join paragraphs to re-create the article
myarticle = [''.join(article) for article in thearticle]
should be
articletext = containerr.find_all('p')
thearticle = []      # clear from the previous loop
paragraphtext = []   # clear from the previous loop
for paragraph in articletext:
    # get the text only
    text = paragraph.get_text()
    paragraphtext.append(text)
thearticle.append(paragraphtext)
myarticle.append(thearticle)
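To make the effect of the missing reset concrete, here is a minimal, self-contained sketch (toy data, nothing to do with the NY Post pages) of what happens when the same list object is appended on every iteration versus being recreated inside the loop:

# Toy demonstration (hypothetical data): the same list object is appended
# on every pass, so every stored entry keeps growing together.
shared = []
collected = []
for batch in (["a", "b"], ["c"]):
    for item in batch:
        shared.append(item)
    collected.append(shared)   # same object stored twice
print(collected[0])            # ['a', 'b', 'c'] -- not just the first batch

# Recreating the buffer inside the loop keeps each entry separate.
collected = []
for batch in (["a", "b"], ["c"]):
    fresh = []                 # cleared for every batch
    for item in batch:
        fresh.append(item)
    collected.append(fresh)
print(collected[0])            # ['a', 'b']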
But you could simplify the original code even further to:
article = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
myarticle.append(article.get_text())
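Putting that simplification back into the question's pagelink loop, a minimal sketch of the extraction step could look like the following. The URL list and class names come from the question; the separator and strip arguments to get_text() are my addition to keep the joined paragraphs readable, and the None check is a defensive assumption in case a result page lacks that container:

myarticle = []
for pagelink in pagelinks:
    # fetch and parse one article page
    page = requests.get(pagelink)
    soup = bs(page.text, 'lxml')
    article = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
    if article is None:
        continue
    # one string per article
    myarticle.append(article.get_text(separator=' ', strip=True))

print(myarticle[0])  # first article only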