This post covers scraping articles with BeautifulSoup: pulling all <p> tags from a page.

Problem Description


I wrote a script that pulls paragraphs from articles and writes them to a file. For some articles, it won't pull every paragraph. This is where I am lost. Any guidance would be deeply appreciated. I have included a link to a particular article where it isn't pulling all of the information. It scrapes everything up until the first quoted sentence.

URL: http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

# Open txt document for output
txt = open('ctp_output.txt', 'w')

# Parse HTML of article
soup = BeautifulSoup(urllib2.urlopen(url).read())

# retrieve all of the paragraph tags
tags = soup('p')
for tag in tags:
    txt.write(tag.get_text() + '\n' + '\n')

Solution

This is what works for me:

import urllib2
from bs4 import BeautifulSoup

url = "http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306"

soup = BeautifulSoup(urllib2.urlopen(url))

with open('ctp_output.txt', 'w') as f:
    for tag in soup.find_all('p'):
        f.write(tag.text.encode('utf-8') + '\n')

Note that you should use the with context manager when working with files, so the file is closed even if an exception occurs. You can also pass urllib2.urlopen(url) directly to the BeautifulSoup constructor, since urlopen returns a file-like object.

Hope that helps.
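The answer above targets Python 2 (urllib2, raw_input, bytes written with .encode). For readers on Python 3, here is a rough sketch of the same approach; the function names (extract_paragraphs, scrape_to_file) are illustrative, not from the original, and it assumes bs4 is installed:

```python
from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

def extract_paragraphs(markup):
    """Return the text of every <p> tag in the given HTML markup."""
    # Naming the parser explicitly ('html.parser' ships with Python) keeps
    # results consistent regardless of which optional backends are installed.
    soup = BeautifulSoup(markup, 'html.parser')
    return [tag.get_text() for tag in soup.find_all('p')]

def scrape_to_file(url, out_path='ctp_output.txt'):
    """Fetch a page and write its paragraph text to out_path."""
    # urlopen returns a file-like object; .read() yields the raw HTML bytes
    with urlopen(url) as resp:
        paragraphs = extract_paragraphs(resp.read())
    # Opening with an explicit encoding avoids the UnicodeEncodeError that
    # the .encode('utf-8') call worked around in Python 2
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write('\n\n'.join(paragraphs))
```

Note that different parsers (html.parser, lxml, html5lib) can recover differently from malformed markup, which is one plausible reason a scrape stops partway through a page; if paragraphs go missing, trying another parser backend is worth a shot.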
