Problem description
I wrote a script that pulls paragraphs from articles and writes them to a file. For some articles, it won't pull every paragraph. This is where I am lost. Any guidance would be deeply appreciated. I have included a link to a particular article where it isn't pulling all of the information. It scrapes everything up until the first quoted sentence.
URL: http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306
import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")
# Open txt document for output
txt = open('ctp_output.txt', 'w')
# Parse HTML of article
soup = BeautifulSoup(urllib2.urlopen(url).read())
# Retrieve all of the paragraph tags
tags = soup('p')
for tag in tags:
    txt.write(tag.get_text() + '\n' + '\n')
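One thing that can explain this kind of gap (an assumption on my part, not something the post above confirms) is which HTML parser BeautifulSoup falls back to: different parsers recover differently from malformed markup such as unbalanced quotes, so the article may be truncated during parsing rather than during the write loop. A minimal diagnostic sketch, using the Reuters URL above, compares the paragraph counts under two standard bs4 parser names ('html.parser' and 'lxml'; the latter requires the lxml package):

import urllib2
from bs4 import BeautifulSoup

url = "http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306"
html = urllib2.urlopen(url).read()

# Count <p> tags under each parser; a mismatch suggests the default parser
# is bailing out on malformed markup partway through the article.
for parser in ('html.parser', 'lxml'):  # 'lxml' requires the lxml package
    soup = BeautifulSoup(html, parser)
    print parser, len(soup.find_all('p'))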
This is what works for me:
import urllib2
from bs4 import BeautifulSoup

url = "http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306"
soup = BeautifulSoup(urllib2.urlopen(url))

with open('ctp_output.txt', 'w') as f:
    for tag in soup.find_all('p'):
        f.write(tag.text.encode('utf-8') + '\n')
Note that you should use the with context manager when working with files. Also, you can pass urllib2.urlopen(url) directly to the BeautifulSoup constructor, since urlopen returns a file-like object.
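For reference, here is a rough Python 3 equivalent (an assumption that you might be running Python 3; the code above uses Python 2's urllib2 and raw_input). Opening the file in text mode with an explicit encoding means no manual .encode('utf-8') is needed:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306"
soup = BeautifulSoup(urlopen(url), 'html.parser')

# A text-mode file with an explicit encoding handles the unicode for us.
with open('ctp_output.txt', 'w', encoding='utf-8') as f:
    for tag in soup.find_all('p'):
        f.write(tag.get_text() + '\n')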
Hope that helps.