This post covers scraping articles with BeautifulSoup: pulling all <p> tags from a page.

Problem Description


I wrote a script that pulls paragraphs from articles and writes them to a file. For some articles, it won't pull every paragraph. This is where I am lost. Any guidance would be deeply appreciated. I have included a link to a particular article where it isn't pulling all of the information. It scrapes everything up until the first quoted sentence.

URL: http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

# Open txt document for output
txt = open('ctp_output.txt', 'w')

# Parse HTML of article
soup = BeautifulSoup(urllib2.urlopen(url).read())

# retrieve all of the paragraph tags
tags = soup('p')
for tag in tags:
    txt.write(tag.get_text() + '\n' + '\n')

Solution

This is what works for me:

import urllib2
from bs4 import BeautifulSoup

url = "http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306"

soup = BeautifulSoup(urllib2.urlopen(url))

with open('ctp_output.txt', 'w') as f:
    for tag in soup.find_all('p'):
        f.write(tag.text.encode('utf-8') + '\n')

Note that you should use the with context manager when working with files, so the file is closed even if an exception occurs. You can also pass urllib2.urlopen(url) directly to the BeautifulSoup constructor, since urlopen returns a file-like object.

Hope that helps.
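The answer above targets Python 2 (urllib2, raw_input, bytes written with .encode). For readers on Python 3, here is a rough sketch of the same approach; the function names (extract_paragraphs, scrape_to_file) are illustrative, not from the original, and it assumes bs4 is installed:

```python
from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

def extract_paragraphs(markup):
    """Return the text of every <p> tag in the given HTML markup."""
    # Naming the parser explicitly ('html.parser' ships with Python) keeps
    # results consistent regardless of which optional backends are installed.
    soup = BeautifulSoup(markup, 'html.parser')
    return [tag.get_text() for tag in soup.find_all('p')]

def scrape_to_file(url, out_path='ctp_output.txt'):
    """Fetch a page and write its paragraph text to out_path."""
    # urlopen returns a file-like object; .read() yields the raw HTML bytes
    with urlopen(url) as resp:
        paragraphs = extract_paragraphs(resp.read())
    # Opening with an explicit encoding avoids the UnicodeEncodeError that
    # the .encode('utf-8') call worked around in Python 2
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write('\n\n'.join(paragraphs))
```

Note that different parsers (html.parser, lxml, html5lib) can recover differently from malformed markup, which is one plausible reason a scrape stops partway through a page; if paragraphs go missing, trying another parser backend is worth a shot.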
