Problem description
For example, the web page is this link:
I need the names of the firms along with their addresses and websites. I have tried the following to convert the HTML to text:
import nltk
from urllib import urlopen
url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)
But it returns the error:
ImportError: cannot import name 'urlopen'
Peter Wood has answered your problem (link).
import urllib.request
uf = urllib.request.urlopen(url)
html = uf.read()
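Putting that fix together, here is a minimal, runnable sketch of the Python 3 fetch step (the URL is the one from the question; note that `nltk.clean_html` was removed in NLTK 3.x, so the original clean-up call would fail even after the import is fixed):

```python
# Python 3: urlopen lives in urllib.request, not in urllib itself.
from urllib.request import urlopen

url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"

# .read() returns bytes; decode to get a str you can pass to a parser.
html = urlopen(url).read().decode("utf-8", errors="replace")
print(html[:200])  # peek at the start of the page source
```

This requires network access; the decoding step matters because `read()` yields bytes in Python 3.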
But if you want to extract data (such as the firm's name, address and website), then you will need to fetch the HTML source and parse it with an HTML parser.
I'd suggest using requests to fetch the HTML source and BeautifulSoup to parse the resulting HTML and extract the text you require.
Here is a small snippet to give you a head start.
import requests
from bs4 import BeautifulSoup
link = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
html = requests.get(link).text
"""If you do not want to use requests then you can use the following code below
with urllib (the snippet above). It should not cause any issue."""
soup = BeautifulSoup(html, "lxml")
res = soup.find_all("article", {"class": "listingItem"})
for r in res:
    print("Company Name: " + r.find("a").text)
    print("Address: " + r.find("div", {"class": "address"}).text)
    print("Website: " + r.find_all("div", {"class": "pageMeta-item"})[3].text)