Problem description

For example, the web page is this link:

I need the names of the firms, their addresses, and their websites. I have tried the following to convert the HTML to text:

import nltk
from urllib import urlopen

url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx display=50"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)

But it returns the error:

ImportError: cannot import name 'urlopen'
Solution

Peter Wood has already answered your problem (link): in Python 3, urlopen moved to urllib.request, so you have to import it from there.

import urllib.request

uf = urllib.request.urlopen(url)
html = uf.read()
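Note that urlopen returns bytes, so the page usually needs decoding before you can treat it as text. Here is a minimal sketch; it uses a data: URL so it runs without network access, but with the real site you would pass the https:// page URL instead:

```python
import urllib.request

# A data: URL stands in for the real page so this sketch runs offline;
# in practice you would pass the https:// URL from the question.
url = "data:text/html;charset=utf-8,<p>hello</p>"

uf = urllib.request.urlopen(url)
html = uf.read().decode("utf-8")  # .read() gives bytes; decode to str
print(html)  # → <p>hello</p>
```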

But if you want to extract data (such as the name of the firm, its address and website), then you will need to fetch the HTML source and parse it with an HTML parser.

I'd suggest using requests to fetch the HTML source and BeautifulSoup to parse the resulting HTML and extract the text you require.

Here is a small snippet which will give you a head start.

import requests
from bs4 import BeautifulSoup

link = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"

html = requests.get(link).text

"""If you do not want to use requests then you can use the following code below
   with urllib (the snippet above). It should not cause any issue."""
soup = BeautifulSoup(html, "lxml")
# Each listing on the page sits in an <article class="listingItem"> element.
res = soup.find_all("article", {"class": "listingItem"})
for r in res:
    print("Company Name: " + r.find("a").text)
    print("Address: " + r.find("div", {"class": "address"}).text)
    # The website is the fourth "pageMeta-item" div in each listing.
    print("Website: " + r.find_all("div", {"class": "pageMeta-item"})[3].text)
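If installing BeautifulSoup is not an option either, the standard library's html.parser can pull out the same kind of fields. This is only a sketch: the sample HTML below is a made-up stand-in for the page's assumed structure (an article.listingItem containing a name link and an address div), not the site's real markup.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collects text from <a> and <div class="address"> inside
    <article class="listingItem"> blocks (assumed page structure)."""

    def __init__(self):
        super().__init__()
        self.in_article = False
        self.capture = None   # which field the next text node belongs to
        self.listings = []    # one dict per <article>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article" and attrs.get("class") == "listingItem":
            self.in_article = True
            self.listings.append({})
        elif self.in_article and tag == "a":
            self.capture = "name"
        elif self.in_article and tag == "div" and attrs.get("class") == "address":
            self.capture = "address"

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False
        if tag in ("a", "div"):
            self.capture = None

    def handle_data(self, data):
        if self.in_article and self.capture:
            self.listings[-1][self.capture] = data.strip()
            self.capture = None


# Made-up sample mirroring the assumed structure of the real page.
sample = """
<article class="listingItem">
  <a>Acme Architects</a>
  <div class="address">1 Example Street, London</div>
</article>
"""

parser = ListingParser()
parser.feed(sample)
for item in parser.listings:
    print("Company Name: " + item["name"])
    print("Address: " + item["address"])
```

The trade-off is clear from the code: HTMLParser is event-driven, so you track state (which element you are inside) yourself, whereas BeautifulSoup gives you the tree directly.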
