Question
I need some feedback on how to count the HTML images on a page with Python 3.01 after extracting them. Maybe my regular expression is not being used properly.
Here is my code:
import re, os
import urllib.request

def get_image(url):
    url = 'http://www.google.com'
    total = 0
    try:
        f = urllib.request.urlopen(url)
        for line in f.readline():
            line = re.compile('<img.*?src="(.*?)">')
            if total > 0:
                x = line.count(total)
                total += x
                print('Images total:', total)
    except:
        pass
Recommended answer
A few points about your code:
- It's much easier to use a dedicated HTML parsing library to parse your pages (that's the Python way). I personally prefer Beautiful Soup.
- You're overwriting your line variable inside the loop.
- total will always be 0 with your current logic, because it is only updated when it is already greater than 0.
- There's no need to compile your RE, as the re module caches recently used patterns.
- You're discarding your exception, so you get no clues about what's going on in the code!
- There could be other attributes on the <img> tags, so your regex is a little basic. Also, use the re.findall() method to catch multiple instances on the same line.
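The last point can be illustrated without touching the network. The sketch below is only an illustration: the sample markup and the looser pattern `<img\b[^>]*>` are assumptions made for this example, not something taken from the question. It compares the question's pattern with a pattern that accepts any attributes, on one line holding two images.

```python
import re

# Hypothetical sample line (not from the question): two <img> tags on the
# same line, the second one self-closing with a space after the src value.
sample = '<p><img src="a.png"><img src="b.png" /></p>'

# The question's pattern requires the tag to end right after the src value,
# so it misses the second tag.
strict = re.findall(r'<img.*?src="(.*?)">', sample)

# A looser pattern that accepts any attributes up to the closing '>'.
loose = re.findall(r'<img\b[^>]*>', sample)

print('strict pattern:', len(strict))
print('loose pattern:', len(loose))
```

Because re.findall() returns every non-overlapping match, taking len() of the result counts all images on the line rather than only the first.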
Changing your code around a little, I get:
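import re
from urllib.request import urlopen

def get_image(url):
    total = 0
    # Read the whole page as a list of lines (each line is a bytes object)
    page = urlopen(url).readlines()
    for line in page:
        # findall() returns every <img ...> tag on the line, not just the first;
        # str(line) turns the bytes into text so the str pattern can be applied
        hit = re.findall('<img.*?>', str(line))
        total += len(hit)
    print('{0} Images total: {1}'.format(url, total))

get_image("http://google.com")
get_image("http://flickr.com")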
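For the first point (use a real HTML parser rather than a regex), here is a minimal sketch. It swaps in the standard library's html.parser in place of Beautiful Soup so that it runs without installing anything. The names ImageCounter and count_images and the sample markup are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class ImageCounter(HTMLParser):
    """Counts <img> tags while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.total = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased. Self-closing tags such as <img />
        # also end up here via the default handle_startendtag().
        if tag == 'img':
            self.total += 1


def count_images(html):
    """Return the number of <img> tags found in an HTML string."""
    counter = ImageCounter()
    counter.feed(html)
    counter.close()
    return counter.total


# Hypothetical sample markup; in practice the string could come from
# urlopen(url).read().decode(), assuming the page encoding is known.
sample = '<p><img src="a.png"><IMG SRC="b.png" alt="B" /></p>'
print('Images total:', count_images(sample))
```

A parser handles upper-case tag names, extra attributes, and tags split across lines, which the line-by-line regex approach can miss.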