The lxml library

lxml is an HTML/XML parser. Its main job is to parse HTML/XML documents and extract data from them.

Basic usage:

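1. We can use it to parse HTML. If the HTML is malformed, lxml repairs it automatically while parsing (for example, by adding missing closing tags and wrapping the content in html and body elements).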

# Use the etree module from lxml

from lxml import etree

text = """

<div id="usrbar" alog-group="userbar" alog-alias="hunter-userbar-start"></div>

<ul id="header-link-wrapper" class="clearfix">

<li><a href="https://www.baidu.com/" data-path="s?wd=">网页</a></li>

<li style="margin-left:21px;"><span>新闻</span></li>

<li><a href="http://tieba.baidu.com/" data-path="f?kw=">贴吧</a></li>

<li><a href="https://zhidao.baidu.com/" data-path="search?ct=17&pn=0&tn=ikaslist&rn=10&lm=0&word=">知道</a></li>

<li><a href="http://music.baidu.com/" data-path="search?fr=news&ie=utf-8&key=">音乐</a></li>

<li><a href="http://image.baidu.com/" data-path="search/index?ct=201326592&cl=2&lm=-1&tn=baiduimage&istype=2&fm=&pv=&z=0&word=">图片</a></li>

<li><a href="http://v.baidu.com/" data-path="v?ct=3019898888&ie=utf-8&s=2&word=">视频</a></li>

<li><a href="http://map.baidu.com/" data-path="?newmap=1&ie=utf-8&s=s%26wd%3D">地图</a></li>

<li><a href="http://wenku.baidu.com/" data-path="search?ie=utf-8&word=">文库</a></li>

<div class="header-divider"></div>

</ul>

</div>

"""

# Parse the string into an HTML document with etree.HTML

html_text = etree.HTML(text)        # html_text is an Element object (it supports XPath queries)

# Serialize the Element back into an HTML string

result = etree.tostring(html_text,encoding='utf-8').decode('utf-8')

print(result)
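
The automatic repair described in point 1 is easier to see on a very small input. The following is a minimal sketch, assuming a made-up fragment (the `broken` string is illustrative and not part of the original example): the second `li` is never closed and there is no `html` or `body` wrapper.

```python
from lxml import etree

# A deliberately broken fragment (hypothetical input for illustration):
# the second <li> is never closed and there is no <html>/<body> wrapper.
broken = "<ul><li>first</li><li>second</ul>"

# etree.HTML uses the lenient HTML parser and returns the root Element.
root = etree.HTML(broken)

# encoding="unicode" makes tostring return a str instead of bytes,
# so no separate .decode() call is needed.
fixed = etree.tostring(root, encoding="unicode")

print(root.tag)   # the parser adds an <html> root element
print(fixed)      # the unclosed <li> is closed in the serialized output
```

Passing `encoding="unicode"` is an alternative to the `encoding='utf-8'` plus `.decode('utf-8')` pattern used above; both yield a Python string.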

2. Reading HTML from a file:

from lxml import etree

# Read the external file hello.html

html = etree.parse('hello.html')

result = etree.tostring(html,pretty_print=True,encoding='utf-8').decode('utf-8')

print(result)

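In the code above, etree.parse uses the XML parser by default, so it raises an error if the HTML is not well-formed. In that case, switch to the HTML parser: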

from lxml import etree

# Read the external file hello.html

parser = etree.HTMLParser(encoding='utf-8')                    # use the HTML parser

html = etree.parse('hello.html',parser=parser)              # parse() defaults to the XML parser; pass an HTMLParser to parse HTML

result = etree.tostring(html,pretty_print=True,encoding='utf-8').decode('utf-8')

print(result)
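
The difference between the two parsers can be reproduced without a file on disk. This is a sketch under assumptions: the `markup` bytes are invented for the demonstration and `BytesIO` stands in for `hello.html`. The strict XML parser is expected to reject the input, while `HTMLParser` recovers from it.

```python
from io import BytesIO
from lxml import etree

# Malformed on purpose (hypothetical input): <p> and <br> are never closed,
# which XML does not allow.
markup = b"<html><body><p>hello<br>world</body></html>"

# Default parser (XML): strict, so mismatched tags raise XMLSyntaxError.
try:
    etree.parse(BytesIO(markup))
    xml_ok = True
except etree.XMLSyntaxError as exc:
    xml_ok = False
    print("XML parser failed:", exc)

# HTML parser: lenient, recovers from the same input.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(BytesIO(markup), parser=parser)
paragraphs = tree.xpath("//p")

print(xml_ok)
print(len(paragraphs))
```

Catching `etree.XMLSyntaxError` is one way to fall back to the HTML parser only when strict parsing fails; for real-world web pages it is usually simpler to use `HTMLParser` from the start.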

Using lxml with XPath:

from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')

html = etree.parse("tencent.html",parser=parser)

# 1. Get all tr tags

#trs = html.xpath("//tr")

#for tr in trs:

#    print(etree.tostring(tr,encoding='utf-8').decode("utf-8"))

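# 2. Get the second tr tag (//tr[2] matches every tr that is the second tr child of its parent; [0] takes the first match)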

#tr = html.xpath("//tr[2]")[0]

#print(etree.tostring(tr,encoding='utf-8').decode("utf-8"))

# 3. Get all tr tags whose class equals "even"

#trs = html.xpath("//tr[@class='even']")

#for tr in trs:

#    print(etree.tostring(tr,encoding='utf-8').decode("utf-8"))

# 4. Get the href attribute of every a tag

# trs = html.xpath("//a/@href")           # selects only the href values; unlike the queries above, each result is a string, not an element

# for tr in trs:

#     print(tr)

# 5. Get all job postings (plain text)

trs = html.xpath("//tr[position()>1]")

positions = []

for tr in trs:

    href = tr.xpath(".//a/@href")[0]           # the leading . makes the search relative to the current tr

    fullurl = "http://hr.tencent.com/" + href

    title = tr.xpath("./td[1]//text()")

    category = tr.xpath("./td[2]/text()")

    nums = tr.xpath("./td[3]/text()")

    address = tr.xpath("./td[4]/text()")

    pubtime = tr.xpath("./td[5]/text()")

    position = {

        'url': fullurl,

        'title': title,

        'category': category,

        'nums': nums,

        'address':address,

        'pubtime': pubtime

    }

    positions.append(position)

print(positions)
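
Note that every `xpath()` call with `text()` above returns a list of strings, so each field in `positions` is a list rather than a single value. The following is a self-contained sketch of one way to get clean strings. It makes several assumptions: the `sample` table is a made-up stand-in for tencent.html, `cell_text` is a hypothetical helper, and `urljoin` from the standard library replaces the manual string concatenation.

```python
from urllib.parse import urljoin
from lxml import etree

# Hypothetical stand-in for tencent.html: one header row, two data rows.
sample = """
<table>
  <tr class="h"><td>Title</td><td>Category</td><td>Openings</td><td>Location</td><td>Posted</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1"> Backend Engineer </a></td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2018-01-01</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Designer</a></td><td>Design</td><td>1</td><td>Beijing</td><td>2018-01-02</td></tr>
</table>
"""

root = etree.HTML(sample)


def cell_text(row, index):
    """Return the stripped text of the index-th <td> (1-based) in a row."""
    # XPath string() concatenates all descendant text into one str,
    # so no list indexing is needed.
    return row.xpath(f"string(./td[{index}])").strip()


positions = []
for tr in root.xpath("//tr[position()>1]"):      # skip the header row
    hrefs = tr.xpath(".//a/@href")
    if not hrefs:                                # no link: skip instead of raising IndexError
        continue
    positions.append({
        "url": urljoin("http://hr.tencent.com/", hrefs[0]),
        "title": cell_text(tr, 1),
        "category": cell_text(tr, 2),
        "nums": cell_text(tr, 3),
        "address": cell_text(tr, 4),
        "pubtime": cell_text(tr, 5),
    })

for position in positions:
    print(position)
```

Using `string(...)` trades flexibility for convenience: it flattens nested markup into one string, whereas `text()` keeps each text node as a separate list item.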