How to write a custom worker, with an example

Custom workers are for handling special requirements.

Writing a worker takes only two functions:

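1. The function decorated with @worker('xueqiu') is the worker named xueqiu. It takes two parameters:

data_dict is a dictionary holding the contents of the source's data, i.e. the data that drives the worker. It is produced by the second function, described in item 2.
worker_dict is also a dictionary. It can hold dynamic data for use the next time this worker runs. This example does not use it.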

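If an exception occurs while the worker runs, c_worker_exception(title, url='', summary='') can generate an exception message that describes the problem to the user more clearly. See html_re and html_json for how it is used.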

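2. The function decorated with @dataparser('xueqiu') is the XML parser for xueqiu. It translates the data in the source XML into a dictionary, which becomes the worker's data_dict parameter. It takes one parameter:

xml_string is the complete content of the source XML file, as a single string.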
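As an illustration of the worker_dict parameter from item 1, here is a minimal sketch of a worker-shaped function that remembers which items it has already seen, so a later run reports only new ones. Everything in it is an assumption made for the example: FAKE_FEED, fetch_ids and demo_worker are hypothetical names, the function is not registered with @worker, and it returns plain ids rather than whatever html_json.parse_html actually returns.

```python
# Hypothetical in-memory "site": maps a URL to the item ids it currently lists.
FAKE_FEED = {}


def fetch_ids(url):
    # Stand-in for a real download; a real worker would fetch and parse the page.
    return list(FAKE_FEED.get(url, []))


def demo_worker(data_dict, worker_dict):
    # Load the ids remembered from the previous run (empty on the first run).
    seen = set(worker_dict.get('seen_ids', []))

    ids = fetch_ids(data_dict['url'])

    # Keep only the ids that were not seen before, preserving their order.
    fresh = [i for i in ids if i not in seen]

    # Store the updated state for the next run.
    worker_dict['seen_ids'] = sorted(seen.union(ids))

    return fresh
```

Calling demo_worker twice with the same worker_dict simulates two consecutive runs: the first call returns every id, the second returns only ids that appeared in between.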

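Save the program below as xueqiu.py with UTF-8 encoding, put it in the src/workers directory, and restart the application to use it.

Usage is exactly the same as html_json, except that the worker in the XML must be changed to xueqiu.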

# coding=utf-8
import urllib.request
from http.cookiejar import CookieJar

from worker_manage import worker, dataparser

from . import html_json

ua = ('Mozilla/5.0 (Windows NT 6.1; rv:38.0)'
      ' Gecko/20100101 Firefox/38.0')


# Get cookies from the home page
def get_cookies():
    # build opener (an empty ProxyHandler means no proxy is used)
    proxy = urllib.request.ProxyHandler({})
    cj = urllib.request.HTTPCookieProcessor(CookieJar())
    opener = urllib.request.build_opener(proxy, cj)

    # request
    req = urllib.request.Request('https://xueqiu.com/')
    req.add_header('User-Agent', ua)

    # open; the response body is not needed, only the cookies it sets
    opener.open(req)

    # cj is the cookie handler, reused by get_url
    return cj


# Download the given URL
def get_url(cj, url):
    # build opener
    proxy = urllib.request.ProxyHandler({})
    opener = urllib.request.build_opener(proxy, cj)

    # request
    req = urllib.request.Request(url)
    req.add_header('User-Agent', ua)

    # open
    r = opener.open(req)
    ret_data = r.read().decode('utf-8')

    return ret_data


@worker('xueqiu')
def xueqiu_worker(data_dict, worker_dict):
    # get cookies
    cj = get_cookies()

    # download the given URL with those cookies
    url = data_dict['url']
    string = get_url(cj, url)

    # parse the data with html_json
    return html_json.parse_html(data_dict, url, string)


@dataparser('xueqiu')
def xueqiu_parser(xml_string):
    return html_json.html_json_parser(xml_string)

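The xueqiu worker is bare-bones: it does not handle network timeouts or automatic retries, and it does not use c_worker_exception to produce clearer exception messages.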
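One possible way to add a timeout and automatic retries is sketched here. The helper names retry and read_with_timeout, and the default values for retries, delay and timeout, are assumptions made for this example and are not part of the project. The sketch relies only on the standard library: opener.open accepts a timeout argument, and urllib.error.URLError and timeout errors are both subclasses of OSError.

```python
import time
import urllib.error
import urllib.request


def retry(func, retries=3, delay=1.0):
    # Call func() up to `retries` times, sleeping `delay` seconds between tries.
    for attempt in range(1, retries + 1):
        try:
            return func()
        except OSError:
            # URLError and timeout errors are OSError subclasses.
            if attempt == retries:
                # Out of attempts: let the caller see the last error. This is
                # also where a worker could build a clearer message with
                # c_worker_exception (see html_re for its actual usage).
                raise
            time.sleep(delay)


def read_with_timeout(opener, req, timeout=10):
    # Same as the body of get_url, but gives up after `timeout` seconds.
    with opener.open(req, timeout=timeout) as r:
        return r.read().decode('utf-8')
```

Inside get_url, the last three lines could then become something like `return retry(lambda: read_with_timeout(opener, req))`.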

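If it is used to fetch domestic (Chinese) financial trading data, it can be refined further by adding a time check that returns an empty list immediately outside trading hours.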
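A minimal sketch of such a time check follows. It assumes the A-share continuous trading sessions (09:30 to 11:30 and 13:00 to 15:00 Beijing time, Monday to Friday) and ignores public holidays; other markets keep different hours, so the SESSIONS table would need adjusting. The names CST, SESSIONS and is_trading_time are made up for this example.

```python
from datetime import datetime, time, timedelta, timezone

# Beijing time is UTC+8 all year round (no daylight saving time).
CST = timezone(timedelta(hours=8))

# Assumed trading sessions; public holidays are not handled.
SESSIONS = (
    (time(9, 30), time(11, 30)),
    (time(13, 0), time(15, 0)),
)


def is_trading_time(now=None):
    # A naive datetime is interpreted as system local time by astimezone().
    now = (now or datetime.now(CST)).astimezone(CST)

    # Monday is 0, so 5 and 6 are Saturday and Sunday.
    if now.weekday() >= 5:
        return False

    t = now.time()
    return any(start <= t <= end for start, end in SESSIONS)
```

The worker could then start with `if not is_trading_time(): return []` before doing any network access.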