Purpose
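I've been house-hunting recently and have gone through all kinds of listings. The data on lianjia seems fairly reliable (at least the listing photos are not far off from reality), so on a whim I decided to build a few charts to display the housing data and get some practice along the way.
Requirements
1 Fetch the key housing data from the lianjia website and display it in charts tailored to my own needs.
2 Fetch the data from the lianjia website once a day.
3 Focus on the Shanghai area (where I live).
4 The final charts are: total homes sold, average second-hand home price, homes currently on sale, sales in the last 90 days, and number of viewings yesterday.
Analyzing and fetching the site data
1 Data sources
The data comes mainly from two pages:
http://sh.lianjia.com/chengjiao/ // source of the transaction volume statistics
Data on the page (the figure shown is the count before logging in; it seems to be slightly higher after logging in):
http://sh.lianjia.com/ershoufang/ // source of the second-hand housing data
Data on the page:
2 Fetching method
scrapy is the first thing that comes to mind for fetching web page data, but since the data needed here is neither large nor complex, urllib.request is enough. Later, because tornado's async features are used, it is replaced with httpclient.AsyncHTTPClient().fetch().
3 Fetching the data with urllib.request
First, to scrape data from a page, use the basic function obtain_page_data: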
import urllib.request

def obtain_page_data(target_url):
    # Request the page at target_url and return its body decoded as UTF-8
    with urllib.request.urlopen(target_url) as f:
        data = f.read().decode('utf8')
    return data
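obtain_page_data() simply requests the given page and returns its content.
Next, the fetched pages need to be parsed for the data we want, in two main parts:
1) Total homes sold (http://sh.lianjia.com/chengjiao/)
Define a function get_total_dealed_house() that returns the total number of sales shown on the page. After calling obtain_page_data() to get the page data, work out where this figure sits in the HTML.
The figure sits inside a div, so after parsing the HTML with BeautifulSoup, its text can be obtained with:
dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
Once the text is found, a regular expression strips out the non-digit characters, leaving the number. The full function: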
def get_total_dealed_house(target_url):
    # Get the total number of homes sold
    page_data = obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
    dealed_house_num = re.findall(r'\d+', dealed_house)[0]
    return int(dealed_house_num)
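To look at the extraction step on its own, here is a minimal sketch that uses only the standard library. The sample text is made up for illustration (the real text of the list-head div may differ), and extract_first_int is a hypothetical helper name, not part of the original script.

```python
import re

def extract_first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    matches = re.findall(r'\d+', text)
    return int(matches[0]) if matches else None

# Made-up sample standing in for the text of the 'list-head' div
sample_text = "Total 51691 homes sold in Shanghai"
total = extract_first_int(sample_text)
print(total)
```

One thing to keep in mind with this approach: if the page ever formats the number with thousands separators (for example 51,691), the first match would only be the leading group of digits.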
2) Other online data (http://sh.lianjia.com/ershoufang/)
Similarly, first work out where the wanted data sits in the page, then fetch and filter it:
def get_online_data(target_url):
    # Get the city's average listing price, number on sale, sales in the last 90 days, viewings yesterday
    page_data = obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    online_data_str = soup_obj.html.body.find('div', {'class': 'secondcon'}).text
    online_data = online_data_str.replace('\n', '')
    avg_price, on_sale, _, sold_in_90, yesterday_check_num = re.findall(r'\d+', online_data)
    return {'avg_price': avg_price, 'on_sale': on_sale, 'sold_in_90': sold_in_90, 'yesterday_check_num': yesterday_check_num}
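The five-way unpacking above fails with a fairly cryptic ValueError if the page layout changes and the number of digit groups is no longer five. As a sketch of a more defensive variant, assuming the same five-group layout: parse_online_data and FIELDS are hypothetical names, the sample text is invented, and the guess that the ignored third group is the "90" from the "last 90 days" label comes only from the field names.

```python
import re

FIELDS = ('avg_price', 'on_sale', 'sold_in_90', 'yesterday_check_num')

def parse_online_data(text):
    """Map the digit groups in text to named fields, mirroring get_online_data().

    Assumes exactly five groups, with the third one unused.
    Raises ValueError with a readable message otherwise.
    """
    numbers = re.findall(r'\d+', text.replace('\n', ''))
    if len(numbers) != 5:
        raise ValueError("expected 5 numbers, found %d: %r" % (len(numbers), numbers))
    avg_price, on_sale, _, sold_in_90, yesterday_check_num = numbers
    values = (avg_price, on_sale, sold_in_90, yesterday_check_num)
    # Convert to int here so callers do not have to deal with strings
    return {name: int(value) for name, value in zip(FIELDS, values)}

# Invented text; the real 'secondcon' div content may look different
sample = "avg 50000\non sale 120000\nin 90 days sold 30000\nviewings 8000"
parsed = parse_online_data(sample)
print(parsed)
```

Converting to int at this point also means the values match the Integer columns used later when the data is stored.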
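3) Combining the data / breaking it down by district
shanghai_data_process() combines the data obtained in 1 and 2. In addition, lianjia's Shanghai data can be queried per district, so that is handled here as well: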
def shanghai_data_process():
    '''
    Fetch the data for each district of Shanghai
    :return:
    '''
    chenjiao_page = "http://sh.lianjia.com/chengjiao/"
    ershoufang_page = "http://sh.lianjia.com/ershoufang/"
    # District name -> URL suffix; "all" is the city-wide page
    sh_area_dict = {
        "all": "",
        "pudongxinqu": "pudongxinqu/",
        "minhang": "minhang/",
        "baoshan": "baoshan/",
        "xuhui": "xuhui/",
        "putuo": "putuo/",
        "yangpu": "yangpu/",
        "changning": "changning/",
        "songjiang": "songjiang/",
        "jiading": "jiading/",
        "huangpu": "huangpu/",
        "jingan": "jingan/",
        "zhabei": "zhabei/",
        "hongkou": "hongkou/",
        "qingpu": "qingpu/",
        "fengxian": "fengxian/",
        "jinshan": "jinshan/",
        "chongming": "chongming/",
        "shanghaizhoubian": "shanghaizhoubian/",
    }
    dealed_house_num = get_total_dealed_house(chenjiao_page)
    sh_online_data = {}
    for key, value in sh_area_dict.items():
        sh_online_data[key] = get_online_data(ershoufang_page + value)
    print("dealed_house_num %s" % dealed_house_num)
    for key, value in sh_online_data.items():
        print(key, value)
4) Complete code and output
import urllib.request
import re
from bs4 import BeautifulSoup
import time

def obtain_page_data(target_url):
    with urllib.request.urlopen(target_url) as f:
        data = f.read().decode('utf8')
    return data

def get_total_dealed_house(target_url):
    # Get the total number of homes sold
    page_data = obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
    dealed_house_num = re.findall(r'\d+', dealed_house)[0]
    return int(dealed_house_num)

def get_online_data(target_url):
    # Get the city's average listing price, number on sale, sales in the last 90 days, viewings yesterday
    page_data = obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    online_data_str = soup_obj.html.body.find('div', {'class': 'secondcon'}).text
    online_data = online_data_str.replace('\n', '')
    avg_price, on_sale, _, sold_in_90, yesterday_check_num = re.findall(r'\d+', online_data)
    return {'avg_price': avg_price, 'on_sale': on_sale, 'sold_in_90': sold_in_90, 'yesterday_check_num': yesterday_check_num}

def shanghai_data_process():
    '''
    Fetch the data for each district of Shanghai
    :return:
    '''
    chenjiao_page = "http://sh.lianjia.com/chengjiao/"
    ershoufang_page = "http://sh.lianjia.com/ershoufang/"
    sh_area_dict = {
        "all": "",
        "pudongxinqu": "pudongxinqu/",
        "minhang": "minhang/",
        "baoshan": "baoshan/",
        "xuhui": "xuhui/",
        "putuo": "putuo/",
        "yangpu": "yangpu/",
        "changning": "changning/",
        "songjiang": "songjiang/",
        "jiading": "jiading/",
        "huangpu": "huangpu/",
        "jingan": "jingan/",
        "zhabei": "zhabei/",
        "hongkou": "hongkou/",
        "qingpu": "qingpu/",
        "fengxian": "fengxian/",
        "jinshan": "jinshan/",
        "chongming": "chongming/",
        "shanghaizhoubian": "shanghaizhoubian/",
    }
    dealed_house_num = get_total_dealed_house(chenjiao_page)
    sh_online_data = {}
    for key, value in sh_area_dict.items():
        sh_online_data[key] = get_online_data(ershoufang_page + value)
    print("dealed_house_num %s" % dealed_house_num)
    for key, value in sh_online_data.items():
        print(key, value)

def main():
    start_time = time.time()
    shanghai_data_process()
    print("time cost: %s" % (time.time() - start_time))

if __name__ == '__main__':
    main()
Initial source: collect_data.py
Result:
dealed_house_num 51691
zhabei {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
changning {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
baoshan {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
putuo {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
qingpu {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
jinshan {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
chongming {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
all {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
jingan {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
xuhui {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
songjiang {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
yangpu {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
pudongxinqu {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
shanghaizhoubian {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
minhang {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
hongkou {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
fengxian {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
jiading {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
huangpu {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
time cost: 12.94211196899414
Result
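Porting to tornado
1 Why use tornado
tornado is a small asynchronous Python framework. It is used here because sending requests to fetch web pages is IO-bound work that can benefit from asynchronous handling, especially later on when the number of requests grows.
2 Porting the initial data-fetching code to tornado
The key points are:
1) Fetching page data asynchronously
Use httpclient.AsyncHTTPClient().fetch() to fetch page data, combined with gen.coroutine + yield to make it asynchronous.
2) Returning data with raise gen.Return(data)
3) The first reworked version and its output are as follows: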
import re
from bs4 import BeautifulSoup
import time
from tornado import httpclient, gen, ioloop

@gen.coroutine
def obtain_page_data(target_url):
    response = yield httpclient.AsyncHTTPClient().fetch(target_url)
    data = response.body.decode('utf8')
    print("start %s %s" % (target_url, time.time()))
    raise gen.Return(data)

@gen.coroutine
def get_total_dealed_house(target_url):
    # Get the total number of homes sold
    page_data = yield obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
    dealed_house_num = re.findall(r'\d+', dealed_house)[0]
    raise gen.Return(int(dealed_house_num))

@gen.coroutine
def get_online_data(target_url):
    # Get the city's average listing price, number on sale, sales in the last 90 days, viewings yesterday
    page_data = yield obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    online_data_str = soup_obj.html.body.find('div', {'class': 'secondcon'}).text
    online_data = online_data_str.replace('\n', '')
    avg_price, on_sale, _, sold_in_90, yesterday_check_num = re.findall(r'\d+', online_data)
    raise gen.Return({'avg_price': avg_price, 'on_sale': on_sale, 'sold_in_90': sold_in_90, 'yesterday_check_num': yesterday_check_num})

@gen.coroutine
def shanghai_data_process():
    '''
    Fetch the data for each district of Shanghai
    :return:
    '''
    start_time = time.time()
    chenjiao_page = "http://sh.lianjia.com/chengjiao/"
    ershoufang_page = "http://sh.lianjia.com/ershoufang/"
    dealed_house_num = yield get_total_dealed_house(chenjiao_page)
    sh_area_dict = {
        "all": "",
        "pudongxinqu": "pudongxinqu/",
        "minhang": "minhang/",
        "baoshan": "baoshan/",
        "xuhui": "xuhui/",
        "putuo": "putuo/",
        "yangpu": "yangpu/",
        "changning": "changning/",
        "songjiang": "songjiang/",
        "jiading": "jiading/",
        "huangpu": "huangpu/",
        "jingan": "jingan/",
        "zhabei": "zhabei/",
        "hongkou": "hongkou/",
        "qingpu": "qingpu/",
        "fengxian": "fengxian/",
        "jinshan": "jinshan/",
        "chongming": "chongming/",
        "shanghaizhoubian": "shanghaizhoubian/",
    }
    sh_online_data = {}
    for key, value in sh_area_dict.items():
        sh_online_data[key] = yield get_online_data(ershoufang_page + value)
    print("dealed_house_num %s" % dealed_house_num)
    for key, value in sh_online_data.items():
        print(key, value)
    print("tornado time cost: %s" % (time.time() - start_time))

if __name__ == '__main__':
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(shanghai_data_process)
tornado initial version
start http://sh.lianjia.com/chengjiao/ 1480320585.879013
start http://sh.lianjia.com/ershoufang/jinshan/ 1480320586.575354
start http://sh.lianjia.com/ershoufang/chongming/ 1480320587.017322
start http://sh.lianjia.com/ershoufang/yangpu/ 1480320587.515317
start http://sh.lianjia.com/ershoufang/hongkou/ 1480320588.051793
start http://sh.lianjia.com/ershoufang/fengxian/ 1480320588.593865
start http://sh.lianjia.com/ershoufang/jiading/ 1480320589.134367
start http://sh.lianjia.com/ershoufang/qingpu/ 1480320589.6134
start http://sh.lianjia.com/ershoufang/pudongxinqu/ 1480320590.215136
start http://sh.lianjia.com/ershoufang/putuo/ 1480320590.696576
start http://sh.lianjia.com/ershoufang/zhabei/ 1480320591.34218
start http://sh.lianjia.com/ershoufang/changning/ 1480320591.935762
start http://sh.lianjia.com/ershoufang/xuhui/ 1480320592.5159
start http://sh.lianjia.com/ershoufang/minhang/ 1480320593.096085
start http://sh.lianjia.com/ershoufang/songjiang/ 1480320593.749226
start http://sh.lianjia.com/ershoufang/ 1480320594.306287
start http://sh.lianjia.com/ershoufang/shanghaizhoubian/ 1480320594.807418
start http://sh.lianjia.com/ershoufang/huangpu/ 1480320595.2744
start http://sh.lianjia.com/ershoufang/jingan/ 1480320595.850909
start http://sh.lianjia.com/ershoufang/baoshan/ 1480320596.368479
dealed_house_num 51691
jinshan {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
yangpu {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
hongkou {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
fengxian {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
chongming {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
pudongxinqu {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
putuo {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
zhabei {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
changning {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
baoshan {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
xuhui {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
minhang {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
songjiang {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
all {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
shanghaizhoubian {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
jingan {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
jiading {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
qingpu {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
huangpu {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
tornado time cost: 10.953541040420532
Output of the initial version
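The timestamps in this output are roughly half a second apart, which suggests the district pages are still being fetched one after another: each yield inside the for loop waits for its response before the next request is issued. The sketch below illustrates the difference between waiting for requests one by one and waiting for them together. It uses the standard-library asyncio with a simulated fetch (fake_fetch, the district-N names and the 0.1 s delay are invented for illustration, and no real requests are made), so it shows the idea rather than the tornado code itself.

```python
import asyncio
import time

async def fake_fetch(url, delay=0.1):
    """Stand-in for a network request: sleeps instead of doing I/O."""
    await asyncio.sleep(delay)
    return "page for %s" % url

async def fetch_sequentially(urls):
    # One request at a time: total time is roughly the sum of the delays
    return [await fake_fetch(url) for url in urls]

async def fetch_concurrently(urls):
    # All requests in flight at once: total time is roughly the longest delay
    return await asyncio.gather(*(fake_fetch(url) for url in urls))

def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

urls = ["district-%d" % i for i in range(5)]
seq_pages, seq_time = timed(fetch_sequentially(urls))
con_pages, con_time = timed(fetch_concurrently(urls))
print("sequential: %.2fs, concurrent: %.2fs" % (seq_time, con_time))
```

In tornado's gen.coroutine style, the counterpart would be to collect the futures first and yield them as a list, which waits for all of them in parallel. Whether the site tolerates that many simultaneous requests is a separate question.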
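Storing the data in a database
I use a MySQL database here. In tornado, pymysql can be used to connect to the database, and I use sqlalchemy to handle the DML in the program.
For more on sqlalchemy, see the linked post.
1) Table structure
Not many tables are needed:
sh_area // Shanghai district table, holding each district of Shanghai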
sh_total_city_dealed // total second-hand home sales for the Shanghai area
online_data // second-hand housing data for each Shanghai district
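As a rough picture of how the three tables relate, here is a sketch using the standard-library sqlite3 with an in-memory database. This is only an illustration: the project itself uses MySQL through sqlalchemy, the column names are taken from the model classes in the initialization script, and the inserted row values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sh_area (
    id INTEGER PRIMARY KEY,
    name VARCHAR(64)
);
CREATE TABLE sh_total_city_dealed (
    id INTEGER PRIMARY KEY,
    dealed_house_num INTEGER,
    date DATETIME,
    memo VARCHAR(64)
);
CREATE TABLE online_data (
    id INTEGER PRIMARY KEY,
    sold_in_90 INTEGER,
    avg_price INTEGER,
    yesterday_check_num INTEGER,
    on_sale INTEGER,
    date DATETIME,
    belong_area INTEGER REFERENCES sh_area(id)
);
""")

# Invented sample rows, just to show the relationship between the tables
conn.execute("INSERT INTO sh_area (id, name) VALUES (1, 'minhang')")
conn.execute(
    "INSERT INTO online_data "
    "(sold_in_90, avg_price, yesterday_check_num, on_sale, date, belong_area) "
    "VALUES (100, 50000, 20, 300, '2016-11-28 12:00:00', 1)"
)

# Each online_data row points at its district through belong_area
rows = conn.execute(
    "SELECT a.name, d.avg_price FROM online_data d "
    "JOIN sh_area a ON a.id = d.belong_area"
).fetchall()
print(rows)
```

sh_total_city_dealed stands alone because it holds one city-wide figure per fetch, while online_data holds one row per district per fetch.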
2) Initializing the tables with sqlalchemy
settings holds the database connection configuration.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

DB = {
    'connector': 'mysql+pymysql://root:[email protected]:3306/devdb1',
    'max_session': 5
}

engine = create_engine(DB['connector'], max_overflow=DB['max_session'], echo=False)
SessionCls = sessionmaker(bind=engine)
session = SessionCls()
settings.py
Initialization script
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, ForeignKey, DateTime

import os, sys
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(BASE_DIR)

from conf import settings

Base = declarative_base()

class SH_Area(Base):
    __tablename__ = 'sh_area'  # table name
    id = Column(Integer, primary_key=True)
    name = Column(String(64))

class Online_Data(Base):
    __tablename__ = 'online_data'  # table name
    id = Column(Integer, primary_key=True)
    sold_in_90 = Column(Integer)
    avg_price = Column(Integer)
    yesterday_check_num = Column(Integer)
    on_sale = Column(Integer)
    date = Column(DateTime)
    belong_area = Column(Integer, ForeignKey('sh_area.id'))

class SH_Total_city_dealed(Base):
    __tablename__ = 'sh_total_city_dealed'  # table name
    id = Column(Integer, primary_key=True)
    dealed_house_num = Column(Integer)
    date = Column(DateTime)
    memo = Column(String(64), nullable=True)

def db_init():
    Base.metadata.create_all(settings.engine)  # create the table structure
    # Seed sh_area with one row per district
    for district in settings.sh_area_dict.keys():
        item_obj = SH_Area(name=district)
        settings.session.add(item_obj)
    settings.session.commit()

if __name__ == '__main__':
    db_init()
database_init
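Drawing the charts
1 Front-end drawing
For the charts I use Highcharts. The graphics look good, and using it only requires supplying the data.
I use the basic line chart, which needs a few JS files included in the front end: jquery.min.js, highcharts.js, and exporting.js. Then add a div identified by an id; the sample uses id="container".
The official sample JS is as follows: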
$(function () {
    $('#container').highcharts({
        title: {
            text: 'Monthly Average Temperature',
            x: -20 //center
        },
        subtitle: {
            text: 'Source: WorldClimate.com',
            x: -20
        },
        xAxis: {
            categories: ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
        },
        yAxis: {
            title: {
                text: 'Temperature (°C)'
            },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {
            valueSuffix: '°C'
        },
        legend: {
            layout: 'vertical',
            align: 'right',
            verticalAlign: 'middle',
            borderWidth: 0
        },
        series: [{
            name: 'Tokyo',
            data: [7.0, 6.9, 9.5, 14.5, 18.2, 21.5, 25.2, 26.5, 23.3, 18.3, 13.9, 9.6]
        }, {
            name: 'New York',
            data: [-0.2, 0.8, 5.7, 11.3, 17.0, 22.0, 24.8, 24.1, 20.1, 14.1, 8.6, 2.5]
        }, {
            name: 'Berlin',
            data: [-0.9, 0.6, 3.5, 8.4, 13.5, 17.0, 18.6, 17.9, 14.3, 9.0, 3.9, 1.0]
        }, {
            name: 'London',
            data: [3.9, 4.2, 5.7, 8.5, 11.9, 15.2, 17.0, 16.6, 14.2, 10.3, 6.6, 4.8]
        }]
    });
});
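Official JS sample
My work builds on this, modifying the JS to draw charts that fit my own data.
For details, see the changes in the code on github. The resulting chart looks like this.
2 Fetching data on the back end and passing it to the front end
Essentially, the front-end charts need one- or two-dimensional arrays, for example an x-axis time array [time1,time2,time3] and a y-axis data array [data1,data2,data3].
A few points to note:
1) The tornado back end returns data by rendering it into the target page with render().
2) In the JS, use {{ data_rendered }} to get the data.
3) The time data passed from the back end to the front end is a Unix timestamp, which needs to be formatted for display, as follows: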
function formatDate(timestamp_v) {
    var now = new Date(parseFloat(timestamp_v)*1000);
    var year=now.getFullYear();
    var month=now.getMonth()+1;
    var date=now.getDate();
    var hour=now.getHours();
    var minute=now.getMinutes();
    var second=now.getSeconds();
    return year+"-"+month+"-"+date+" "+hour+":"+minute+":"+second;
};
formatDate
4) Take care with how two-dimensional arrays are defined and handled in the JS.
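To make the array shapes in the points above concrete, here is a small back-end sketch. The timestamp conversion is the same time.mktime(dt.timetuple()) call used in the handlers; the sample rows are invented, to_timestamp is a hypothetical helper name, and using json.dumps to produce the array text is a suggestion rather than what the templates in this project do.

```python
import datetime
import json
import time

def to_timestamp(dt):
    """Convert a naive local datetime to a Unix timestamp in seconds."""
    return time.mktime(dt.timetuple())

# Invented sample rows of (date, value)
rows = [
    (datetime.datetime(2016, 11, 27, 12, 0, 0), 100),
    (datetime.datetime(2016, 11, 28, 12, 0, 0), 120),
]

# Two parallel 1-D arrays: x-axis timestamps and y-axis values
x_values = [to_timestamp(dt) for dt, _ in rows]
y_values = [value for _, value in rows]

# One 2-D array of [timestamp, value] pairs
pairs = [[to_timestamp(dt), value] for dt, value in rows]

# json.dumps yields text that is also a valid JavaScript array literal
print(json.dumps(y_values))
print(json.dumps(pairs))
```

The timestamps are in seconds, which is why formatDate multiplies by 1000 before building a JavaScript Date.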
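3 Passing parameters from front-end requests to the back end
Since the requirements include viewing charts for each Shanghai district, the URL can be designed as r'/view/(\w+)/(\w+)', where the first part is the city (e.g. sh, bj) and the second is the specific district (area). The back end takes these two parameters, looks up the data in the database, and returns it.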
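The two capture groups in the pattern are what end up as the city and area arguments of the handler's get() method. A quick way to check what the pattern accepts is to try it with the standard-library re module; this sketch only approximates tornado's routing by requiring a full match, and parse_view_path is a hypothetical helper name.

```python
import re

VIEW_PATTERN = re.compile(r'/view/(\w+)/(\w+)')

def parse_view_path(path):
    """Return (city, area) if path matches the view pattern, else None."""
    match = VIEW_PATTERN.fullmatch(path)
    return match.groups() if match else None

print(parse_view_path('/view/sh/minhang'))
print(parse_view_path('/view/sh'))
```

Because \w+ does not match a slash, a path with only one segment after /view/ is rejected, and so is one with a trailing slash.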
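Final version
Once the data is in the database, what remains is the interaction between front end and back end: wherever the front end draws a chart, the back end returns the data it needs. The main code ends up like this: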
import re
from bs4 import BeautifulSoup
import datetime
import time
from tornado import httpclient, gen, ioloop, httpserver
from tornado import web
import tornado.options
import json

import os, sys
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(BASE_DIR)

from conf import settings
from database_init import Online_Data, SH_Total_city_dealed, SH_Area
from tornado.options import define, options

define("port", default=8888, type=int)

@gen.coroutine
def obtain_page_data(target_url):
    response = yield httpclient.AsyncHTTPClient().fetch(target_url)
    data = response.body.decode('utf8')
    print("start %s %s" % (target_url, time.time()))
    raise gen.Return(data)

@gen.coroutine
def get_total_dealed_house(target_url):
    # Get the total number of homes sold
    page_data = yield obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
    dealed_house_num = re.findall(r'\d+', dealed_house)[0]
    raise gen.Return(int(dealed_house_num))

@gen.coroutine
def get_online_data(target_url):
    # Get the city's average listing price, number on sale, sales in the last 90 days, viewings yesterday
    page_data = yield obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    online_data_str = soup_obj.html.body.find('div', {'class': 'secondcon'}).text
    online_data = online_data_str.replace('\n', '')
    avg_price, on_sale, _, sold_in_90, yesterday_check_num = re.findall(r'\d+', online_data)
    raise gen.Return({'avg_price': avg_price, 'on_sale': on_sale, 'sold_in_90': sold_in_90, 'yesterday_check_num': yesterday_check_num})

@gen.coroutine
def shanghai_data_process():
    '''
    Fetch the data for each district of Shanghai
    :return:
    '''
    start_time = time.time()
    chenjiao_page = "http://sh.lianjia.com/chengjiao/"
    ershoufang_page = "http://sh.lianjia.com/ershoufang/"
    dealed_house_num = yield get_total_dealed_house(chenjiao_page)
    sh_online_data = {}
    for key, value in settings.sh_area_dict.items():
        sh_online_data[key] = yield get_online_data(ershoufang_page + value)
    print("dealed_house_num %s" % dealed_house_num)
    for key, value in sh_online_data.items():
        print(key, value)
    print("tornado time cost: %s" % (time.time() - start_time))

    # Store the results through settings.session
    update_date = datetime.datetime.now()
    dealed_house_num_obj = SH_Total_city_dealed(dealed_house_num=dealed_house_num,
                                                date=update_date)
    settings.session.add(dealed_house_num_obj)

    for key, value in sh_online_data.items():
        area_obj = settings.session.query(SH_Area).filter_by(name=key).first()
        online_data_obj = Online_Data(sold_in_90=value['sold_in_90'],
                                      avg_price=value['avg_price'],
                                      yesterday_check_num=value['yesterday_check_num'],
                                      on_sale=value['on_sale'],
                                      date=update_date,
                                      belong_area=area_obj.id)
        settings.session.add(online_data_obj)
    settings.session.commit()

class IndexHandler(web.RequestHandler):
    def get(self, *args, **kwargs):
        # City-wide total sales over time
        total_dealed_house_num = settings.session.query(SH_Total_city_dealed).all()
        cata_list = []
        data_list = []
        for item in total_dealed_house_num:
            cata_list.append(time.mktime(item.date.timetuple()))
            data_list.append(item.dealed_house_num)

        # Per-area series; the index page shows the "all" area
        area_id = settings.session.query(SH_Area).filter_by(name='all').first()
        area_avg_price = settings.session.query(Online_Data).filter_by(belong_area=area_id.id).all()
        area_date_list = []
        area_data_list = []
        area_on_sale_list = []
        area_sold_in_90_list = []
        area_yesterday_check_num = []
        for item in area_avg_price:
            area_date_list.append(time.mktime(item.date.timetuple()))
            area_data_list.append(item.avg_price)
            area_on_sale_list.append([time.mktime(item.date.timetuple()), item.on_sale])
            area_sold_in_90_list.append(item.sold_in_90)
            area_yesterday_check_num.append(item.yesterday_check_num)
        self.render("index.html", cata_list=cata_list,
                    data_list=data_list, area_date_list=area_date_list, area_data_list=area_data_list,
                    area_on_sale_list=area_on_sale_list, area_sold_in_90_list=area_sold_in_90_list,
                    area_yesterday_check_num=area_yesterday_check_num, city="sh", area="all")

class QueryHandler(web.RequestHandler):
    def get(self, city, area):
        if city == "sh":
            total_dealed_house_num = settings.session.query(SH_Total_city_dealed).all()

            cata_list = []
            data_list = []
            for item in total_dealed_house_num:
                cata_list.append(time.mktime(item.date.timetuple()))
                data_list.append(item.dealed_house_num)

            area_id = settings.session.query(SH_Area).filter_by(name=area).first()
            area_avg_price = settings.session.query(Online_Data).filter_by(belong_area=area_id.id).all()
            area_date_list = []
            area_data_list = []
            area_on_sale_list = []
            area_sold_in_90_list = []
            area_yesterday_check_num = []
            for item in area_avg_price:
                area_date_list.append(time.mktime(item.date.timetuple()))
                area_data_list.append(item.avg_price)
                area_on_sale_list.append([time.mktime(item.date.timetuple()), item.on_sale])
                area_sold_in_90_list.append(item.sold_in_90)
                area_yesterday_check_num.append(item.yesterday_check_num)

            self.render("index.html", cata_list=cata_list,
                        data_list=data_list, area_date_list=area_date_list, area_data_list=area_data_list,
                        area_on_sale_list=area_on_sale_list, area_sold_in_90_list=area_sold_in_90_list,
                        area_yesterday_check_num=area_yesterday_check_num, city=city, area=area)
        else:
            self.redirect("/")

class MyApplication(web.Application):
    def __init__(self):
        handlers = [
            (r'/', IndexHandler),
            (r'/view/(\w+)/(\w+)', QueryHandler),
        ]

        settings = {
            'static_path': os.path.join(os.path.dirname(os.path.dirname(__file__)), "static"),
            'template_path': os.path.join(os.path.dirname(os.path.dirname(__file__)), "templates"),
        }

        super(MyApplication, self).__init__(handlers, **settings)

        # ioloop.PeriodicCallback(f2s, 2000).start()

if __name__ == '__main__':
    http_server = httpserver.HTTPServer(MyApplication())
    http_server.listen(options.port)
    ioloop.PeriodicCallback(shanghai_data_process, 86400000).start()  # milliseconds; 86400000 = one day
    ioloop.IOLoop.instance().start()
data_collect
A few notes:
1 Since the data has to be fetched from the site periodically, ioloop.PeriodicCallback() is used to schedule it.
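The interval passed to PeriodicCallback is in milliseconds, so one day is 24 * 60 * 60 * 1000 = 86400000. As far as I can tell, PeriodicCallback waits one full interval before its first call, so a fetch right at startup would need a separate call. The sketch below shows the "run once now, then repeat" idea with the standard-library asyncio; run_periodically and fake_collect are hypothetical names, and the tiny interval is only there so the demonstration finishes quickly.

```python
import asyncio

ONE_DAY_MS = 24 * 60 * 60 * 1000  # 86400000, the value passed to PeriodicCallback

async def run_periodically(job, interval_seconds, repeats):
    """Run job once right away, then again after each interval, `repeats` times in total."""
    for i in range(repeats):
        if i > 0:
            await asyncio.sleep(interval_seconds)
        await job()

calls = []

async def fake_collect():
    # Stand-in for shanghai_data_process(); only records that it ran
    calls.append(len(calls) + 1)

asyncio.run(run_periodically(fake_collect, interval_seconds=0.01, repeats=3))
print(ONE_DAY_MS, calls)
```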
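Deploying with nginx
I have an AWS EC2 instance running CentOS 7, and the program ultimately runs there.
1 Installing and deploying nginx
For lack of time I did not dig deep and only picked up the basics online:
1 Download the nginx package (nginx-1.11.6.tar.gz) with wget and extract it
2 cd into nginx-1.11.6
3 ./configure
4 make
5 make install
Edit the configuration file /usr/local/nginx/conf/nginx.conf
Reload nginx with /usr/local/nginx/sbin/nginx -s reload
2 Adjust the instance's inbound firewall rules; I added port 80 (the nginx configuration also listens on port 80)
1. Log in to the AWS console main page
2. On the left, INSTANCES - Instances
3. On the right, Security groups
4. Below, Inbound
5. Edit
6. Add your own rule on the Edit inbound rules page
3 Test access to nginx
If everything is working, the "Welcome to nginx" page is shown
4 Start the tornado code, then reload nginx
Screenshots and code
1 A few screenshots:
2 The code is on github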
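Fixing the sqlalchemy session problem
A few days after the code went live, I noticed that roughly every half day, although the program did not crash, visiting the page in a browser would return a 500 error, and the back-end log reported errors for those requests.
Looking closely at the errors in the log, the cause appeared to be that requests were being served with an old session that had already expired inside the program. Reviewing the code confirmed it: a single session is created in the settings file, and every later DB operation uses that one session. That is clearly a problem.
The fix is simple: tie the lifecycle of the database session to the lifecycle of each HTTP request. In other words, create a db session at the start of each HTTP request and close it when that request ends. The flask documentation on this topic is a useful reference.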
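The pattern itself is independent of sqlalchemy and tornado. As a minimal sketch of "one session per request", here it is with a standard-library sqlite3 connection standing in for the db session; request_scope and handle_request are hypothetical names, and the in-memory database is only a placeholder.

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def request_scope():
    """Open a fresh connection for one request and always close it afterwards."""
    conn = sqlite3.connect(":memory:")  # stand-in for creating a db session
    try:
        yield conn
    finally:
        conn.close()

def handle_request():
    # Everything the handler needs from the database happens inside the scope
    with request_scope() as conn:
        value = conn.execute("SELECT 1 + 1").fetchone()[0]
    return value, conn

result, closed_conn = handle_request()
print(result)
```

Because nothing outlives the request, there is no long-lived session left around to go stale between visits.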
1 The sqlalchemy part
To implement this, sqlalchemy provides an object not used so far, scoped_session. The official example:
>>> from sqlalchemy.orm import scoped_session
>>> from sqlalchemy.orm import sessionmaker
>>> # create the session
>>> session_factory = sessionmaker(bind=some_engine)
>>> Session = scoped_session(session_factory)
>>> # close the session
>>> Session.remove()
See the linked documentation for more details.
2 The tornado part
Override initialize() and on_finish() in RequestHandler. initialize() creates the db session, and on_finish() ends it. BaseHandler is a base handler; the other request handlers only need to inherit from BaseHandler.
class BaseHandler(web.RequestHandler):
    def initialize(self):
        self.db_session = scoped_session(sessionmaker(bind=settings.engine))
        self.db_query = self.db_session().query

    def on_finish(self):
        self.db_session.remove()