Purpose

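I've recently been house hunting and have gone through all sorts of listings. The data on lianjia seems fairly reliable (at the very least, the listing photos don't differ much from the real thing), so on a whim I decided to chart some of its housing data and get some practice along the way.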

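Requirements Analysis

1 Fetch the key housing data from the lianjia website, then display it in charts tailored to my own needs.

2 Fetch the data from the lianjia website once a day.

3 Focus on the Shanghai area (I live in Shanghai).

4 The final charts are: total housing transactions, average second-hand home price, homes currently on sale, transactions in the last 90 days, and number of viewings yesterday.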

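Analyzing and Fetching the Site Data

1 Data sources

The data comes mainly from two pages:

http://sh.lianjia.com/chengjiao/   // transaction volume statistics

Data on the page (shown below is the count before logging in; it seems to be slightly higher after logging in):

http://sh.lianjia.com/ershoufang/  // second-hand housing data

Data on the page:

2 Fetch method

For fetching web page data, scrapy is the first thing that comes to mind, but since the data needed here is neither large nor complex, urllib.request is enough. Later, because Tornado's asynchronous features are used, it is replaced with httpclient.AsyncHTTPClient().fetch().

3 Fetching the data with urllib.request

First, to scrape data from a page, use the basic function obtain_page_data: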

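    def obtain_page_data(target_url):
        # Fetch the page at target_url and return its body decoded as UTF-8
        with urllib.request.urlopen(target_url) as f:
            data = f.read().decode('utf8')
        return data

obtain_page_data() simply requests the given page and returns its content.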

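Once the page has been fetched, the required values are extracted from it. There are two main parts:

1) Total housing transactions (http://sh.lianjia.com/chengjiao/)

Define the function get_total_dealed_house(), which returns the total transaction count shown on the page. After calling obtain_page_data() to get the page data, work out where this number sits in the HTML.

The number sits inside a div, so after parsing the fetched HTML with BeautifulSoup, its text can be read with:

    dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text

With the text in hand, a regular expression filters out the non-digit characters, leaving the number. The full function: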

    def get_total_dealed_house(target_url):
        # Get the total number of completed housing transactions
        page_data = obtain_page_data(target_url)
        soup_obj = BeautifulSoup(page_data, "html.parser")
        dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
        dealed_house_num = re.findall(r'\d+', dealed_house)[0]
        return int(dealed_house_num)
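The digit-filtering step can be tried on its own, without touching the network. The sketch below is only an illustration: first_int is a hypothetical helper and the sample string is made up, not real page text. It also shows two things worth knowing about this approach: text with no digits has to be handled explicitly (re.findall(...)[0] would raise an IndexError), and r'\d+' stops at a thousands separator.

```python
import re


def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none.

    Hypothetical helper, not part of the original script.
    """
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


# Made-up sample text standing in for the div's content
sample = "Found 51691 completed transactions in Shanghai"

if __name__ == "__main__":
    print(first_int(sample))
    print(first_int("no digits here"))
    print(first_int("51,691"))  # only the digits before the comma are matched
```

If the page ever renders the count with separators, the separators would have to be stripped before matching.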

2) Other online data (http://sh.lianjia.com/ershoufang/)

Similarly, first work out where the required data sits in the page, then fetch and filter it:

    def get_online_data(target_url):
        # Get the city's average listing price, number of homes on sale,
        # transactions in the last 90 days, and number of viewings yesterday
        page_data = obtain_page_data(target_url)
        soup_obj = BeautifulSoup(page_data, "html.parser")
        online_data_str = soup_obj.html.body.find('div', {'class': 'secondcon'}).text
        online_data = online_data_str.replace('\n', '')
        avg_price, on_sale, _, sold_in_90, yesterday_check_num = re.findall(r'\d+', online_data)
        return {'avg_price': avg_price, 'on_sale': on_sale,
                'sold_in_90': sold_in_90, 'yesterday_check_num': yesterday_check_num}
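The five-way unpacking above depends on the text containing exactly five runs of digits. My guess is that the third one, thrown away into `_`, is the literal "90" in the "last 90 days" label, but that is an assumption about the page text. Here is a rough sketch of the same idea with an explicit length check and integer conversion; parse_online_numbers, FIELDS and the sample string are all made up for illustration.

```python
import re

# Position of each number in the text; None marks a number to discard
# (assumed here to be the "90" from the "last 90 days" label).
FIELDS = ('avg_price', 'on_sale', None, 'sold_in_90', 'yesterday_check_num')


def parse_online_numbers(text):
    """Map the digit runs found in text onto FIELDS, converting them to int."""
    numbers = re.findall(r'\d+', text)
    if len(numbers) != len(FIELDS):
        raise ValueError("expected %d numbers, found %d" % (len(FIELDS), len(numbers)))
    return {name: int(value) for name, value in zip(FIELDS, numbers) if name is not None}


# Made-up sample standing in for the 'secondcon' div text
sample = "avg price 52000 yuan/sqm, on sale 61234, sold in last 90 days 8123, viewings yesterday 20456"

if __name__ == "__main__":
    print(parse_online_numbers(sample))
```

Failing loudly when the count is off makes a page layout change show up as an error instead of as values silently landing in the wrong columns.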

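3) Combining the data and splitting it by district

The function shanghai_data_process() combines the data obtained in 1) and 2). In addition, the Shanghai data on lianjia can be queried per district, so that is handled here as well: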

    def shanghai_data_process():
        '''
        Fetch the data for every district of Shanghai.
        :return:
        '''
        chenjiao_page = "http://sh.lianjia.com/chengjiao/"
        ershoufang_page = "http://sh.lianjia.com/ershoufang/"
        # District name -> URL path segment; "all" is the whole city
        sh_area_dict = {
            "all": "",
            "pudongxinqu": "pudongxinqu/",
            "minhang": "minhang/",
            "baoshan": "baoshan/",
            "xuhui": "xuhui/",
            "putuo": "putuo/",
            "yangpu": "yangpu/",
            "changning": "changning/",
            "songjiang": "songjiang/",
            "jiading": "jiading/",
            "huangpu": "huangpu/",
            "jingan": "jingan/",
            "zhabei": "zhabei/",
            "hongkou": "hongkou/",
            "qingpu": "qingpu/",
            "fengxian": "fengxian/",
            "jinshan": "jinshan/",
            "chongming": "chongming/",
            "shanghaizhoubian": "shanghaizhoubian/",
        }
        dealed_house_num = get_total_dealed_house(chenjiao_page)
        sh_online_data = {}
        for key, value in sh_area_dict.items():
            sh_online_data[key] = get_online_data(ershoufang_page + value)
        print("dealed_house_num %s" % dealed_house_num)
        for key, value in sh_online_data.items():
            print(key, value)
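The per-district URLs are just the base page plus a path segment. As a small sketch, this builds them with urllib.parse.urljoin instead of string concatenation; district_urls and the three-entry sample dictionary are illustrative names, not part of the original script.

```python
from urllib.parse import urljoin

ERSHOUFANG_PAGE = "http://sh.lianjia.com/ershoufang/"


def district_urls(base, area_paths):
    """Return {district name: full URL}; an empty path maps to the base page."""
    return {name: urljoin(base, path) for name, path in area_paths.items()}


# A shortened sample of the district dictionary used above
sample_areas = {"all": "", "minhang": "minhang/", "xuhui": "xuhui/"}

if __name__ == "__main__":
    for name, url in district_urls(ERSHOUFANG_PAGE, sample_areas).items():
        print(name, url)
```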

4) Complete code and output

    import urllib.request
    import re
    from bs4 import BeautifulSoup
    import time


    def obtain_page_data(target_url):
        # Fetch the page at target_url and return its body decoded as UTF-8
        with urllib.request.urlopen(target_url) as f:
            data = f.read().decode('utf8')
        return data


    def get_total_dealed_house(target_url):
        # Get the total number of completed housing transactions
        page_data = obtain_page_data(target_url)
        soup_obj = BeautifulSoup(page_data, "html.parser")
        dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
        dealed_house_num = re.findall(r'\d+', dealed_house)[0]
        return int(dealed_house_num)


    def get_online_data(target_url):
        # Get the city's average listing price, number of homes on sale,
        # transactions in the last 90 days, and number of viewings yesterday
        page_data = obtain_page_data(target_url)
        soup_obj = BeautifulSoup(page_data, "html.parser")
        online_data_str = soup_obj.html.body.find('div', {'class': 'secondcon'}).text
        online_data = online_data_str.replace('\n', '')
        avg_price, on_sale, _, sold_in_90, yesterday_check_num = re.findall(r'\d+', online_data)
        return {'avg_price': avg_price, 'on_sale': on_sale,
                'sold_in_90': sold_in_90, 'yesterday_check_num': yesterday_check_num}


    def shanghai_data_process():
        '''
        Fetch the data for every district of Shanghai.
        :return:
        '''
        chenjiao_page = "http://sh.lianjia.com/chengjiao/"
        ershoufang_page = "http://sh.lianjia.com/ershoufang/"
        # District name -> URL path segment; "all" is the whole city
        sh_area_dict = {
            "all": "",
            "pudongxinqu": "pudongxinqu/",
            "minhang": "minhang/",
            "baoshan": "baoshan/",
            "xuhui": "xuhui/",
            "putuo": "putuo/",
            "yangpu": "yangpu/",
            "changning": "changning/",
            "songjiang": "songjiang/",
            "jiading": "jiading/",
            "huangpu": "huangpu/",
            "jingan": "jingan/",
            "zhabei": "zhabei/",
            "hongkou": "hongkou/",
            "qingpu": "qingpu/",
            "fengxian": "fengxian/",
            "jinshan": "jinshan/",
            "chongming": "chongming/",
            "shanghaizhoubian": "shanghaizhoubian/",
        }
        dealed_house_num = get_total_dealed_house(chenjiao_page)
        sh_online_data = {}
        for key, value in sh_area_dict.items():
            sh_online_data[key] = get_online_data(ershoufang_page + value)
        print("dealed_house_num %s" % dealed_house_num)
        for key, value in sh_online_data.items():
            print(key, value)


    def main():
        start_time = time.time()
        shanghai_data_process()
        print("time cost: %s" % (time.time() - start_time))


    if __name__ == '__main__':
        main()

Initial source: collect_data.py

Result:

 dealed_house_num 51691
zhabei {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
changning {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
baoshan {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
putuo {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
qingpu {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
jinshan {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
chongming {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
all {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
jingan {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
xuhui {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
songjiang {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
yangpu {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
pudongxinqu {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
shanghaizhoubian {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
minhang {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
hongkou {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
fengxian {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
jiading {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
huangpu {'yesterday_check_num': '', 'sold_in_90': '', 'avg_price': '', 'on_sale': ''}
time cost: 12.94211196899414

Result

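Porting to Tornado

1 Why Tornado

Tornado is a small asynchronous Python framework. It is used here because sending requests to fetch page data is IO-bound and can be made more efficient asynchronously; Tornado should help especially later on, when traffic grows.

2 Porting the initial data-fetching code to Tornado

The key points are:

1) Fetch page data asynchronously

Use httpclient.AsyncHTTPClient().fetch() to fetch page data, together with gen.coroutine + yield to make it asynchronous.

2) Return data with raise gen.Return(data)

3) The initial ported version and its output: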

    import re
    from bs4 import BeautifulSoup
    import time
    from tornado import httpclient, gen, ioloop


    @gen.coroutine
    def obtain_page_data(target_url):
        # Fetch the page asynchronously and return its body decoded as UTF-8
        response = yield httpclient.AsyncHTTPClient().fetch(target_url)
        data = response.body.decode('utf8')
        print("start %s %s" % (target_url, time.time()))
        raise gen.Return(data)


    @gen.coroutine
    def get_total_dealed_house(target_url):
        # Get the total number of completed housing transactions
        page_data = yield obtain_page_data(target_url)
        soup_obj = BeautifulSoup(page_data, "html.parser")
        dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
        dealed_house_num = re.findall(r'\d+', dealed_house)[0]
        raise gen.Return(int(dealed_house_num))


    @gen.coroutine
    def get_online_data(target_url):
        # Get the city's average listing price, number of homes on sale,
        # transactions in the last 90 days, and number of viewings yesterday
        page_data = yield obtain_page_data(target_url)
        soup_obj = BeautifulSoup(page_data, "html.parser")
        online_data_str = soup_obj.html.body.find('div', {'class': 'secondcon'}).text
        online_data = online_data_str.replace('\n', '')
        avg_price, on_sale, _, sold_in_90, yesterday_check_num = re.findall(r'\d+', online_data)
        raise gen.Return({'avg_price': avg_price, 'on_sale': on_sale,
                          'sold_in_90': sold_in_90, 'yesterday_check_num': yesterday_check_num})


    @gen.coroutine
    def shanghai_data_process():
        '''
        Fetch the data for every district of Shanghai.
        :return:
        '''
        start_time = time.time()
        chenjiao_page = "http://sh.lianjia.com/chengjiao/"
        ershoufang_page = "http://sh.lianjia.com/ershoufang/"
        dealed_house_num = yield get_total_dealed_house(chenjiao_page)
        # District name -> URL path segment; "all" is the whole city
        sh_area_dict = {
            "all": "",
            "pudongxinqu": "pudongxinqu/",
            "minhang": "minhang/",
            "baoshan": "baoshan/",
            "xuhui": "xuhui/",
            "putuo": "putuo/",
            "yangpu": "yangpu/",
            "changning": "changning/",
            "songjiang": "songjiang/",
            "jiading": "jiading/",
            "huangpu": "huangpu/",
            "jingan": "jingan/",
            "zhabei": "zhabei/",
            "hongkou": "hongkou/",
            "qingpu": "qingpu/",
            "fengxian": "fengxian/",
            "jinshan": "jinshan/",
            "chongming": "chongming/",
            "shanghaizhoubian": "shanghaizhoubian/",
        }
        sh_online_data = {}
        for key, value in sh_area_dict.items():
            sh_online_data[key] = yield get_online_data(ershoufang_page + value)
        print("dealed_house_num %s" % dealed_house_num)
        for key, value in sh_online_data.items():
            print(key, value)
        print("tornado time cost: %s" % (time.time() - start_time))


    if __name__ == '__main__':
        io_loop = ioloop.IOLoop.current()
        io_loop.run_sync(shanghai_data_process)

Initial Tornado version

 start http://sh.lianjia.com/chengjiao/ 1480320585.879013
start http://sh.lianjia.com/ershoufang/jinshan/ 1480320586.575354
start http://sh.lianjia.com/ershoufang/chongming/ 1480320587.017322
start http://sh.lianjia.com/ershoufang/yangpu/ 1480320587.515317
start http://sh.lianjia.com/ershoufang/hongkou/ 1480320588.051793
start http://sh.lianjia.com/ershoufang/fengxian/ 1480320588.593865
start http://sh.lianjia.com/ershoufang/jiading/ 1480320589.134367
start http://sh.lianjia.com/ershoufang/qingpu/ 1480320589.6134
start http://sh.lianjia.com/ershoufang/pudongxinqu/ 1480320590.215136
start http://sh.lianjia.com/ershoufang/putuo/ 1480320590.696576
start http://sh.lianjia.com/ershoufang/zhabei/ 1480320591.34218
start http://sh.lianjia.com/ershoufang/changning/ 1480320591.935762
start http://sh.lianjia.com/ershoufang/xuhui/ 1480320592.5159
start http://sh.lianjia.com/ershoufang/minhang/ 1480320593.096085
start http://sh.lianjia.com/ershoufang/songjiang/ 1480320593.749226
start http://sh.lianjia.com/ershoufang/ 1480320594.306287
start http://sh.lianjia.com/ershoufang/shanghaizhoubian/ 1480320594.807418
start http://sh.lianjia.com/ershoufang/huangpu/ 1480320595.2744
start http://sh.lianjia.com/ershoufang/jingan/ 1480320595.850909
start http://sh.lianjia.com/ershoufang/baoshan/ 1480320596.368479
dealed_house_num 51691
jinshan {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
yangpu {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
hongkou {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
fengxian {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
chongming {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
pudongxinqu {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
putuo {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
zhabei {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
changning {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
baoshan {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
xuhui {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
minhang {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
songjiang {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
all {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
shanghaizhoubian {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
jingan {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
jiading {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
qingpu {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
huangpu {'yesterday_check_num': '', 'on_sale': '', 'avg_price': '', 'sold_in_90': ''}
tornado time cost: 10.953541040420532

Output of the initial version
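The timestamps in the log are roughly half a second apart, which suggests the pages are still fetched one after another: each yield inside the for loop waits for the previous fetch to finish, and that would explain why the total time barely improves on the urllib version. The sketch below shows the difference between awaiting one at a time and issuing everything together. It uses the standard library's asyncio with a stand-in fake_fetch that just sleeps, so it is not Tornado's API and the timings are artificial.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.05):
    """Stand-in for a network fetch: sleeps instead of doing real IO."""
    await asyncio.sleep(delay)
    return "page for %s" % url


async def fetch_sequential(urls):
    # Each await finishes before the next fetch starts
    return [await fake_fetch(url) for url in urls]


async def fetch_concurrent(urls):
    # All fetches are started together; results come back in input order
    return list(await asyncio.gather(*(fake_fetch(url) for url in urls)))


def timed(coro_fn, urls):
    """Run coro_fn(urls) to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(coro_fn(urls))
    return pages, time.perf_counter() - start


URLS = ["http://sh.lianjia.com/ershoufang/%s/" % name
        for name in ("minhang", "xuhui", "putuo", "yangpu", "jingan")]

if __name__ == "__main__":
    _, t_seq = timed(fetch_sequential, URLS)
    _, t_con = timed(fetch_concurrent, URLS)
    print("sequential: %.2fs, concurrent: %.2fs" % (t_seq, t_con))
```

In a gen.coroutine, the corresponding move is to yield a list or dict of futures in one go rather than yielding inside the loop; Tornado then waits for them in parallel. Whether the site tolerates that many simultaneous requests is a separate question.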

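Storing the Data in a Database

I use a MySQL database here. From Tornado it can be reached through pymysql, and I use sqlalchemy for the program's DML.

See here for details on the sqlalchemy part.

1) Table structure

Not many tables are needed:

sh_area   // Shanghai area table, holding each district of Shanghai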

sh_total_city_dealed  // total second-hand home transactions in the Shanghai area

online_data  // second-hand housing data for each district of Shanghai
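To give a rough picture of the three tables, here is a sketch that creates them in an in-memory SQLite database using the standard library. The column names and types follow the sqlalchemy models in the initialization script later in this post; SQLite stands in for MySQL purely so the sketch runs anywhere, and create_schema and table_names are illustrative helpers.

```python
import sqlite3

# Columns mirror the sqlalchemy models in the initialization script
SCHEMA = """
CREATE TABLE sh_area (
    id INTEGER PRIMARY KEY,
    name VARCHAR(64)
);
CREATE TABLE sh_total_city_dealed (
    id INTEGER PRIMARY KEY,
    dealed_house_num INTEGER,
    date DATETIME,
    memo VARCHAR(64)
);
CREATE TABLE online_data (
    id INTEGER PRIMARY KEY,
    sold_in_90 INTEGER,
    avg_price INTEGER,
    yesterday_check_num INTEGER,
    on_sale INTEGER,
    date DATETIME,
    belong_area INTEGER REFERENCES sh_area(id)
);
"""


def create_schema(conn):
    """Create the three tables on the given connection."""
    conn.executescript(SCHEMA)


def table_names(conn):
    """Return the names of all tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
    return [row[0] for row in rows]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    print(table_names(connection))
```

Each online_data row points at one sh_area row through belong_area, so a single daily run adds one row per district plus one row in sh_total_city_dealed.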

2) Initializing the tables with sqlalchemy

settings holds the database connection configuration.

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    DB = {
        'connector': 'mysql+pymysql://root:[email protected]:3306/devdb1',
        'max_session': 5
    }

    engine = create_engine(DB['connector'], max_overflow=DB['max_session'], echo=False)
    SessionCls = sessionmaker(bind=engine)
    session = SessionCls()

settings.py

Initialization script

    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy import Column, Integer, String, ForeignKey, DateTime

    import os, sys
    BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    sys.path.append(BASE_DIR)

    from conf import settings

    Base = declarative_base()


    class SH_Area(Base):
        __tablename__ = 'sh_area'  # table name
        id = Column(Integer, primary_key=True)
        name = Column(String(64))


    class Online_Data(Base):
        __tablename__ = 'online_data'  # table name
        id = Column(Integer, primary_key=True)
        sold_in_90 = Column(Integer)
        avg_price = Column(Integer)
        yesterday_check_num = Column(Integer)
        on_sale = Column(Integer)
        date = Column(DateTime)
        belong_area = Column(Integer, ForeignKey('sh_area.id'))


    class SH_Total_city_dealed(Base):
        __tablename__ = 'sh_total_city_dealed'  # table name
        id = Column(Integer, primary_key=True)
        dealed_house_num = Column(Integer)
        date = Column(DateTime)
        memo = Column(String(64), nullable=True)


    def db_init():
        Base.metadata.create_all(settings.engine)  # create the tables
        # Insert one sh_area row per district
        for district in settings.sh_area_dict.keys():
            item_obj = SH_Area(name=district)
            settings.session.add(item_obj)
        settings.session.commit()


    if __name__ == '__main__':
        db_init()

database_init

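Drawing the Charts

1 Drawing on the front end

For the charts I use Highcharts. The graphics look good, and all you have to do is supply the data.

I use the basic line chart, which requires a few js files on the front end: jquery.min.js, highcharts.js, and exporting.js. Then add a div identified by an id; the sample uses id="container".

The official js code is as follows: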

    $(function () {
        $('#container').highcharts({
            title: {
                text: 'Monthly Average Temperature',
                x: -20 //center
            },
            subtitle: {
                text: 'Source: WorldClimate.com',
                x: -20
            },
            xAxis: {
                categories: ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
            },
            yAxis: {
                title: {
                    text: 'Temperature (°C)'
                },
                plotLines: [{
                    value: 0,
                    width: 1,
                    color: '#808080'
                }]
            },
            tooltip: {
                valueSuffix: '°C'
            },
            legend: {
                layout: 'vertical',
                align: 'right',
                verticalAlign: 'middle',
                borderWidth: 0
            },
            series: [{
                name: 'Tokyo',
                data: [7.0, 6.9, 9.5, 14.5, 18.2, 21.5, 25.2, 26.5, 23.3, 18.3, 13.9, 9.6]
            }, {
                name: 'New York',
                data: [-0.2, 0.8, 5.7, 11.3, 17.0, 22.0, 24.8, 24.1, 20.1, 14.1, 8.6, 2.5]
            }, {
                name: 'Berlin',
                data: [-0.9, 0.6, 3.5, 8.4, 13.5, 17.0, 18.6, 17.9, 14.3, 9.0, 3.9, 1.0]
            }, {
                name: 'London',
                data: [3.9, 4.2, 5.7, 8.5, 11.9, 15.2, 17.0, 16.6, 14.2, 10.3, 6.6, 4.8]
            }]
        });
    });

Official js

My job is to modify this js so that it draws the charts I want.

For the specific changes, see the code on github. The final charts look like this.

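2 Fetching data on the back end and passing it to the front end

Essentially, the front-end charts need one- or two-dimensional arrays, for example an x-axis time array [time1,time2,time3] and a y-axis data array [data1,data2,data3].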
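As a sketch of what those arrays look like on the Python side, the helper below turns (datetime, value) rows into a timestamp array, a value array, and a two-dimensional array of [timestamp, value] pairs, using the same time.mktime(item.date.timetuple()) conversion as the handlers later in this post. to_series and the sample rows are made up for illustration.

```python
import json
import time
from datetime import datetime


def to_series(rows):
    """Split (datetime, value) rows into the arrays a line chart needs.

    Returns (timestamps, values, pairs) where pairs is [[timestamp, value], ...].
    """
    stamps = [time.mktime(day.timetuple()) for day, _ in rows]
    values = [value for _, value in rows]
    pairs = [[stamp, value] for stamp, value in zip(stamps, values)]
    return stamps, values, pairs


# Made-up rows standing in for database records
sample_rows = [
    (datetime(2016, 11, 28, 8, 0, 0), 51691),
    (datetime(2016, 11, 29, 8, 0, 0), 51720),
]

if __name__ == "__main__":
    stamps, values, pairs = to_series(sample_rows)
    # json.dumps gives text that is also a valid JavaScript array literal
    print(json.dumps(stamps))
    print(json.dumps(pairs))
```

time.mktime interprets the datetime in the server's local time zone, so the timestamps depend on where the code runs.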

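A few points to note:

1) To return data from the Tornado back end, render it into the target page with render().

2) In the js, use {{ data_rendered }} to read the data.

3) Time data arrives from the back end as Unix timestamps and has to be formatted for display, as follows: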

    function formatDate(timestamp_v) {
        var now = new Date(parseFloat(timestamp_v) * 1000);
        var year = now.getFullYear();
        var month = now.getMonth() + 1;
        var date = now.getDate();
        var hour = now.getHours();
        var minute = now.getMinutes();
        var second = now.getSeconds();
        return year + "-" + month + "-" + date + " " + hour + ":" + minute + ":" + second;
    };

formatDate

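4) Take care with how two-dimensional arrays are defined in the js.

3 Passing parameters from the front end to the back end

Since the requirements include viewing charts for each district of Shanghai, the URL can be designed as r'/view/(\w+)/(\w+)': the first group is the city (sh, bj, and so on) and the second is the district (area). The back end receives both parameters, looks up the data in the database, and returns it.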
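To see which paths this pattern accepts and what the handler would receive, the regular expression can be tried directly with the re module. This is only a rough stand-in for Tornado's routing, under the assumption that the whole path has to match; parse_view_path is a hypothetical helper.

```python
import re

ROUTE = re.compile(r'/view/(\w+)/(\w+)')


def parse_view_path(path):
    """Return (city, area) if the whole path matches the route, else None."""
    match = ROUTE.fullmatch(path)
    return match.groups() if match else None


if __name__ == "__main__":
    print(parse_view_path('/view/sh/minhang'))
    print(parse_view_path('/view/sh'))
    print(parse_view_path('/view/sh/min-hang'))  # '-' is not matched by \w
```

The two captured groups arrive as the positional arguments of the handler's get(self, city, area).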

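Final Version

Once the database holds data, what remains is the exchange between front end and back end: wherever the front end draws a chart and needs data, the back end returns it. The main code ends up like this: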

    import re
    from bs4 import BeautifulSoup
    import datetime
    import time
    from tornado import httpclient, gen, ioloop, httpserver
    from tornado import web
    import tornado.options
    import json

    import os, sys
    BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    sys.path.append(BASE_DIR)

    from conf import settings
    from database_init import Online_Data, SH_Total_city_dealed, SH_Area
    from tornado.options import define, options

    define("port", default=8888, type=int)


    @gen.coroutine
    def obtain_page_data(target_url):
        # Fetch the page asynchronously and return its body decoded as UTF-8
        response = yield httpclient.AsyncHTTPClient().fetch(target_url)
        data = response.body.decode('utf8')
        print("start %s %s" % (target_url, time.time()))
        raise gen.Return(data)


    @gen.coroutine
    def get_total_dealed_house(target_url):
        # Get the total number of completed housing transactions
        page_data = yield obtain_page_data(target_url)
        soup_obj = BeautifulSoup(page_data, "html.parser")
        dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
        dealed_house_num = re.findall(r'\d+', dealed_house)[0]
        raise gen.Return(int(dealed_house_num))


    @gen.coroutine
    def get_online_data(target_url):
        # Get the city's average listing price, number of homes on sale,
        # transactions in the last 90 days, and number of viewings yesterday
        page_data = yield obtain_page_data(target_url)
        soup_obj = BeautifulSoup(page_data, "html.parser")
        online_data_str = soup_obj.html.body.find('div', {'class': 'secondcon'}).text
        online_data = online_data_str.replace('\n', '')
        avg_price, on_sale, _, sold_in_90, yesterday_check_num = re.findall(r'\d+', online_data)
        raise gen.Return({'avg_price': avg_price, 'on_sale': on_sale,
                          'sold_in_90': sold_in_90, 'yesterday_check_num': yesterday_check_num})


    @gen.coroutine
    def shanghai_data_process():
        '''
        Fetch the data for every district of Shanghai and store it.
        :return:
        '''
        start_time = time.time()
        chenjiao_page = "http://sh.lianjia.com/chengjiao/"
        ershoufang_page = "http://sh.lianjia.com/ershoufang/"
        dealed_house_num = yield get_total_dealed_house(chenjiao_page)
        sh_online_data = {}
        for key, value in settings.sh_area_dict.items():
            sh_online_data[key] = yield get_online_data(ershoufang_page + value)
        print("dealed_house_num %s" % dealed_house_num)
        for key, value in sh_online_data.items():
            print(key, value)
        print("tornado time cost: %s" % (time.time() - start_time))

        # Store the results through settings.session
        update_date = datetime.datetime.now()
        dealed_house_num_obj = SH_Total_city_dealed(dealed_house_num=dealed_house_num,
                                                    date=update_date)
        settings.session.add(dealed_house_num_obj)

        for key, value in sh_online_data.items():
            area_obj = settings.session.query(SH_Area).filter_by(name=key).first()
            online_data_obj = Online_Data(sold_in_90=value['sold_in_90'],
                                          avg_price=value['avg_price'],
                                          yesterday_check_num=value['yesterday_check_num'],
                                          on_sale=value['on_sale'],
                                          date=update_date,
                                          belong_area=area_obj.id)
            settings.session.add(online_data_obj)
        settings.session.commit()


    class IndexHandler(web.RequestHandler):
        def get(self, *args, **kwargs):
            # City-wide transaction totals over time
            total_dealed_house_num = settings.session.query(SH_Total_city_dealed).all()
            cata_list = []
            data_list = []
            for item in total_dealed_house_num:
                cata_list.append(time.mktime(item.date.timetuple()))
                data_list.append(item.dealed_house_num)

            # Online data for the whole city ("all")
            area_id = settings.session.query(SH_Area).filter_by(name='all').first()
            area_avg_price = settings.session.query(Online_Data).filter_by(belong_area=area_id.id).all()
            area_date_list = []
            area_data_list = []
            area_on_sale_list = []
            area_sold_in_90_list = []
            area_yesterday_check_num = []
            for item in area_avg_price:
                area_date_list.append(time.mktime(item.date.timetuple()))
                area_data_list.append(item.avg_price)
                area_on_sale_list.append([time.mktime(item.date.timetuple()), item.on_sale])
                area_sold_in_90_list.append(item.sold_in_90)
                area_yesterday_check_num.append(item.yesterday_check_num)
            self.render("index.html", cata_list=cata_list,
                        data_list=data_list, area_date_list=area_date_list, area_data_list=area_data_list,
                        area_on_sale_list=area_on_sale_list, area_sold_in_90_list=area_sold_in_90_list,
                        area_yesterday_check_num=area_yesterday_check_num, city="sh", area="all")


    class QueryHandler(web.RequestHandler):
        def get(self, city, area):
            if city == "sh":
                # City-wide transaction totals over time
                total_dealed_house_num = settings.session.query(SH_Total_city_dealed).all()
                cata_list = []
                data_list = []
                for item in total_dealed_house_num:
                    cata_list.append(time.mktime(item.date.timetuple()))
                    data_list.append(item.dealed_house_num)

                # Online data for the requested district
                area_id = settings.session.query(SH_Area).filter_by(name=area).first()
                area_avg_price = settings.session.query(Online_Data).filter_by(belong_area=area_id.id).all()
                area_date_list = []
                area_data_list = []
                area_on_sale_list = []
                area_sold_in_90_list = []
                area_yesterday_check_num = []
                for item in area_avg_price:
                    area_date_list.append(time.mktime(item.date.timetuple()))
                    area_data_list.append(item.avg_price)
                    area_on_sale_list.append([time.mktime(item.date.timetuple()), item.on_sale])
                    area_sold_in_90_list.append(item.sold_in_90)
                    area_yesterday_check_num.append(item.yesterday_check_num)

                self.render("index.html", cata_list=cata_list,
                            data_list=data_list, area_date_list=area_date_list, area_data_list=area_data_list,
                            area_on_sale_list=area_on_sale_list, area_sold_in_90_list=area_sold_in_90_list,
                            area_yesterday_check_num=area_yesterday_check_num, city=city, area=area)
            else:
                self.redirect("/")


    class MyApplication(web.Application):
        def __init__(self):
            handlers = [
                (r'/', IndexHandler),
                (r'/view/(\w+)/(\w+)', QueryHandler),
            ]
            settings = {
                'static_path': os.path.join(os.path.dirname(os.path.dirname(__file__)), "static"),
                'template_path': os.path.join(os.path.dirname(os.path.dirname(__file__)), "templates"),
            }
            super(MyApplication, self).__init__(handlers, **settings)
            # ioloop.PeriodicCallback(f2s, 2000).start()


    if __name__ == '__main__':
        http_server = httpserver.HTTPServer(MyApplication())
        http_server.listen(options.port)
        ioloop.PeriodicCallback(shanghai_data_process, 86400000).start()  # milliseconds; 86400000 = one day
        ioloop.IOLoop.instance().start()

data_collect

A few notes:

1 Because data has to be fetched from the site periodically, ioloop.PeriodicCallback() is used to schedule it.
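PeriodicCallback takes its interval in milliseconds, which is where the 86400000 in the code comes from. A tiny sketch that derives the number instead of hard-coding it; interval_ms is a hypothetical helper.

```python
from datetime import timedelta


def interval_ms(**kwargs):
    """Convert a timedelta-style duration (days=, hours=, ...) to whole milliseconds."""
    return int(timedelta(**kwargs).total_seconds() * 1000)


if __name__ == "__main__":
    print(interval_ms(days=1))
    print(interval_ms(hours=12))
```

As far as I know, PeriodicCallback waits one full interval before its first call, so with a one-day interval nothing is collected until a day after startup unless shanghai_data_process is also run once when the program starts.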

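Deploying with nginx

I have an AWS EC2 virtual machine running CentOS 7, and the program ultimately runs there.

1 Installing and deploying nginx

For lack of time I did not dig deep into this; I only looked up the basics online, as follows:

1 Download the nginx package (nginx-1.11.6.tar.gz) with wget and extract it
2 Enter the nginx-1.11.6 directory
3 ./configure
4 make
5 make install

Edit the configuration file /usr/local/nginx/conf/nginx.conf

To reload nginx, run /usr/local/nginx/sbin/nginx -s reload

2 Adjust the VM's inbound firewall rules; I opened port 80 (the nginx configuration file also listens on port 80)

1. Log in to the AWS console home page
2. On the left, INSTANCES - Instances
3. On the right, Security groups
4. Below, Inbound
5. Edit
6. Add your own rule on the Edit inbound rules page

3 Test access to nginx

If everything is working, the "Welcome to nginx" page is displayed.

4 Start the Tornado code, then reload nginx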

Screenshots and Code

1 A few screenshots:

2 The code is on github

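Fixing the sqlalchemy session problem

A few days after the code went live, I found that roughly every half day, although the program did not crash, visiting it in a browser returned a 500 error. The backend log recorded errors for those requests as well.

Looking closely at the errors in the backend log, the cause appeared to be that requests were being served with stale session information that had already expired inside the program. Reviewing the code confirmed it: a single session is initialized in the settings file, and every later DB operation uses that one session. That is clearly a problem.

The fix is simple: tie the lifetime of the database session to the lifetime of each HTTP request. In other words, initialize a DB session when each HTTP request begins, and close it when the request ends. The Flask framework's documentation on this topic is a useful reference.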
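The open-at-start, close-at-end lifecycle can be sketched without any web framework. The example below uses the standard library's sqlite3 and a context manager as stand-ins for the sqlalchemy session and the HTTP request; request_scope and handle_request are hypothetical names, and the real fix in this post uses scoped_session instead.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def request_scope(db_path=":memory:"):
    """Open a connection for one unit of work and always close it afterwards."""
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()      # the request succeeded: keep its changes
    except Exception:
        conn.rollback()    # the request failed: discard its changes
        raise
    finally:
        conn.close()       # never carry the connection over to another request


def handle_request():
    """Stand-in for a request handler that needs the database once."""
    with request_scope() as conn:
        return conn.execute("SELECT 1").fetchone()[0]


if __name__ == "__main__":
    print(handle_request())
```

Because nothing outlives the with block, a connection that the database server has dropped in the meantime cannot be reused by a later request.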

1 sqlalchemy part

To implement the above, sqlalchemy needs a new object, scoped_session. The official example:

    >>> from sqlalchemy.orm import scoped_session
    >>> from sqlalchemy.orm import sessionmaker

    # create the session
    >>> session_factory = sessionmaker(bind=some_engine)
    >>> Session = scoped_session(session_factory)

    # close the session
    >>> Session.remove()

See here for more details

2 Tornado part

In the RequestHandler, override the two functions initialize() and on_finish(). initialize() sets up the DB session, and on_finish() ends it. BaseHandler is a base handler; the other request handlers only need to inherit from it.

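    from sqlalchemy.orm import scoped_session, sessionmaker

    class BaseHandler(web.RequestHandler):
        def initialize(self):
            # Create a DB session for this request
            self.db_session = scoped_session(sessionmaker(bind=settings.engine))
            self.db_query = self.db_session().query

        def on_finish(self):
            # Release the DB session when the request ends
            self.db_session.remove()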