问题描述
我有一些数据.SQLite 数据库中的 224,000 行.我想从中提取时间序列信息以提供数据可视化工具.本质上,db 中的每一行都是一个事件,它具有(除其他外不严格相关的)自纪元以来以秒为单位的时间-日期组和负责它的名称.我想在数据库中提取每个名称每周有多少事件.
I have some data. 224,000 rows of it, in a SQLite database. I want to extract time series information from it to feed a data visualisation tool. Essentially, each row in the db is an event that has (among other things not strictly relevant) a time-date group in seconds since the epoch and a name responsible for it. I want to extract how many events each name has for every week in the db.
这很简单:
SELECT COUNT(*),
name,
strf("%W:%Y", time, "unixepoch")
FROM events
GROUP BY strf("%W:%Y", time, "unixepoch"), name
ORDER BY time
我们得到大约六千行数据.
and we get about six thousand rows of data.
count name week:year
23............ fudge.......23:2009
etc...
但我不希望每个名字在每周都有一行 - 我希望每个名字都有一行,每周有一列,就像这样:
But I don't want a row for each name in each week - I want a row for each name, and a column for each week, like this:
Name 23:2009 24:2009 25:2009
fudge........23............6............19
fish.........1.............0............12
etc...
现在,监控过程已经运行了 69 周,唯一名称的数量是 502.很明显,我并不热衷于任何涉及硬编码所有列的解决方案,更不用说行了.我不太了解任何涉及迭代的事情,比如 python 的 executemany(),但如果有必要,我愿意接受它.SQL 应该是设置明智的,该死的.
Now, the monitoring process has been running for 69 weeks, and the count of unique names is 502. So clearly, I'm far from keen on any solution that involves hardcoding all the columns and still less the rows. I'm less unkeen on anything that involves iterating over the lot, say with python's executemany(), but I'm willing to accept it if necessary. SQL is meant to be set-wise, dammit.
推荐答案
在这种情况下,一个好的方法是不要将 SQL 推到令人费解且难以理解和维护的地步.让 SQL 尽其所能,并在 Python 中对查询结果进行后处理.
A good approach in cases like this is not to push SQL to the point where it becomes convoluted and hard to understand and maintain. Let SQL do what it conveniently can and post-process the query results in Python.
这是我编写的简单交叉表生成器的精简版.完整版提供行/列/总计.
Here's a cut-down version of a simple crosstab generator that I wrote. The full version delivers row/column/grand totals.
您会注意到它具有内置的分组依据"——原始用例是用于汇总使用 Python 和 xlrd 从 Excel 文件中获得的数据.
You'll note that it has built-in "group by" -- the original use-case was for summarising data obtained from Excel files using Python and xlrd.
您提供的 row_key
和 col_key
不需要像示例中那样是字符串;它们可以是元组——例如(year, week)
在你的情况下——或者它们可以是整数——例如你有一个字符串列名到整数排序键的映射.
The row_key
and col_key
that you supply don't need to be strings as in the example; they can be tuples -- e.g. (year, week)
in your case -- or they could be integers -- e.g. you have a mapping of string column name to integer sort key.
import sys
class CrossTab(object):
def __init__(
self,
missing=0, # what to return for an empty cell. Alternatives: '', 0.0, None, 'NULL'
):
self.missing = missing
self.col_key_set = set()
self.cell_dict = {}
self.headings_OK = False
def add_item(self, row_key, col_key, value):
self.col_key_set.add(col_key)
try:
self.cell_dict[row_key][col_key] += value
except KeyError:
try:
self.cell_dict[row_key][col_key] = value
except KeyError:
self.cell_dict[row_key] = {col_key: value}
def _process_headings(self):
if self.headings_OK:
return
self.row_headings = list(sorted(self.cell_dict.iterkeys()))
self.col_headings = list(sorted(self.col_key_set))
self.headings_OK = True
def get_col_headings(self):
self._process_headings()
return self.col_headings
def generate_row_info(self):
self._process_headings()
for row_key in self.row_headings:
row_dict = self.cell_dict[row_key]
row_vals = [row_dict.get(col_key, self.missing) for col_key in self.col_headings]
yield row_key, row_vals
def dump(self, f=None, header=None, footer='', ):
if f is None:
f = sys.stdout
alist = self.__dict__.items()
alist.sort()
if header is not None:
print >> f, header
for attr, value in alist:
print >> f, "%s: %r" % (attr, value)
if footer is not None:
print >> f, footer
if __name__ == "__main__":
data = [
['Rob', 'Morn', 240],
['Rob', 'Aft', 300],
['Joe', 'Morn', 70],
['Joe', 'Aft', 80],
['Jill', 'Morn', 100],
['Jill', 'Aft', 150],
['Rob', 'Aft', 40],
['Rob', 'aft', 5],
['Dozy', 'Aft', 1],
# Dozy doesn't show up till lunch-time
['Nemo', 'never', -1],
]
NAME, TIME, AMOUNT = range(3)
xlate_time = {'morn': "AM", "aft": "PM"}
print
ctab = CrossTab(missing=None, )
# ctab.dump(header='=== after init ===')
for s in data:
ctab.add_item(
row_key=s[NAME],
col_key= xlate_time.get(s[TIME].lower(), "XXXX"),
value=s[AMOUNT])
# ctab.dump(header='=== after add_item ===')
print ctab.get_col_headings()
# ctab.dump(header='=== after get_col_headings ===')
for x in ctab.generate_row_info():
print x
输出:
['AM', 'PM', 'XXXX']
('Dozy', [None, 1, None])
('Jill', [100, 150, None])
('Joe', [70, 80, None])
('Nemo', [None, None, -1])
('Rob', [240, 345, None])
这篇关于透视 SQLite 表,像 SQL 一样设置的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!