本文介绍了透视 SQLite 表,像 SQL 一样设置的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一些数据.SQLite 数据库中的 224,000 行.我想从中提取时间序列信息以提供数据可视化工具.本质上,db 中的每一行都是一个事件,它具有(除其他外不严格相关的)自纪元以来以秒为单位的时间-日期组和负责它的名称.我想在数据库中提取每个名称每周有多少事件.

I have some data. 224,000 rows of it, in a SQLite database. I want to extract time series information from it to feed a data visualisation tool. Essentially, each row in the db is an event that has (among other things not strictly relevant) a time-date group in seconds since the epoch and a name responsible for it. I want to extract how many events each name has for every week in the db.

这很简单:

SELECT COUNT(*),
       name,
       strf("%W:%Y", time, "unixepoch")
  FROM events
 GROUP BY strf("%W:%Y", time, "unixepoch"), name
 ORDER BY time

我们得到大约六千行数据.

and we get about six thousand rows of data.

count          name        week:year
23............ fudge.......23:2009
etc...

但我不希望每个名字在每周都有一行 - 我希望每个名字都有一行,每周有一列,就像这样:

But I don't want a row for each name in each week - I want a row for each name, and a column for each week, like this:

Name      23:2009       24:2009    25:2009
fudge........23............6............19
fish.........1.............0............12
etc...

现在,监控过程已经运行了 69 周,唯一名称的数量是 502.很明显,我并不热衷于任何涉及硬编码所有列的解决方案,更不用说行了.我不太了解任何涉及迭代的事情,比如 python 的 executemany(),但如果有必要,我愿意接受它.SQL 应该是设置明智的,该死的.

Now, the monitoring process has been running for 69 weeks, and the count of unique names is 502. So clearly, I'm far from keen on any solution that involves hardcoding all the columns and still less the rows. I'm less unkeen on anything that involves iterating over the lot, say with python's executemany(), but I'm willing to accept it if necessary. SQL is meant to be set-wise, dammit.

推荐答案

在这种情况下,一个好的方法是不要将 SQL 推到令人费解且难以理解和维护的地步.让 SQL 尽其所能,并在 Python 中对查询结果进行后处理.

A good approach in cases like this is not to push SQL to the point where it becomes convoluted and hard to understand and maintain. Let SQL do what it conveniently can and post-process the query results in Python.

这是我编写的简单交叉表生成器的精简版.完整版提供行/列/总计.

Here's a cut-down version of a simple crosstab generator that I wrote. The full version delivers row/column/grand totals.

您会注意到它具有内置的分组依据"——原始用例是用于汇总使用 Python 和 xlrd 从 Excel 文件中获得的数据.

You'll note that it has built-in "group by" -- the original use-case was for summarising data obtained from Excel files using Python and xlrd.

您提供的 row_keycol_key 不需要像示例中那样是字符串;它们可以是元组——例如(year, week) 在你的情况下——或者它们可以是整数——例如你有一个字符串列名到整数排序键的映射.

The row_key and col_key that you supply don't need to be strings as in the example; they can be tuples -- e.g. (year, week) in your case -- or they could be integers -- e.g. you have a mapping of string column name to integer sort key.

import sys

class CrossTab(object):

    def __init__(
        self,
        missing=0, # what to return for an empty cell. Alternatives: '', 0.0, None, 'NULL'
        ):
        self.missing = missing
        self.col_key_set = set()
        self.cell_dict = {}
        self.headings_OK = False

    def add_item(self, row_key, col_key, value):
        self.col_key_set.add(col_key)
        try:
            self.cell_dict[row_key][col_key] += value
        except KeyError:
            try:
                self.cell_dict[row_key][col_key] = value
            except KeyError:
                self.cell_dict[row_key] = {col_key: value}

    def _process_headings(self):
        if self.headings_OK:
            return
        self.row_headings = list(sorted(self.cell_dict.iterkeys()))
        self.col_headings = list(sorted(self.col_key_set))
        self.headings_OK = True

    def get_col_headings(self):
        self._process_headings()
        return self.col_headings

    def generate_row_info(self):
        self._process_headings()
        for row_key in self.row_headings:
            row_dict = self.cell_dict[row_key]
            row_vals = [row_dict.get(col_key, self.missing) for col_key in self.col_headings]
            yield row_key, row_vals

    def dump(self, f=None, header=None, footer='', ):
        if f is None:
            f = sys.stdout
        alist = self.__dict__.items()
        alist.sort()
        if header is not None:
            print >> f, header
        for attr, value in alist:
            print >> f, "%s: %r" % (attr, value)
        if footer is not None:
            print >> f, footer

if __name__ == "__main__":

    data = [
        ['Rob', 'Morn', 240],
        ['Rob', 'Aft',  300],
        ['Joe', 'Morn',  70],
        ['Joe', 'Aft',   80],
        ['Jill', 'Morn', 100],
        ['Jill', 'Aft',  150],
        ['Rob', 'Aft',   40],
        ['Rob', 'aft',    5],
        ['Dozy', 'Aft',   1],
        # Dozy doesn't show up till lunch-time
        ['Nemo', 'never', -1],
        ]
    NAME, TIME, AMOUNT = range(3)
    xlate_time = {'morn': "AM", "aft": "PM"}

    print
    ctab = CrossTab(missing=None, )
    # ctab.dump(header='=== after init ===')
    for s in data:
        ctab.add_item(
            row_key=s[NAME],
            col_key= xlate_time.get(s[TIME].lower(), "XXXX"),
            value=s[AMOUNT])
        # ctab.dump(header='=== after add_item ===')
    print ctab.get_col_headings()
    # ctab.dump(header='=== after get_col_headings ===')
    for x in ctab.generate_row_info():
        print x

输出:

['AM', 'PM', 'XXXX']
('Dozy', [None, 1, None])
('Jill', [100, 150, None])
('Joe', [70, 80, None])
('Nemo', [None, None, -1])
('Rob', [240, 345, None])

这篇关于透视 SQLite 表,像 SQL 一样设置的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-12 03:28