PostgreSQL 9.1

业务状况

每个月,都会为特定的流程分配一批新的帐户。可以按月份,帐户数和帐户总余额来描述每批。该过程的目标是从客户那里收回一些余额。
然后每月分别跟踪每个批次(自将批次转移到流程以来,每个月回收的金额)。

目标

我的目标是预测将来将回收多少。

数据定义

create table vintage_data (
    granularity date,       /* Month when account entered process*/
    distance_in_months integer, /* Distance in months from date when accounts entered process*/
    entry_accounts integer,     /* Number of accounts that entered process in a given month*/
    entry_amount numeric,       /* Total amount for account that entered process in a given month*/
    recovery_amount numeric     /* Amount recovered in Nth month on accounts that entered process in a given month */
);

样本数据
insert into vintage_data values('2012-01-31',1,200,100000,1000);
insert into vintage_data values('2012-01-31',2,200,100000,2000);
insert into vintage_data values('2012-01-31',3,200,100000,3000);
insert into vintage_data values('2012-01-31',4,200,100000,3500);
insert into vintage_data values('2012-01-31',5,200,100000,3400);
insert into vintage_data values('2012-01-31',6,200,100000,3300);
insert into vintage_data values('2012-02-28',1,250,150000,1200);
insert into vintage_data values('2012-02-28',2,250,150000,1600);
insert into vintage_data values('2012-02-28',3,250,150000,1800);
insert into vintage_data values('2012-02-28',4,250,150000,1200);
insert into vintage_data values('2012-02-28',5,250,150000,1600);
insert into vintage_data values('2012-03-31',1,200,90000,1300);
insert into vintage_data values('2012-03-31',2,200,90000,1200);
insert into vintage_data values('2012-03-31',3,200,90000,1400);
insert into vintage_data values('2012-03-31',4,200,90000,1000);
insert into vintage_data values('2012-04-30',1,300,180000,1600);
insert into vintage_data values('2012-04-30',2,300,180000,1500);
insert into vintage_data values('2012-04-30',3,300,180000,4000);
insert into vintage_data values('2012-05-31',1,400,225000,2200);
insert into vintage_data values('2012-05-31',2,400,225000,6000);
insert into vintage_data values('2012-06-30',1,100,60000,1000);

计算过程

您可以将数据想象成一个三角形矩阵(将预测X值):
distance_in_months                       1      2     3       4      5      6
granularity entry_accounts  entry_amount
2012-01-31  200             100000       1000   2000   3000   3500   3400   3300
2012-02-28  250             150000       1200   1600   1800   1200   1600   (X-1)
2012-03-31  200              90000       1300   1200   1400   1000   (X0)   (X4)
2012-04-30  300             180000       1600   1500   4000   (X1)   (X5)   (X8)
2012-05-31  400             225000       2200   6000   (X2)   (X6)   (X9)   (X11)
2012-06-30  100              60000       1000   (X3)   (X7)   (X10)  (X12   (X13)

算法

我的目标是预测所有遗漏的点( future )。为了说明该过程,这是对点X1的计算

1)使用不超过4的距离获取前三个月的行总计:
2012-01-31  1000+2000+3000+3500=9500 (d4m3)
2012-02-28  1200+1600+1800+1200=5800 (d4m2)
2012-03-31  1300+1200+1400+1000=4900 (d4m1)

2)使用不超过3的距离获取前三个月的行总计:
2012-01-31  1000+2000+3000=6000 (d3m3)
2012-02-28  1200+1600+1800=4600 (d3m2)
2012-03-31  1300+1200+1400=3800 (d3m1)

3)计算距离3和距离4的加权平均运行率(由entry_amount加权):
(d4m3+d4m2+d4m1)/(100000+150000+90000) = (9500+5800+4900)/(100000+150000+90000) = 20200/340000 = 0.0594
(d3m3+d3m2+d3m1)/(100000+150000+90000) = (6000+4600+3800)/(100000+150000+90000) = 14400/340000 = 0.0424

4)计算距离3和距离4之间的变化
((d4m3+d4m2+d4m1)/(100000+150000+90000))/((d3m3+d3m2+d3m1)/(100000+150000+90000)) =
= (20200/340000)/(14400/340000) =
= 0.0594/0.0424 = 1.403 (PredictionRateForX1)

5)使用不超过3的距离计算预测月份的行总计
2012-04-30  1600+1500+4000=7100

6)使用entry_amount计算预测月份的费率
7100/180000 = 0.0394

7)计算X1的预测速率
0.0394 * PredictionRateForX1 = 0.05534

8)计算X1的数量
(0.05534-0.0394)*180000 = 2869.2

问题

问题是如何使用SQL语句计算矩阵的其余部分(从x-1到x13)。显然,这将需要某种递归算法。

最佳答案

这是一项艰巨的任务,将其拆分以使其更易于管理。我会把它放在带有RETURN TABLE的plpgsql函数中:

  • 使用交叉表查询为“计算过程”矩阵创建一个临时表
    您需要为此安装 tablefunc 模块。运行(每个数据库一次):
    CREATE EXTENSION tablefunc;
    
  • 逐字段更新临时表。
  • 返回表。


  • 以下演示功能完整,并已在PostgreSQL 9.1.4中进行了测试。以问题中提供的表定义为基础:
    -- DROP FUNCTION f_forcast();
    
    CREATE OR REPLACE FUNCTION f_forcast()
      RETURNS TABLE (
      granularity date
     ,entry_accounts numeric
     ,entry_amount numeric
     ,d1 numeric
     ,d2 numeric
     ,d3 numeric
     ,d4 numeric
     ,d5 numeric
     ,d6 numeric) AS
    $BODY$
    BEGIN
    
    --== Create temp table with result of crosstab() ==--
    
    CREATE TEMP TABLE matrix ON COMMIT DROP AS
    SELECT *
    FROM   crosstab (
            'SELECT granularity, entry_accounts, entry_amount
                   ,distance_in_months, recovery_amount
             FROM   vintage_data
             ORDER  BY 1, 2',
    
            'SELECT DISTINCT distance_in_months
             FROM   vintage_data
             ORDER  BY 1')
    AS tbl (
      granularity date
     ,entry_accounts numeric
     ,entry_amount numeric
     ,d1 numeric
     ,d2 numeric
     ,d3 numeric
     ,d4 numeric
     ,d5 numeric
     ,d6 numeric
     );
    
    ANALYZE matrix; -- update statistics to help calculations
    
    
    --== Calculations ==--
    
    -- I implemented the first calculation for X1 and leave the rest to you.
    -- Can probably be generalized in a loop or even a single statement.
    
    UPDATE matrix m
    SET    d4 = (
        SELECT (sum(x.d1) + sum(x.d2) + sum(x.d3) + sum(x.d4))
                /(sum(x.d1) + sum(x.d2) + sum(x.d3)) - 1
                -- removed redundant sum(entry_amount) from equation
        FROM  (
            SELECT *
            FROM   matrix a
            WHERE  a.granularity < m.granularity
            ORDER  BY a.granularity DESC
            LIMIT  3
            ) x
        ) * (m.d1 + m.d2 + m.d3)
    WHERE m.granularity = '2012-04-30';
    
    --- Next update X2 ..
    
    
    --== Return results ==--
    
    RETURN QUERY
    TABLE  matrix
    ORDER  BY 1;
    
    END;
    $BODY$ LANGUAGE plpgsql;
    

    称呼:
    SELECT * FROM f_forcast();
    

    我已经简化了很多,在计算中删除了一些多余的步骤。
    该解决方案采用了多种先进技术。您需要了解PostgreSQL的使用方式。

    10-08 15:01