I am trying to analyze the relationship between the probability of a call and the distance to a vehicle.
A sample dataset (csv here) looks like this:
id  day         time  called  d
1   2009-06-24  1700  0       1037.6
1   2009-06-24  1710  1       1191.9
1   2009-06-24  1720  0        165.5
The real dataset has 10 million rows. In each 10-minute time window there are ids representing locations that either call or do not call. First, I want to drop every row whose id never called on any day during the whole period. The remaining rows then represent ids that, at the given time of day, called on some day within the study period. For those I want to create a variable that equals 0 in the row where the call occurs, -1 at the same time on the day before the call (it could be the hour, week, or month before instead, but here it is days), +1 on the day after, and so on. Later I will use this variable, together with called and distance, as input to analyze and compare different locations. I have looked through other answered questions but have not found a suitable answer, so please either answer here or point me to an existing question. I am using Stata 13, but solutions in Postgres 9.3 or R are also welcome.
I need to repeat this process many times on several datasets, so ideally I would like to automate it as much as possible.
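For that first filtering step, a minimal Stata sketch (assuming the variables are named as in the sample above) could be something like:

* keep only ids that made at least one call during the whole period
bysort id: egen ever_called = max(called)
drop if ever_called == 0
drop ever_called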
Update:
Here is an example of the desired result:
id  day         time  called  d       newvar  newvar2
1   2009-06-24  1700  0       1037.6  null
1   2009-06-24  1710  1       1191.9  0       -2
1   2009-06-24  1720  0        165.5  -1
1   2009-06-25  1700  0        526.7  null
1   2009-06-25  1710  0        342.5  1       -1
1   2009-06-25  1720  1        416.1  0
1   2009-06-26  1700  0        428.3  null
1   2009-06-26  1710  1        240.7  2        0
1   2009-06-26  1720  0        228.7  1
1   2009-06-27  1700  0        282.5  null
1   2009-06-27  1710  0        182.1  3        1
1   2009-06-27  1720  0        195.5  2
2   2009-06-24  1700  0        198.0  -1
2   2009-06-24  1710  0        157.4  null
2   2009-06-24  1720  0        234.9  null
2   2009-06-25  1700  1        247.0  0
I added newvar2 because some locations may call more than once within a given time window.

Best answer
When looking for a Stata solution, it is best to provide a data example using dataex (from SSC).
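For example, a short sketch of generating such an example from the question's data (the in 1/6 range is arbitrary and the variable names follow the sample above):

ssc install dataex
dataex id day time called d in 1/6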
This problem is hard to visualize until the data are sorted by id and time (and further by day). I did not convert the day variable to a Stata numeric date because, as constructed, its string sort order matches the natural date order.
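If a real Stata date were needed later (for instance, to measure gaps in days across irregular dates), a minimal conversion sketch would be (the new variable name date is arbitrary):

* convert the string day ("YYYY-MM-DD") to a Stata daily date
gen date = daily(day, "YMD")
format date %td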
For each call, you appear to want a day offset relative to the date of the call within each id time group. This can be done by generating an order variable that indexes the current observation within each id time group and then subtracting the index of the observation where the call was made.
Since a given time slot can contain more than one call, the loop must run up to the maximum number of calls found in any single time slot in the data.
Compared with your expected results, this solution produces one difference: you appear to have overlooked the call at 1710 on 2009-06-27.
In the example below, the original data has been extended and sorted by id and time so that the reader can better see what is going on (note, for example, the id == 2 group).
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str10 day int time byte called float distance str4 newvar byte newvar2
1 "2009-06-24" 1700 0 1037.6 "null" .
1 "2009-06-25" 1700 0 526.7 "null" .
1 "2009-06-26" 1700 0 428.3 "null" .
1 "2009-06-27" 1700 0 282.5 "null" .
1 "2009-06-24" 1710 1 1191.9 "0" -2
1 "2009-06-25" 1710 0 342.5 "1" -1
1 "2009-06-26" 1710 1 240.7 "2" 0
1 "2009-06-27" 1710 0 182.1 "3" 1
1 "2009-06-24" 1720 0 165.5 "-1" .
1 "2009-06-25" 1720 1 416.1 "0" .
1 "2009-06-26" 1720 0 228.7 "1" .
1 "2009-06-27" 1720 0 195.5 "2" .
2 "2009-06-24" 1700 0 198 "-1" .
2 "2009-06-25" 1700 1 247 "0" .
2 "2009-06-26" 1700 0 188.7 "1" .
2 "2009-06-27" 1700 0 203.5 "2" .
2 "2009-06-24" 1710 0 157.4 "null" .
2 "2009-06-25" 1710 0 221.3 "null" .
2 "2009-06-26" 1710 0 283.8 "null" .
2 "2009-06-27" 1710 1 91.7 "null" .
2 "2009-06-24" 1720 0 234.9 "null" .
2 "2009-06-25" 1720 0 249.6 "null" .
2 "2009-06-26" 1720 0 279.7 "null" .
2 "2009-06-27" 1720 0 198.2 "null" .
3 "2009-06-24" 1700 0 156.1 "-1" .
3 "2009-06-25" 1700 1 19.9 "0" .
3 "2009-06-26" 1700 0 195.2 "1" .
3 "2009-06-27" 1700 0 306.2 "2" .
3 "2009-06-24" 1710 0 150.1 "null" .
3 "2009-06-25" 1710 0 163.7 "null" .
3 "2009-06-26" 1710 0 288.2 "null" .
3 "2009-06-27" 1710 0 311.7 "null" .
3 "2009-06-24" 1720 0 135.1 "-2" .
3 "2009-06-25" 1720 0 186 "-1" .
3 "2009-06-26" 1720 1 297.2 "0" .
3 "2009-06-27" 1720 0 375.9 "1" .
end
* order observations by date within an id time group
sort id time day
by id time: gen order = _n
* number of calls at any given time
by id time: gen call = sum(called)
* repeat enough to cover the max number of calls per time
sum call, meanonly
local n = r(max)
forvalues i = 1/`n' {
// the index of the called observation in the id time group
by id time: gen index = order if called & call == `i'
// replicate the index for all observations in the id time group
by id time: egen gindex = total(index)
// the relative position of each obs in groups with a call
gen wanted`i' = order - gindex if gindex > 0
drop index gindex
}
list, sepby(id time) noobs compress
And the results:
. list, sepby(id time) noobs compress
+----------------------------------------------------------------------------------------+
| id day time cal~d dist~e new~r new~2 order call wan~1 wan~2 |
|----------------------------------------------------------------------------------------|
| 1 2009-06-24 1700 0 1037.6 null . 1 0 . . |
| 1 2009-06-25 1700 0 526.7 null . 2 0 . . |
| 1 2009-06-26 1700 0 428.3 null . 3 0 . . |
| 1 2009-06-27 1700 0 282.5 null . 4 0 . . |
|----------------------------------------------------------------------------------------|
| 1 2009-06-24 1710 1 1191.9 0 -2 1 1 0 -2 |
| 1 2009-06-25 1710 0 342.5 1 -1 2 1 1 -1 |
| 1 2009-06-26 1710 1 240.7 2 0 3 2 2 0 |
| 1 2009-06-27 1710 0 182.1 3 1 4 2 3 1 |
|----------------------------------------------------------------------------------------|
| 1 2009-06-24 1720 0 165.5 -1 . 1 0 -1 . |
| 1 2009-06-25 1720 1 416.1 0 . 2 1 0 . |
| 1 2009-06-26 1720 0 228.7 1 . 3 1 1 . |
| 1 2009-06-27 1720 0 195.5 2 . 4 1 2 . |
|----------------------------------------------------------------------------------------|
| 2 2009-06-24 1700 0 198 -1 . 1 0 -1 . |
| 2 2009-06-25 1700 1 247 0 . 2 1 0 . |
| 2 2009-06-26 1700 0 188.7 1 . 3 1 1 . |
| 2 2009-06-27 1700 0 203.5 2 . 4 1 2 . |
|----------------------------------------------------------------------------------------|
| 2 2009-06-24 1710 0 157.4 null . 1 0 -3 . |
| 2 2009-06-25 1710 0 221.3 null . 2 0 -2 . |
| 2 2009-06-26 1710 0 283.8 null . 3 0 -1 . |
| 2 2009-06-27 1710 1 91.7 null . 4 1 0 . |
|----------------------------------------------------------------------------------------|
| 2 2009-06-24 1720 0 234.9 null . 1 0 . . |
| 2 2009-06-25 1720 0 249.6 null . 2 0 . . |
| 2 2009-06-26 1720 0 279.7 null . 3 0 . . |
| 2 2009-06-27 1720 0 198.2 null . 4 0 . . |
|----------------------------------------------------------------------------------------|
| 3 2009-06-24 1700 0 156.1 -1 . 1 0 -1 . |
| 3 2009-06-25 1700 1 19.9 0 . 2 1 0 . |
| 3 2009-06-26 1700 0 195.2 1 . 3 1 1 . |
| 3 2009-06-27 1700 0 306.2 2 . 4 1 2 . |
|----------------------------------------------------------------------------------------|
| 3 2009-06-24 1710 0 150.1 null . 1 0 . . |
| 3 2009-06-25 1710 0 163.7 null . 2 0 . . |
| 3 2009-06-26 1710 0 288.2 null . 3 0 . . |
| 3 2009-06-27 1710 0 311.7 null . 4 0 . . |
|----------------------------------------------------------------------------------------|
| 3 2009-06-24 1720 0 135.1 -2 . 1 0 -2 . |
| 3 2009-06-25 1720 0 186 -1 . 2 0 -1 . |
| 3 2009-06-26 1720 1 297.2 0 . 3 1 0 . |
| 3 2009-06-27 1720 0 375.9 1 . 4 1 1 . |
+----------------------------------------------------------------------------------------+
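The question also asks to drop ids that never call at all and to automate the whole procedure across several datasets. One possible way to do that is to wrap the steps above in a small program; this is only a sketch, and the program name make_offsets, the saveas() option, and the file names in the example call are made up for illustration:

capture program drop make_offsets
program define make_offsets
    syntax using/ [, saveas(string)]
    use "`using'", clear
    * step 1: keep only ids that called at least once during the whole period
    bysort id: egen ever_called = max(called)
    drop if ever_called == 0
    drop ever_called
    * step 2: day offsets relative to each call, as in the answer above
    sort id time day
    by id time: gen order = _n
    by id time: gen call = sum(called)
    sum call, meanonly
    local n = r(max)
    forvalues i = 1/`n' {
        by id time: gen index = order if called & call == `i'
        by id time: egen gindex = total(index)
        gen wanted`i' = order - gindex if gindex > 0
        drop index gindex
    }
    if "`saveas'" != "" save "`saveas'", replace
end

* example call (file names are hypothetical)
make_offsets using "calls_2009.dta", saveas("calls_2009_offsets.dta")

The program simply reruns the same logic on any dataset that uses these variable names and optionally saves the result.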
On postgresql - creating a conditional variable by group and by time and date on panel data, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43705646/