Problem description
Say I have a data set in which sequences of length 1 are illegal, sequences of length 2 are legal, and sequences longer than 5 are illegal, although longer sequences may be broken up into sequences of length <= 5.
library(data.table)
set.seed(1)
DT1 <- data.table(smp = 1, R=sample(0:1, 20000, rep=TRUE), Seq = 0L)
DT1[, smp:=1:length(smp)]
DT1[, Seq:=seq(.N), by=list(cumsum(c(0, abs(diff(R)))))]
This last line comes directly from: Creating a sequence in a data.table depending on a column
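As a minimal sketch of how the run index in the `Seq` line works, the same idiom can be applied to a short vector. The `toy` table and its values are hypothetical and chosen only for illustration; the `run` column is added here just to make the grouping visible.

```r
library(data.table)

# Toy input (hypothetical values, not taken from the question's data)
toy <- data.table(R = c(0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L))

# abs(diff(R)) is 1 wherever R changes, so the cumulative sum
# steps up by one at the start of every new run
toy[, run := cumsum(c(0, abs(diff(R))))]

# Seq counts the position of each row within its run
toy[, Seq := seq(.N), by = run]

print(toy)
# run: 0 1 1 2 3 3 3 4 4
# Seq: 1 1 2 1 1 2 3 1 2
```

Grouping by the cumulative sum restarts `Seq` at every change in `R`, which is what the `by=list(cumsum(...))` expression in the question does without storing the helper column.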
DT1[, fix_min:=ifelse((R==TRUE & Seq==1) | (R==FALSE), FALSE, TRUE)]
fixmin_idx2 <- which(DT1[, fix_min==TRUE])
DT1[fixmin_idx2 -1, fix_min:=TRUE]
Now my legal length-2 sequences are properly marked. Next, break up the sequences longer than 5.
DT1[R==1 & Seq==6, fix_min:=FALSE]
DT1[,Seq2:=seq(.N), by=list(cumsum(c(0, abs(diff(fix_min)))))]
DT1[R==1 & Seq2==6, fix_min:=FALSE]
fixSeq2_idx7 <- which(DT1[,fix_min==TRUE] & DT1[,Seq2==7])
fixSeq2_idx7
[1] 10203 13228
DT1[fixSeq2_idx7,]
     smp R Seq fix_min Seq2
1: 10203 1  13    TRUE    7
2: 13228 1  13    TRUE    7
DT1[fixSeq2_idx7 + 1,]
     smp R Seq fix_min Seq2
1: 10204 1  14    TRUE    8
2: 13229 0   1   FALSE    1
And now I need to test whether a Seq2==7 is followed by a Seq2==8, which would be a legal sequence of length 2. I have one 7 followed by an 8 and one not followed by an 8. And there I'm stuck. Everything I've tried either sets all fix_min to TRUE or alternates TRUE and FALSE.
Any guidance greatly appreciated.
If I understand your question correctly, you want to set fix_min to FALSE when R == 0 or when R == 1 & (1 <= Seq < 6 | Seq > 6). Then the following should give you what you want:
# recreating the data from your first code block
library(data.table)
set.seed(1)
DT1 <- data.table(R=sample(0:1, 20000, rep=TRUE))[, smp:=.I
][, Seq:=seq(.N), by=rleid(R)
][, Seq2 := Seq[.N], by=rleid(R)]
# adding the needed 'fix_min' column
DT1[, fix_min := (R==1 & Seq[.N] > 1 & Seq%%6!=0), by=rleid(R)
][R==1 & Seq%%6==1 & Seq2%%6==1 & Seq==Seq2, fix_min := FALSE]
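To see what these two steps do on a small input, here is a sketch that applies the same expressions to a hypothetical 13-row vector containing runs of ones of length 1, 2 and 7. The `toy` table is an assumption made for illustration only.

```r
library(data.table)

# Hypothetical input: runs of 1s with lengths 1, 2 and 7, separated by 0s
toy <- data.table(R = c(1L, 0L, 1L, 1L, 0L, rep(1L, 7), 0L))

# Position within each run (Seq) and total length of the run (Seq2)
toy[, Seq := seq(.N), by = rleid(R)
  ][, Seq2 := Seq[.N], by = rleid(R)]

# Step 1: TRUE for 1s in runs longer than one, except every sixth position
# Step 2: drop the single leftover row at the end of runs of length 7, 13, ...
toy[, fix_min := (R == 1 & Seq[.N] > 1 & Seq %% 6 != 0), by = rleid(R)
  ][R == 1 & Seq %% 6 == 1 & Seq2 %% 6 == 1 & Seq == Seq2, fix_min := FALSE]

print(toy)
# fix_min: F F T T F T T T T T F F F
```

The lone 1 in the first row stays FALSE, the pair is kept whole, and the run of 7 keeps its first five rows while rows 6 and 7 of that run are set to FALSE.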
Explanation:
data.table(R=sample(0:1, 20000, rep=TRUE)) creates the base of the data.table.
[, smp:=.I] creates an index and adds it to the data.table.
by=rleid(R) identifies the sequences; to see what it does, try: data.table(R=sample(0:1, 20000, rep=TRUE))[, seq.id:=rleid(R)]
[, Seq:=seq(.N), by=rleid(R)] creates an index for each sequence and adds it to the data.table; the sequences are identified by rleid(R).
[, Seq2 := Seq[.N], by=rleid(R)] creates a variable that contains the length of the sequence.
fix_min := (R==1 & Seq[.N] > 1 & Seq%%6!=0) creates a logical vector with TRUE values where R==1 & the length of the sequence is larger than one (Seq[.N] > 1), excluding the values where the sequence number is a multiple of 6 (Seq%%6!=0).
R==1 & Seq%%6==1 & Seq2%%6==1 & Seq==Seq2 filters the data.table as follows: only rows where R==1 & the sequence value is 7, 13, 19, etc. (Seq%%6==1) & the length of the sequence is 7, 13, 19, etc. (Seq2%%6==1), and only the last row (Seq==Seq2) of the sequences that meet the other conditions is selected. With fix_min := FALSE you set them to FALSE.
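One way to check the result, sketched here under the assumption that the goal is for every TRUE block in fix_min to have a length between 2 and 5, is to run base R's rle() over the column. The names `runs` and `true_lengths` are introduced only for this check.

```r
library(data.table)

# Recreate the data and the fix_min column as in the answer above
set.seed(1)
DT1 <- data.table(R = sample(0:1, 20000, replace = TRUE))[, smp := .I
  ][, Seq := seq(.N), by = rleid(R)
  ][, Seq2 := Seq[.N], by = rleid(R)]
DT1[, fix_min := (R == 1 & Seq[.N] > 1 & Seq %% 6 != 0), by = rleid(R)
  ][R == 1 & Seq %% 6 == 1 & Seq2 %% 6 == 1 & Seq == Seq2, fix_min := FALSE]

# Run-length encode fix_min and keep only the lengths of the TRUE blocks
runs <- rle(DT1$fix_min)
true_lengths <- runs$lengths[runs$values]

# Frequency of each block length; every length should lie in 2..5
print(table(true_lengths))
stopifnot(all(true_lengths >= 2), all(true_lengths <= 5))
```

Because every run of ones is cut at positions 6, 12, 18 and so on, and a single trailing row is dropped, no TRUE block can be shorter than 2 or longer than 5.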