How to use discrepancy analysis with TraMineR and aggregated sequence data?

Problem description

As I have a big dataset and only limited computational resources, I want to use aggregated sequence objects for a discrepancy analysis with the R packages TraMineR and WeightedCluster. But I am struggling to find the right syntax for doing so.

In the example code below you will find two discrepancy analyses: the first tree diagram uses the original dataset, the second uses aggregated data (that is, only the unique sequences weighted by their frequencies).
Unfortunately, the results do not match. Do you have any idea why?

Example code

library(TraMineR)
library(WeightedCluster)

## Load example data and assign labels
data(mvad)
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.labels <- c("Employment", "Further Education", "Higher Education",
                 "Joblessness", "School", "Training")
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")

## Aggregate example data
mvad.agg <- wcAggregateCases(mvad[, 17:86], weights=mvad$weight)
mvad.agg

## Define sequence object
mvad.seq <- seqdef(mvad[, 17:86], alphabet=mvad.alphabet, states=mvad.scodes,
                   labels=mvad.labels, weights=mvad$weight, xtstep=6)
mvad.agg.seq <- seqdef(mvad[mvad.agg$aggIndex, 17:86], alphabet=mvad.alphabet,
                       states=mvad.scodes, labels=mvad.labels,
                       weights=mvad.agg$aggWeights, xtstep=6)

## Computing OM dissimilarities
mvad.dist <- seqdist(mvad.seq, method="OM", indel=1.5, sm="CONSTANT")
mvad.agg.dist <- seqdist(mvad.agg.seq, method="OM", indel=1.5, sm="CONSTANT")

## Discrepancy analysis
tree <- seqtree(mvad.seq ~ gcse5eq + Grammar + funemp,
                data=mvad, diss=mvad.dist, weight.permutation="diss")
seqtreedisplay(tree, type="d", border=NA)
tree.agg <- seqtree(mvad.agg.seq ~ gcse5eq + Grammar + funemp,
                    data=mvad[mvad.agg$aggIndex, ], diss=mvad.agg.dist,
                    weight.permutation="diss")
seqtreedisplay(tree.agg, type="d", border=NA)

Recommended answer

The procedure you are using for the aggregated data is wrong because you do not take the explanatory covariates into account when aggregating. As a result, each unique sequence is attributed to an almost random covariate profile, which gives wrong results.
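To make the problem concrete, here is a minimal base R sketch on made-up toy data (the data frame `toy` and its columns are hypothetical, and `duplicated()` stands in for the aggregation step, assuming each aggregate is represented by one of its cases). Two cases share the same sequence but differ on a covariate; aggregating on the sequence columns alone keeps only one of the two covariate values.

```r
# Toy data (hypothetical): rows 1 and 2 share a sequence but differ on 'grammar'
toy <- data.frame(
  grammar = c("no", "yes", "no"),
  t1 = c("school", "school", "training"),
  t2 = c("FE", "FE", "employment"),
  stringsAsFactors = FALSE
)

# Aggregating on the sequence columns only: one row per unique sequence
seq.key <- paste(toy$t1, toy$t2)
first.only <- toy[!duplicated(seq.key), ]
# The "school FE" sequence keeps grammar = "no"; the "yes" case is silently dropped
print(first.only)

# Aggregating on covariate and sequence columns together keeps both profiles
full.key <- paste(toy$grammar, toy$t1, toy$t2)
both <- toy[!duplicated(full.key), ]
print(both)
```

With the covariate included in the key, a sequence that occurs under two covariate profiles yields two aggregates, so the tree sees the correct profile for each.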

What you need to do is aggregate the sequences together with the covariates. Here the covariates "Grammar", "funemp" and "gcse5eq" are located in columns 10 to 12, so:

## Aggregate example data
mvad.agg <- wcAggregateCases(mvad[, c(10:12, 17:86)], weights=mvad$weight)
mvad.agg

We then come to the next problem: the permutation test. If you do nothing, only the aggregates are permuted (permutations inside aggregates are omitted), which gives you wrong p-values. Two solutions can be used:

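  • If you have no sampling weights, use weight.permutation="replicate" to tell the procedure to permute inside the aggregates as well, with each case counting as one unit.
  • If you have sampling weights, there is no perfect procedure. You can use weight.permutation="random-sampling" (covariate profiles are randomly assigned to the objects using the distribution defined by the weights).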
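Putting the two points together, the aggregated analysis from the question could be rewritten roughly as follows. This is a sketch, not a verified reproduction of the original tree: it assumes the mvad weights are treated as sampling weights (hence "random-sampling"), `mvad.unique` is a helper name introduced here for illustration, and `set.seed()` is only there to make the permutation test repeatable.

```r
library(TraMineR)
library(WeightedCluster)

## Load example data and assign labels (as in the question)
data(mvad)
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.labels <- c("Employment", "Further Education", "Higher Education",
                 "Joblessness", "School", "Training")
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")

## Aggregate on covariates (columns 10 to 12) AND sequences (columns 17 to 86)
mvad.agg <- wcAggregateCases(mvad[, c(10:12, 17:86)], weights=mvad$weight)
mvad.unique <- mvad[mvad.agg$aggIndex, ]  # one representative row per aggregate

## Weighted sequence object and OM dissimilarities for the aggregates
mvad.agg.seq <- seqdef(mvad.unique[, 17:86], alphabet=mvad.alphabet,
                       states=mvad.scodes, labels=mvad.labels,
                       weights=mvad.agg$aggWeights, xtstep=6)
mvad.agg.dist <- seqdist(mvad.agg.seq, method="OM", indel=1.5, sm="CONSTANT")

## Discrepancy analysis on the aggregates; sampling weights are present,
## so "random-sampling" is used for the permutation test
set.seed(1)
tree.agg <- seqtree(mvad.agg.seq ~ gcse5eq + Grammar + funemp,
                    data=mvad.unique, diss=mvad.agg.dist,
                    weight.permutation="random-sampling")
print(tree.agg)
```

Because the aggregation now distinguishes covariate profiles, there are more aggregates than unique sequences, so the saving in computation is smaller than with sequence-only aggregation.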

In all cases, you may still observe small differences in the p-values, both because the procedure is different and because p-values are estimated using permutation tests. To get more precise p-values, try a higher R value (number of permutations). In the tree procedure, the minimum p-value required to make a split can be changed with the pval argument. You can try setting it just a little higher to see whether the differences come from there.
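As a rough guide to how much a permutation p-value can move from run to run, the sketch below uses the usual binomial approximation for the standard error of a proportion estimated from R random permutations. The helper `pval.se` and the example p-value of 0.01 are illustrative assumptions, not part of TraMineR.

```r
# Approximate standard error of a p-value estimated from R random permutations
# (binomial approximation: sqrt(p * (1 - p) / R))
pval.se <- function(p, R) sqrt(p * (1 - p) / R)

p <- 0.01                     # hypothetical true p-value close to a split threshold
se.1000 <- pval.se(p, 1000)   # with 1000 permutations
se.5000 <- pval.se(p, 5000)   # with 5000 permutations
print(c(R1000 = se.1000, R5000 = se.5000))

# In the tree call this corresponds to raising R, and optionally pval, e.g.
# seqtree(..., R = 5000, pval = 0.05)
```

When the estimated p-value of a candidate split is within a few standard errors of pval, the original and the aggregated analyses can end up with different trees purely by chance.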

Hope this helps.

