How to create a co-occurrence matrix calculated from combinations by ID/row in R?

Problem description

Thanks to @jazzurro for his answer. It made me realize that the duplicates may just complicate things. I hope that keeping only unique values per row simplifies the task.*

df <- data.frame(ID = c(1,2,3,4,5),
                  CTR1 = c("England", "England", "England", "China", "Sweden"),
                  CTR2 = c("China", "China", "China", "England", NA),
                  CTR3 = c("USA", "USA", "USA", "USA", NA),
                  CTR4 = c(NA, NA, NA, NA, NA),
                  CTR5 = c(NA, NA, NA, NA, NA),
                  CTR6 = c(NA, NA, NA, NA, NA))


ID CTR1    CTR2    CTR3 CTR4 CTR5 CTR6
1  England China   USA
2  England China   USA
3  England China   USA
4  China   England USA
5  Sweden

The goal is still to create a co-occurrence matrix, now based on the following four conditions:

Single observations without additional observations by ID/row are not considered, i.e. a row that contains only a single country is counted as 0.

A combination/co-occurrence should be counted as 1.

Being in a combination also counts as a self-combination (USA-USA), i.e. a value of 1 is assigned.

No value over 1 is assigned to a combination per row/ID.

Desired result

         China   England   USA   Sweden
China    4        4         4      0
England  4        4         4      0
USA      4        4         4      0
Sweden   0        0         0      0

*I've used the code from here to remove all non-unique observations.
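
The linked code is not reproduced here. As a rough sketch of what such a de-duplication step could look like in base R, the following keeps each country once per row and pads the rest with NA. The helper name `dedupe_rows` and the small `df_dup` data frame are made up for illustration and are not the code referred to above.

```r
# Hypothetical helper (not the linked code): keep each country once per row,
# in order of first appearance, and pad the remaining columns with NA.
dedupe_rows <- function(d, id_col = "ID") {
  ctr <- setdiff(names(d), id_col)
  vals <- t(apply(as.matrix(d[ctr]), 1, function(x) {
    u <- unique(x[!is.na(x) & nzchar(x)])
    c(u, rep(NA_character_, length(x) - length(u)))
  }))
  out <- data.frame(d[id_col], vals, stringsAsFactors = FALSE)
  names(out) <- c(id_col, ctr)
  out
}

# Illustrative input with a duplicate in row 1.
df_dup <- data.frame(ID = c(1, 2),
                     CTR1 = c("England", "Sweden"),
                     CTR2 = c("England", NA),
                     CTR3 = c("USA", NA),
                     stringsAsFactors = FALSE)

df_unique <- dedupe_rows(df_dup)
df_unique
```

Row 1 ends up as England, USA, NA, while the single-country row is left unchanged.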

Assume I have a data set with a low two-digit number of columns (some NA/empty) and more than 100,000 rows, represented by the following example data frame:

df <- data.frame(ID = c(1,2,3,4,5),
                  CTR1 = c("England", "England", "England", "China", "England"),
                  CTR2 = c("England", "China", "China", "England", NA),
                  CTR3 = c("England", "China", "China", "England", NA),
                  CTR4 = c("China", "USA", "USA", "China", NA),
                  CTR5 = c("USA", "England", "USA", "USA", NA),
                  CTR6 = c("England", "China", "USA", "England", NA))


df

ID   CTR1    CTR2    CTR3    CTR4   CTR5    CTR6
1    England England England China  USA     England
2    England China   China   USA    England China
3    England China   China   USA    USA     USA
4    China   England England China  USA     England
5    England

I want to count the co-occurrences by ID/row to get a co-occurrence matrix that counts each co-occurrence only once per ID/row, meaning that no value over 1 is allocated to a combination within a row (i.e. assign a value of 1 for the existence of a co-occurrence, independent of in-row frequencies and order, and a value of 0 if there is no co-occurrence/combination in that ID/row):

1 England-England-England => 1
2 England-England => 1
3 England-China => 1
4 England- => 0

Another important aspect concerns the counting of observations that appear only once in a row but in combination with others, e.g. USA in row 1. They should get a value of 1 for their own co-occurrence (as they are in a combination, even though not with themselves), so that the combination USA-USA is also assigned a value of 1.

1    England England England China  USA  England

USA-USA => 1
China-China => 1
USA-China => 1
England-England => 1
England-USA => 1
England-China => 1

Because the count for a combination must not exceed 1 per row/ID, this results in:

        China   England   USA
China    1        1         1
England  1        1         1
USA      1        1         1
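
The per-row rule can be sketched in base R as follows. This is only an illustration of the counting rule for row 1, with variable names (`row1`, `present`, `row_mat`) chosen here for the example.

```r
# Row 1 of the example data as a plain character vector.
row1 <- c("England", "England", "England", "China", "USA", "England")

# Distinct countries in the row; in-row frequency and order are ignored.
present <- sort(unique(row1[!is.na(row1)]))

# If at least two distinct countries are present, every pair (including each
# country with itself) gets 1; otherwise the row contributes 0.
value <- if (length(present) >= 2) 1L else 0L
row_mat <- matrix(value, nrow = length(present), ncol = length(present),
                  dimnames = list(present, present))
row_mat
```

Summing such per-row matrices over all IDs gives the overall co-occurrence matrix.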

For the example data frame, this should lead to the following result, where a value of 4 is assigned to each combination because each combination occurs in four rows and each country is part of a combination in those rows:

         China   England   USA
China    4        4         4
England  4        4         4
USA      4        4         4

So there are five conditions for counting:

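Single observations without additional observations by ID/row are not considered, i.e. a row that contains only a single country is not counted.
A combination should be counted as 1.
Multiple observations do not result in a higher value for the interaction, i.e. multiple occurrences of the same country in a row do not matter.
Being in a combination (even if the same country does not occur twice in the row) also counts as a self-combination, i.e. a value of 1 is assigned.
No value over 1 is assigned to a combination per row/ID.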
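
One way to read these five conditions together is as an incidence matrix (one row per ID, one column per country, 1 if the country is present) in which rows with fewer than two distinct countries are zeroed before taking the cross product. The following base R sketch follows that reading. The function name `cooccurrence` is made up for this example and this is not necessarily the approach used in the answer.

```r
# Example data frame from the question.
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 CTR1 = c("England", "England", "England", "China", "England"),
                 CTR2 = c("England", "China", "China", "England", NA),
                 CTR3 = c("England", "China", "China", "England", NA),
                 CTR4 = c("China", "USA", "USA", "China", NA),
                 CTR5 = c("USA", "England", "USA", "USA", NA),
                 CTR6 = c("England", "China", "USA", "England", NA),
                 stringsAsFactors = FALSE)

# Hypothetical helper implementing the five conditions.
cooccurrence <- function(d, id_col = "ID") {
  vals <- as.matrix(d[setdiff(names(d), id_col)])
  vals[!is.na(vals) & !nzchar(vals)] <- NA          # treat "" like NA
  countries <- sort(unique(vals[!is.na(vals)]))
  # Incidence matrix: 1 if the country occurs at least once in the row.
  inc <- matrix(0L, nrow = nrow(vals), ncol = length(countries),
                dimnames = list(NULL, countries))
  for (j in seq_along(countries)) {
    inc[, j] <- as.integer(rowSums(vals == countries[j], na.rm = TRUE) > 0)
  }
  # Rows with fewer than two distinct countries contribute nothing.
  inc[rowSums(inc) < 2, ] <- 0L
  # Each remaining row adds 1 to every pair of countries it contains,
  # including each country paired with itself.
  crossprod(inc)
}

res <- cooccurrence(df)
res
```

For the example data this gives 4 in every cell, since row 5 (England only) is ignored.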

I've tried to implement this using dplyr, data.table, base aggregate or plyr, adapting code from [1], [2], [3], [4], [5] and [6], but since I don't care about the order within a row and also don't want to sum up all combinations within a row, I haven't obtained the desired result so far.

I'm a novice in R. Any help is very much appreciated.

Recommended answer

Data

I modified your data so that it better represents your actual situation.

#   ID    CTR1    CTR2    CTR3  CTR4    CTR5    CTR6
#1:  1 England England England China     USA England
#2:  2 England   China   China   USA England   China
#3:  3 England   China   China   USA     USA     USA
#4:  4   China England England China     USA England
#5:  5  Sweden    <NA>    <NA>  <NA>            <NA>


df <- structure(list(ID = c(1, 2, 3, 4, 5), CTR1 = c("England", "England",
"England", "China", "Sweden"), CTR2 = c("England", "China", "China",
"England", NA), CTR3 = c("England", "China", "China", "England",
NA), CTR4 = c("China", "USA", "USA", "China", NA), CTR5 = c("USA",
"England", "USA", "USA", ""), CTR6 = c("England", "China", "USA",
"England", NA)), class = c("data.table", "data.frame"), row.names = c(NA,
-5L))

Update

After seeing the OP's previous question, I got a clear picture in my mind. I think this is what you want, Seb.

library(data.table)

# Transform the data to long format. Remove rows whose value is an empty string ("") or NA.

melt(setDT(df), id.vars = "ID", measure = patterns("^CTR"))[nchar(value) > 0 & complete.cases(value)] -> foo

# Get distinct value (country) in each ID group (each row)
unique(foo, by = c("ID", "value")) -> foo2

# https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix
# Seeing this question, you want to create a matrix with crossprod().

crossprod(table(foo2[, c(1,3)])) -> mymat

# Finally, adjust the diagonal. If a diagonal value is at most one,
# change it to zero. Otherwise, keep the original diagonal value.

diag(mymat) <- ifelse(diag(mymat) <= 1, 0, diag(mymat))

#value
#value     China England Sweden USA
#China       4       4      0   4
#England     4       4      0   4
#Sweden      0       0      0   0
#USA         4       4      0   4
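
The diagonal adjustment above is based on the total count per country. A possible alternative, shown here only as a sketch in base R, is to drop IDs that have a single country before calling crossprod(), so that no adjustment of the diagonal is needed afterwards. The `long` data frame below is hypothetical (Sweden is the only country in IDs 3 and 4), and the names `n_per_id`, `keep` and `mymat2` are chosen for this example.

```r
# Hypothetical long-format data: one row per ID/country pair, already
# de-duplicated. Sweden is the only country in IDs 3 and 4.
long <- data.frame(ID    = c(1, 1, 2, 2, 3, 4),
                   value = c("China", "USA", "China", "England",
                             "Sweden", "Sweden"),
                   stringsAsFactors = FALSE)

# Fix the factor levels first so that dropped countries still get a
# row and column in the result.
long$value <- factor(long$value)

# Number of distinct countries per ID; keep only IDs with at least two.
n_per_id <- ave(seq_along(long$ID), long$ID, FUN = length)
keep <- long[n_per_id >= 2, ]

# Cross product of the ID x country incidence table.
mymat2 <- crossprod(unclass(table(keep$ID, keep$value)))
mymat2
```

In this made-up data Sweden gets 0 everywhere although it appears in two rows, and England keeps a diagonal value of 1 because it occurs in one row together with China.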
