Problem description
I need a fast and concise way to split delimited strings in a data frame into a set of columns. Let's say I have this data frame:
data <- data.frame(id=c(1,2,3), tok1=c("a, b, c", "a, a, d", "b, d, e"), tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta") )
(Please note that the columns use different delimiters.)
The number of string columns is usually not known in advance (although I can try to discover the whole set of cases if I have no alternative).
I need two data frames like these:
tok1.occurrences:
+----+---+---+---+---+---+
| id | a | b | c | d | e |
+----+---+---+---+---+---+
| 1 | 1 | 1 | 1 | 0 | 0 |
| 2 | 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 1 | 1 |
+----+---+---+---+---+---+
tok2.occurrences:
+----+-------+-------+---------+-------+-------+
| id | alpha | bravo | charlie | delta | tango |
+----+-------+-------+---------+-------+-------+
| 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 2 |
+----+-------+-------+---------+-------+-------+
I tried using this syntax:
tok1.f = factor(data$tok1)
dummies <- model.matrix(~tok1.f)
This turned out to be an incomplete solution. It creates my dummy variables correctly, but (obviously) it does not split the strings on the delimiter.
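The reason can be seen by inspecting the factor levels that model.matrix works from. The following is a minimal sketch, assuming the tok1 values from the sample data are stored as plain strings; it only illustrates the problem and is not a solution.

```r
# tok1 values copied from the sample data frame
tok1 <- c("a, b, c", "a, a, d", "b, d, e")

# Each distinct *whole string* becomes one factor level;
# nothing here knows about the ", " delimiter
tok1.f <- factor(tok1)
print(levels(tok1.f))

# model.matrix therefore builds its columns from whole strings,
# not from the individual tokens a, b, c, d, e
dummies <- model.matrix(~tok1.f)
print(colnames(dummies))
```

Because the levels are the unsplit strings, the tokens have to be separated before any dummy or count columns are built.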
I know I can use the 'tm' package to build a document-term matrix, but that seems like overkill for such simple tokenization. Is there a more straightforward way?
Recommended answer
The easiest thing that I can think of is to use my cSplit function in conjunction with dcast.data.table, like this:
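library(splitstackshape)
library(data.table) # provides dcast.data.table; loaded explicitly in case splitstackshape does not attach it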
dcast.data.table(cSplit(data, "tok1", ", ", "long"),
id ~ tok1, value.var = "tok1",
fun.aggregate = length)
# id a b c d e
# 1: 1 1 1 1 0 0
# 2: 2 2 0 0 1 0
# 3: 3 0 1 0 1 1
dcast.data.table(cSplit(data, "tok2", "|", "long"),
id ~ tok2, value.var = "tok2",
fun.aggregate = length)
# id alpha bravo charlie delta tango
# 1: 1 1 1 0 0 0
# 2: 2 1 0 1 0 0
# 3: 3 0 0 0 1 2
Updated with library(splitstackshape), since cSplit is now part of that package.
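For readers who would rather avoid extra packages, roughly the same counts can be produced in base R with strsplit and table. This is a minimal sketch and not part of the original answer: count_tokens is a hypothetical helper name, and it assumes every row contains at least one token (a row with an empty string would be dropped from the result).

```r
# Sample data from the question, with strings kept as character vectors
data <- data.frame(
  id   = c(1, 2, 3),
  tok1 = c("a, b, c", "a, a, d", "b, d, e"),
  tok2 = c("alpha|bravo", "alpha|charlie", "tango|tango|delta"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count how often each token occurs per id
count_tokens <- function(df, col, sep) {
  # fixed = TRUE treats sep literally, so "|" is not read as a regex
  tokens <- strsplit(as.character(df[[col]]), sep, fixed = TRUE)

  # Long format: one row per (id, token) pair, whitespace trimmed
  long <- data.frame(
    id    = rep(df$id, lengths(tokens)),
    token = trimws(unlist(tokens)),
    stringsAsFactors = FALSE
  )

  # Cross-tabulate ids against tokens, then turn the table into columns
  counts <- as.data.frame.matrix(table(long$id, long$token))
  out <- cbind(id = sort(unique(long$id)), counts)
  rownames(out) <- NULL
  out
}

tok1.occurrences <- count_tokens(data, "tok1", ",")
tok2.occurrences <- count_tokens(data, "tok2", "|")
print(tok1.occurrences)
print(tok2.occurrences)
```

Splitting on "," and trimming afterwards tolerates inconsistent spacing around the delimiter. On larger data the cSplit approach above is likely to be faster, since it is built on data.table.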