Problem description
Here is the dataset:
company <- c("Coca-Cola Inc.", "DF, CocaCola",
"COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand <- c("Coca-Cola Zero","N/A", "Coca-Cola", "Pepsi", "soft drink")
vol <- c("2456","1653", "19", "2766", "167")
data <- data.frame(company, brand, vol)
data
Result:
                 company          brand  vol
1         Coca-Cola Inc. Coca-Cola Zero 2456
2           DF, CocaCola            N/A 1653
3              COCA-COLA      Coca-Cola   19
4           PepsiCo Inc.          Pepsi 2766
5 Beverages Distribution     soft drink  167
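Let's say this is the imported volume by brand.
The task is to subset the data frame so that only observations related to Coca-Cola remain, and no other brand.
- The problem is that Coca-Cola is written in many different ways.
- In addition, we know that the Beverages Distribution company imports only Coca-Cola, even though the table above does not say so.
We need to partially match the company and brand variables against a list of criteria (keys):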
company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")
brand_key <- c("coca-", "cocacola", "coca cola")
I am struggling to implement this idea:
subset the data if brand partially matches ANY key from the brand_key vector OR company partially matches ANY key from the company_key vector.
So, keep only the rows in which:
(brand partially matches "coca-" OR "cocacola" OR "coca cola")
OR
(company partially matches "coca-" OR "cocacola" OR "coca cola" OR "beverages distribution")
Note: the matching needs to be case-insensitive.
Desired output:
                 company          brand  vol
1         Coca-Cola Inc. Coca-Cola Zero 2456
2           DF, CocaCola            N/A 1653
3              COCA-COLA      Coca-Cola   19
5 Beverages Distribution     soft drink  167
Any ideas? Thanks in advance :)
Recommended answer
Use a regular expression with its | (or) operator. The ignore.case argument takes care of case-insensitive matching.
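As a quick illustration of that idea (a minimal sketch that reuses the company_key vector from the question; the two test strings are only examples), collapsing the keys with paste0 gives a single alternation pattern, which grepl then tests against every string:

```r
company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")

# Join the keys into one regular expression of the form key1|key2|...
pattern <- paste0(company_key, collapse = "|")
pattern

# TRUE wherever any key occurs anywhere in the string, ignoring case
hits <- grepl(pattern, c("COCA-COLA", "PepsiCo Inc."), ignore.case = TRUE)
hits
```

Here pattern is "coca-|cocacola|coca cola|beverages distribution", and hits is TRUE for "COCA-COLA" and FALSE for "PepsiCo Inc.".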
# Collapse each key vector into one "key1|key2|..." pattern; match company and brand separately
index <- grepl(paste0(company_key, collapse = "|"), data$company, ignore.case = TRUE) |
  grepl(paste0(brand_key, collapse = "|"), data$brand, ignore.case = TRUE)
data[index, ]
#                 company          brand  vol
#1         Coca-Cola Inc. Coca-Cola Zero 2456
#2           DF, CocaCola            N/A 1653
#3              COCA-COLA      Coca-Cola   19
#5 Beverages Distribution     soft drink  167
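One caveat with the alternation approach is that the keys are interpreted as regular expressions, so a key containing characters such as "." or "(" might match more than intended. A possible variant, sketched below under that assumption, compares lower-cased text literally with fixed = TRUE. The helper name matches_any is hypothetical and not part of the original answer.

```r
company <- c("Coca-Cola Inc.", "DF, CocaCola",
             "COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand <- c("Coca-Cola Zero", "N/A", "Coca-Cola", "Pepsi", "soft drink")
vol <- c("2456", "1653", "19", "2766", "167")
data <- data.frame(company, brand, vol)

company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")
brand_key <- c("coca-", "cocacola", "coca cola")

# Hypothetical helper: TRUE for each element of x that contains at least
# one of the keys as a literal substring, ignoring case.
matches_any <- function(x, keys) {
  x <- tolower(x)
  # One logical vector per key, combined element-wise with OR
  Reduce(`|`, lapply(tolower(keys), function(k) grepl(k, x, fixed = TRUE)))
}

index <- matches_any(data$company, company_key) |
  matches_any(data$brand, brand_key)
data[index, ]
```

For the data in the question this should keep rows 1, 2, 3 and 5, the same rows as the regex version.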