Extract the first letter of each word in R



Problem description

I have a data.frame with some categorical variables. Suppose sentences is one of these variables:

sentences <- c("Direito à participação e ao controle social",
               "Direito a ser ouvido pelo governo e representantes",
               "Direito aos serviços públicos",
               "Direito de acesso à informação")

For each value, I would like to extract just the first letter of each word, skipping any word with 4 letters or fewer (e, de, à, a, aos, ser, pelo). My goal is to create acronym variables. I expect the following result:

[1] "DPCS" "DOGR" "DSP"  "DAI"

I tried to subset the values using stringr with a regex pattern I found here:

library(stringr)
pattern <- "^(\b[A-Z]\w*\s*)+$"
str_subset(str_to_upper(sentences), pattern)

But I got an error when creating the pattern object:

Error: '\w'  is an escape sequence not recognized in the string beginning with ""^(\b[A-Z]\w"

What am I doing wrong?

Thanks in advance for your help.

Recommended answer
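
On the error itself: R rejects the pattern before stringr ever sees it, because \w, \b and \s are not valid escapes inside an R string literal. Each regex backslash has to be doubled, or the pattern can be written as a raw string. A minimal sketch, assuming R 4.0.0 or later for the raw-string form (the names pattern_escaped and pattern_raw are only illustrative):

```r
# In a regular R string literal, every regex backslash must be doubled.
pattern_escaped <- "^(\\b[A-Z]\\w*\\s*)+$"

# Assumption: R >= 4.0.0. A raw string keeps backslashes exactly as typed.
pattern_raw <- r"(^(\b[A-Z]\w*\s*)+$)"

# Both literals produce the same pattern text.
identical(pattern_escaped, pattern_raw)

# The pattern now parses, but it only tests whether a whole string matches.
grepl(pattern_escaped, toupper("Direito de acesso"), perl = TRUE)
```

Even with the escaping fixed, this pattern would only filter the vector (str_subset keeps the elements that match); it does not pull out initials, so a different approach is still needed for the acronyms.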

You can use gsub to delete all the unwanted characters and keep only the ones you want. The pattern below captures the first letter of every word that has at least 3 letters and drops everything else. From the expected output, it seems you are still using characters from words that are 3 characters long:

gsub('\\b(\\pL)\\pL{2,}|.', '\\U\\1', sentences, perl = TRUE)
[1] "DPCS"   "DSOPGR" "DASP"   "DAI"
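
To see why a single gsub call is enough, it can help to list what the pattern actually matches. The sketch below is only an illustration on a short ASCII sample (the variable names x, pattern and pieces are made up for it); it uses regmatches with gregexpr to show each match before any replacement happens.

```r
x <- "Direito de acesso"
pattern <- "\\b(\\pL)\\pL{2,}|."

# Every character of x ends up in exactly one match:
# either a whole word of 3+ letters, or a single leftover character.
pieces <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
pieces
# "Direito" " " "d" "e" " " "acesso"

# Whole-word matches fill group 1, so '\\U\\1' yields the upper-cased initial.
# Single-character matches leave group 1 empty, so they are replaced by "".
gsub(pattern, "\\U\\1", x, perl = TRUE)
# "DA"
```

The trailing |. branch is what removes short words, spaces and punctuation: anything the first branch cannot match is consumed one character at a time and replaced with nothing.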

But if we ignore the words you indicated (everything with 4 letters or fewer), it becomes:

gsub('\\b(\\pL)\\pL{4,}|.', '\\U\\1', sentences, perl = TRUE)
[1] "DPCS" "DOGR" "DSP"  "DAI"
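
The two calls differ only in the repetition count, so the threshold can be turned into a parameter. The following is a sketch rather than part of the original answer: make_acronym and min_letters are hypothetical names, and the sample string is the second sentence from the question.

```r
# Hypothetical helper: min_letters is the shortest word length that still
# contributes its initial to the acronym.
make_acronym <- function(x, min_letters = 5) {
  stopifnot(min_letters >= 1)
  # One captured leading letter, followed by at least (min_letters - 1) more.
  pattern <- sprintf("\\b(\\pL)\\pL{%d,}|.", as.integer(min_letters - 1))
  gsub(pattern, "\\U\\1", x, perl = TRUE)
}

x <- "Direito a ser ouvido pelo governo e representantes"
make_acronym(x)                   # "DOGR"
make_acronym(x, min_letters = 3)  # "DSOPGR"
```

Because gsub is vectorised, the same helper can be applied to a whole column, for example make_acronym(sentences), to build the acronym variable in one step.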

