



I have a data.frame with some categorical variables. Suppose sentences is one of these variables:

sentences <- c("Direito à participação e ao controle social",
               "Direito a ser ouvido pelo governo e representantes",
               "Direito aos serviços públicos",
               "Direito de acesso à informação")

For each value, I would like to extract just the first letter of each word, ignoring words that have 4 letters or fewer (e, de, à, a, aos, ser, pelo). My goal is to create acronym variables. I expect the following result:

[1] "DPCS" "DOGR" "DSP"  "DAI"

I tried to subset the values using stringr with a regex pattern found here:

pattern <- "^(\b[A-Z]\w*\s*)+$"
str_subset(str_to_upper(sentences), pattern)

But I got an error when creating the pattern object:

Error: '\w' is an unrecognized escape in character string starting ""^(\b[A-Z]\w"




You can use gsub to delete all the unwanted characters and keep only the ones you want. From the expected output, it seems you are still using letters from words that are 3 characters long, in which case:

gsub('\\b(\\pL)\\pL{2,}|.', '\\U\\1', sentences, perl = TRUE)
[1] "DPCS"   "DSOPGR" "DASP"   "DAI"
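
To see which words contribute a letter under this first pattern, one option is to list the words with three or more letters directly. This is only an inspection aid built on base R's regmatches and gregexpr, not part of the solution itself; the variable names x and kept are made up for the illustration, and x is the second sentence from the question:

```r
# Second sentence from the question (hypothetical variable name `x`)
x <- "Direito a ser ouvido pelo governo e representantes"

# Words with at least 3 letters: these are the ones whose first letter is kept
kept <- regmatches(x, gregexpr("\\pL{3,}", x, perl = TRUE))
kept[[1]]
# "Direito" "ser" "ouvido" "pelo" "governo" "representantes"
```

Both "ser" and "pelo" are still in the list, which is why the result for this sentence is "DSOPGR" rather than "DOGR".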


But if we ignore the words you indicated (those with 4 letters or fewer), it would be:

gsub('\\b(\\pL)\\pL{4,}|.', '\\U\\1', sentences, perl = TRUE)
[1] "DPCS" "DOGR" "DSP"  "DAI"
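
If the length cutoff may change, the two calls above can be folded into one small helper. The sketch below is an assumption on my part rather than anything from the question: make_acronym and min_len are hypothetical names. Note that every backslash is doubled. In an R string literal a single "\w" or "\p" is parsed as an unknown escape, which is what triggers the error when creating the pattern object in the question.

```r
# Hypothetical helper: keep the first letter (upper-cased) of every word that
# has at least `min_len` letters, and delete every other character.
make_acronym <- function(x, min_len = 5) {
  stopifnot(min_len >= 1)
  # (\pL) captures the first letter; \pL{n,} requires n more letters after it,
  # so a matched word has at least n + 1 = min_len letters.
  pattern <- sprintf("\\b(\\pL)\\pL{%d,}|.", min_len - 1)
  # Anything matched only by the trailing "." has an empty group 1 and vanishes
  gsub(pattern, "\\U\\1", x, perl = TRUE)
}

x <- "Direito a ser ouvido pelo governo e representantes"
make_acronym(x)               # "DOGR"
make_acronym(x, min_len = 3)  # "DSOPGR"
```

With the default of 5, words of 4 letters or fewer such as "ser" and "pelo" are skipped; min_len = 3 reproduces the first call in this answer.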

