Problem description
I am trying to tokenize a sentence as follows.
Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section)
When I tokenize using tidytext and the code below,
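library(dplyr)    # %>%, mutate(), select()
library(stringr)  # str_extract_all(), str_locate_all()
library(purrr)    # map()
library(tidyr)    # unnest()

AA <- df %>%
  mutate(tokens = str_extract_all(Section, "([^\\s]+)"),   # runs of non-whitespace
         locations = str_locate_all(Section, "([^\\s]+)"), # start/end of each run
         locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(c(tokens, locations))  # tidyr >= 1.0 expects the columns wrapped in c()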
it gives me a result set in which the punctuation stays attached to the preceding word (the original post showed this in an image).
How do I get the comma and the period as independent tokens, rather than as part of 'occurs,' and 'infusion.' respectively, using tidytext? My tokens should be:
If
an
infusion
reaction
occurs
,
interrupt
the
infusion
.
Recommended answer
Before splitting, replace each punctuation mark you want to keep with the same mark preceded by a space. Then split the sentence at spaces, so each mark becomes its own token.
include <- c(".", ",")  # the symbols that should become separate tokens
mystr <- Section        # copy the data
for (mypattern in include) {
  # fixed = TRUE treats "." as a literal character, not a regex wildcard
  mystr <- gsub(pattern = mypattern,
                replacement = paste0(" ", mypattern),
                x = mystr, fixed = TRUE)
}
lapply(strsplit(mystr, " "), function(V) data.frame(Tokens = V))
#[[1]]
# Tokens
#1 If
#2 an
#3 infusion
#4 reaction
#5 occurs
#6 ,
#7 interrupt
#8 the
#9 infusion
#10 .
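As an alternative sketch (not part of the original answer), the same tokens can be obtained in one pass with a regular expression that matches either a run of letters/digits or a single punctuation character. This version uses only base R and also recovers the start and end positions the question was collecting with str_locate_all. The helper name tokenize_with_punct is hypothetical, and the pattern is an assumption about what should count as a token: it would also split words such as "don't" or "follow-up" at the punctuation.

# Sentence from the question
Section <- c("If an infusion reaction occurs, interrupt the infusion.")

# Hypothetical helper: returns one data frame of tokens per input string
tokenize_with_punct <- function(x) {
  pattern <- "[[:alnum:]]+|[[:punct:]]"  # a word, or one punctuation mark
  matches <- gregexpr(pattern, x)        # match positions for each string
  tokens  <- regmatches(x, matches)      # matched text for each string
  Map(function(tok, m) {
    start <- as.integer(m)
    len   <- attr(m, "match.length")
    hit   <- start > 0                   # gregexpr reports -1 when nothing matches
    data.frame(tokens = tok,
               start  = start[hit],
               end    = (start + len - 1L)[hit],
               stringsAsFactors = FALSE)
  }, tokens, matches)
}

result <- tokenize_with_punct(Section)
print(result[[1]])

Here the comma and the period come out as their own rows, each with the same start and end position in the original string, because nothing is inserted into the text before matching.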