This article looks at what to do when a lemmatization function based on a hash dictionary does not work with the tm package in R. Hopefully it is a useful reference for anyone facing the same problem.

Problem Description


I would like to lemmatize Polish text using a large external dictionary (in the format of the txt variable below). Unfortunately, the popular text mining packages do not offer a Polish option. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts. (I have also removed Polish diacritics from both the dictionary and the corpus.)


Unfortunately it does not work with the corpus format generated by tm. Let me paste Dmitriy's code:

library(hashmap)
library(data.table)
txt =
  "Abadan  Abadanem
  Abadan  Abadanie
  Abadan  Abadanowi
  Abadan  Abadanu
  abadańczyk  abadańczycy
  abadańczyk  abadańczykach
  abadańczyk  abadańczykami
  "
dt = fread(txt, header = F, col.names = c("lemma", "word"))
lemma_hm = hashmap(dt$word, dt$lemma)

lemma_hm[["Abadanu"]]
#"Abadan"


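# swap each token for its lemma when the hashmap has an entry;
# returns a list of token vectors, one per input document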
lemma_tokenizer = function(x, lemma_hashmap,
                           tokenizer = text2vec::word_tokenizer) {
  tokens_list = tokenizer(x)
  for(i in seq_along(tokens_list)) {
    tokens = tokens_list[[i]]
    replacements = lemma_hashmap[[tokens]]
    ind = !is.na(replacements)
    tokens_list[[i]][ind] = replacements[ind]
  }
  tokens_list
}
texts = c("Abadanowi abadańczykach OutOfVocabulary",
          "abadańczyk Abadan OutOfVocabulary")
lemma_tokenizer(texts, lemma_hm)

#[[1]]
#[1] "Abadan"          "abadańczyk"      "OutOfVocabulary"
#[[2]]
#[1] "abadańczyk"      "Abadan"          "OutOfVocabulary"
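
# my attempt to apply the tokenizer to a tm corpus (this fails):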
docs <- tm_map(docs, function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm"))


Another syntax that I tried:

LemmaTokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")

docsTDM <-
  DocumentTermMatrix(docs, control = list(wordLengths = c(4, 25), tokenize=LemmaTokenizer))

It throws an error:

 Error in lemma_hashmap[[tokens]] :
  attempt to select more than one element in vectorIndex


The function works with a vector of texts but not with a tm corpus. Thanks in advance for any suggestions (including how to use this function with another text mining package, if it cannot be made to work with tm).

Recommended Answer


I see two problems here: 1) your custom function returns a list, while it should return a vector of strings; and 2) you are passing the wrong lemma_hashmap argument.
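
The error message comes from the second of these: when "lemma_hm" is passed as a string, the lemma_hashmap argument inside the function is a plain character vector rather than a hashmap, so lemma_hashmap[[tokens]] falls back to base R's [[, which refuses a multi-element index. A minimal sketch of the difference (the name x is just for illustration):

lemma_hm[[c("Abadanowi", "Abadanu")]]
# [1] "Abadan" "Abadan"   (hashmap's [[ looks up a whole vector of keys)

x <- "lemma_hm"
x[[c("Abadanowi", "Abadanu")]]
# Error in x[[c("Abadanowi", "Abadanu")]] :
#  attempt to select more than one element in vectorIndex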


A quick workaround to fix the first problem is to use paste() and sapply() before returning the function result.

lemma_tokenizer = function(x, lemma_hashmap,
                           tokenizer = text2vec::word_tokenizer) {
  tokens_list = tokenizer(x)
  for(i in seq_along(tokens_list)) {
    tokens = tokens_list[[i]]
    replacements = lemma_hashmap[[tokens]]
    ind = !is.na(replacements)
    tokens_list[[i]][ind] = replacements[ind]
  }

  # paste together, return a vector
  sapply(tokens_list, (function(i){paste(i, collapse = " ")}))
}


We can run the same example from your post.

texts = c("Abadanowi abadańczykach OutOfVocabulary",
          "abadańczyk Abadan OutOfVocabulary")
lemma_tokenizer(texts, lemma_hm)
[1] "Abadan abadańczyk OutOfVocabulary" "abadańczyk Abadan OutOfVocabulary"


Now, we can use tm_map. Just make sure to use lemma_hm (i.e., the variable) and not "lemma_hm" (a string) as the argument.

docs <- SimpleCorpus(VectorSource(texts))
out <- tm_map(docs, (function(x) {lemma_tokenizer(x, lemma_hashmap=lemma_hm)}))
out[[1]]$content
[1] "Abadan abadańczyk OutOfVocabulary"
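
With the corpus lemmatized, the document-term matrix from the question can be built with tm's default tokenizer, since each document is now a plain string of lemmas. A minimal sketch, reusing the wordLengths setting from the question:

docsTDM <- DocumentTermMatrix(out, control = list(wordLengths = c(4, 25)))
inspect(docsTDM)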

That concludes this article on a hash-dictionary lemmatization function that does not work with the tm package in R; hopefully the recommended answer helps.
