我开始在R中使用Stanford CoreNLP软件包,以便用西班牙语进行一些文本分析。因此,我尝试以下方法:

R

R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> install.packages("coreNLP")
Installing package into ‘/home/ach/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'https://cran.rediris.es/src/contrib/coreNLP_0.4-1.tar.gz'
Content type 'application/x-gzip' length 17392 bytes (16 KB)
==================================================
downloaded 16 KB

* installing *source* package ‘coreNLP’ ...
** package ‘coreNLP’ successfully unpacked and MD5 sums checked
** R
** data
*** moving datasets to lazyload DB
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (coreNLP)

The downloaded source packages are in
    ‘/tmp/RtmpO3q77z/downloaded_packages’
> library(coreNLP)
> downloadCoreNLP(type="base")
trying URL 'http://nlp.stanford.edu/software//stanford-corenlp-full-2015-04-20.zip'
Content type 'application/zip' length 360824440 bytes (344.1 MB)
==================================================
downloaded 344.1 MB

[1] 0
>
> downloadCoreNLP(type="spanish")
trying URL 'http://nlp.stanford.edu/software//stanford-spanish-corenlp-2015-01-08-models.jar'
Content type 'application/x-java-archive' length 25007256 bytes (23.8 MB)
==================================================
downloaded 23.8 MB

> initCoreNLP()
Searching for resource: config.properties
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [3.5 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [1.2 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [2.3 sec].
Initializing JollyDayHoliday for SUTime from classpath: edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.4 sec].
Adding annotator dcoref
Adding annotator sentiment
> > sInes <- "Hola padre. Acabo de llegar a casa. Tengo ganas de cenar"
> annotation <- annotateString(sInes)
> token <- getToken(annotation)
> token[token$sentence==2,c(1:4,7)]
  sentence id  token  lemma POS
4        2  1  Acabo  Acabo NNP
5        2  2     de     de NNP
6        2  3 llegar llegar NNP
7        2  4      a      a  DT
8        2  5   casa   casa  FW
9        2  6      .      .   .


一切似乎都工作正常(据我所知,看不到任何错误),但是它不起作用。例如,“ casa”被标记为不正确的外来词(FW)。

那么,有人对此有任何想法吗?

非常感谢

奥古斯丁

最佳答案

您不仅需要下载西班牙语,还需要将令牌生成器设置为西班牙语:

props.setProperty("tokenize.language", "es");

07-24 09:45