以正确的编码将带有希腊字符的

以正确的编码将带有希腊字符的

本文介绍了以正确的编码将带有希腊字符的 Excel 文件导入 R的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在导入以下文件时遇到了一些问题:http://www.kuleuven.be/bio/ento/temp/test.xlsx以正确的编码进入 R.尤其是

I am having some trouble importing the following file:http://www.kuleuven.be/bio/ento/temp/test.xlsxinto R in the correct encoding.In particular,

library("xlsx")
read.xlsx("test.xlsx",1,header=F,colClasses=c("character"),encoding="UTF-8")

给我

                                             X1
1                                     a-cadinol
2                                  a-calacorene
3                       a-caryophyllene alcohol
4                                   a-curcumene
5                                      a-elemol
6                                   a-muurolene
7                           a-terpineol acetate
8  ß-4-dimethyl-3-cyclohexane-1-ethanol acetate
9                                  ß-bisabolene
10                                  ß-bisabolol
11                                 ß-bourbonene
12                      ß-caryophyllene alcohol
13                                ß-cyclocitral
14                                   ß-farnesol
15                                   ß-selinene
16                         ß-sesquiphellandrene
17                            <U+03B3>-cadinene
18  <U+03B3>-Carboethoxy-<U+03B3>-butyrolactone
19        <U+03B3>-ethyl-<U+03B3>-butyrolactone
20                            <U+03B3>-eudesmol
21                           <U+03B3>-muurolene
22                         <U+03B3>-nonalactone
23                         <U+03B3>-octalactone
24                            <U+03B3>-selinene
25                       <U+03B3>-undecalactone
26                                   d-cadinene
27                                    d-cadinol
28                                  d-muurolene
29                              d-undecalactone

但是a--d-应该是alpha-, gamma-delta-

but the a-, <U+03B3>- and d- should be alpha-, gamma- and delta-

对如何以正确的编码导入文件有任何想法吗?

Any thoughts on how I could import my file in the correct encoding?

我在 Windows 上工作,iconvlist() 给了我

I am working on Windows, and iconvlist() gives me

  [1] "437"                     "850"                     "852"                     "855"                     "857"
  [6] "860"                     "861"                     "862"                     "863"                     "865"
 [11] "866"                     "869"                     "ANSI_X3.4-1968"          "ANSI_X3.4-1986"          "ASCII"
 [16] "ASMO-708"                "BIG-5"                   "BIG-FIVE"                "big5"                    "BIG5"
 [21] "big5-hkscs"              "BIG5-HKSCS"              "big5hkscs"               "BIG5HKSCS"               "CP-GR"
 [26] "CP-IS"                   "cp1025"                  "CP1125"                  "CP1133"                  "CP1200"
 [31] "CP12000"                 "CP12001"                 "CP1201"                  "CP1250"                  "CP1251"
 [36] "CP1252"                  "CP1253"                  "CP1254"                  "CP1255"                  "CP1256"
 [41] "CP1257"                  "CP1258"                  "CP1361"                  "CP154"                   "CP367"
 [46] "CP437"                   "CP50221"                 "CP51932"                 "CP65001"                 "CP737"
 [51] "CP775"                   "CP819"                   "CP850"                   "CP852"                   "CP853"
 [56] "CP855"                   "CP857"                   "CP858"                   "CP860"                   "CP861"
 [61] "CP862"                   "CP863"                   "CP864"                   "CP865"                   "cp866"
 [66] "CP866"                   "CP869"                   "CP874"                   "cp875"                   "CP932"
 [71] "CP936"                   "CP949"                   "CP950"                   "CSASCII"                 "CSIBM855"
 [76] "CSIBM857"                "CSIBM860"                "CSIBM861"                "CSIBM863"                "CSIBM864"
 [81] "CSIBM865"                "CSIBM866"                "CSIBM869"                "csISO2022JP"             "CSISOLATIN1"
 [86] "CSPC775BALTIC"           "CSPC850MULTILINGUAL"     "CSPC862LATINHEBREW"      "CSPC8CODEPAGE437"        "CSPCP852"
 [91] "CSPTCP154"               "CSWINDOWS31J"            "CYRILLIC-ASIAN"          "DOS-720"                 "DOS-862"
 [96] "EUC-CN"                  "euc-jp"                  "euc-kr"                  "EUC-KR"                  "EUCCN"
[101] "eucjp"                   "euckr"                   "GB18030"                 "gb2312"                  "GBK"
[106] "hz-gb-2312"              "IBM-CP1133"              "IBM-Thai"                "IBM00858"                "IBM00924"
[111] "IBM01047"                "IBM01140"                "IBM01141"                "IBM01142"                "IBM01143"
[116] "IBM01144"                "IBM01145"                "IBM01146"                "IBM01147"                "IBM01148"
[121] "IBM01149"                "IBM037"                  "IBM1026"                 "IBM273"                  "IBM277"
[126] "IBM278"                  "IBM280"                  "IBM284"                  "IBM285"                  "IBM290"
[131] "IBM297"                  "IBM367"                  "IBM420"                  "IBM423"                  "IBM424"
[136] "IBM437"                  "IBM437"                  "IBM500"                  "ibm737"                  "ibm775"
[141] "IBM775"                  "IBM819"                  "ibm850"                  "IBM850"                  "ibm852"
[146] "IBM852"                  "IBM855"                  "IBM855"                  "ibm857"                  "IBM857"
[151] "IBM860"                  "IBM860"                  "ibm861"                  "IBM861"                  "IBM862"
[156] "IBM863"                  "IBM863"                  "IBM864"                  "IBM864"                  "IBM865"
[161] "IBM865"                  "IBM866"                  "ibm869"                  "IBM869"                  "IBM870"
[166] "IBM871"                  "IBM880"                  "IBM905"                  "iso-2022-jp"             "iso-2022-jp"
[171] "ISO-2022-JP"             "ISO-2022-JP-MS"          "iso-2022-kr"             "ISO-8859-1"              "iso-8859-13"
[176] "iso-8859-15"             "iso-8859-2"              "iso-8859-3"              "iso-8859-4"              "iso-8859-5"
[181] "iso-8859-6"              "iso-8859-7"              "iso-8859-8"              "iso-8859-8-i"            "iso-8859-9"
[186] "ISO-IR-100"              "ISO-IR-6"                "ISO_646.IRV:1991"        "ISO_8859-1"              "ISO_8859-1:1987"
[191] "ISO2022-JP"              "ISO2022-JP-MS"           "iso2022-kr"              "ISO646-US"               "iso8859-1"
[196] "ISO8859-1"               "iso8859-13"              "iso8859-15"              "iso8859-2"               "iso8859-3"
[201] "iso8859-4"               "iso8859-5"               "iso8859-6"               "iso8859-7"               "iso8859-8"
[206] "iso8859-8-i"             "iso8859-9"               "Johab"                   "JOHAB"                   "koi8-r"
[211] "koi8-u"                  "ks_c_5601-1987"          "L1"                      "latin-9"                 "LATIN1"
[216] "latin2"                  "latin3"                  "latin4"                  "latin5"                  "latin7"
[221] "latin9"                  "mac"                     "mac-centraleurope"       "mac-is"                  "macarabic"
[226] "maccentraleurope"        "maccroatian"             "maccyrillic"             "macgreek"                "machebrew"
[231] "maciceland"              "macintosh"               "macis"                   "macroman"                "macromania"
[236] "macthai"                 "macturkish"              "macukraine"              "macukrainian"            "MS-ANSI"
[241] "MS-ARAB"                 "MS-CYRL"                 "MS-EE"                   "MS-GREEK"                "MS-HEBR"
[246] "MS-TURK"                 "MS50221"                 "MS51932"                 "MS932"                   "MS936"
[251] "PT154"                   "PTCP154"                 "SHIFFT_JIS"              "SHIFFT_JIS-MS"           "shift-jis"
[256] "shift_jis"               "SJIS"                    "SJIS-MS"                 "SJIS-OPEN"               "SJIS-WIN"
[261] "UCS-2"                   "UCS-2BE"                 "UCS-2LE"                 "UCS-4"                   "UCS-4BE"
[266] "UCS-4BE"                 "UCS-4LE"                 "UCS-4LE"                 "UCS2"                    "UCS2BE"
[271] "UCS2LE"                  "UCS4"                    "UCS4BE"                  "UCS4LE"                  "UHC"
[276] "unicodeFFFE"             "US"                      "US-ASCII"                "UTF-16"                  "UTF-16BE"
[281] "UTF-16LE"                "UTF-32"                  "UTF-32BE"                "UTF-32LE"                "UTF-8"
[286] "UTF16"                   "UTF16BE"                 "UTF16LE"                 "UTF32"                   "UTF32BE"
[291] "UTF32LE"                 "UTF8"                    "WINBALTRIM"              "windows-1250"            "windows-1251"
[296] "windows-1252"            "windows-1253"            "windows-1254"            "windows-1255"            "windows-1256"
[301] "windows-1257"            "windows-1258"            "WINDOWS-31J"             "WINDOWS-50221"           "WINDOWS-51932"
[306] "windows-874"             "WINDOWS-932"             "WINDOWS-936"             "x-Chinese_CNS"           "x-cp20001"
[311] "x-cp20003"               "x-cp20004"               "x-cp20005"               "x-cp20261"               "x-cp20269"
[316] "x-cp20936"               "x-cp20949"               "x-cp50227"               "x-EBCDIC-KoreanExtended" "x-Europa"
[321] "x-IA5"                   "x-IA5-German"            "x-IA5-Norwegian"         "x-IA5-Swedish"           "x-iscii-as"
[326] "x-iscii-be"              "x-iscii-de"              "x-iscii-gu"              "x-iscii-ka"              "x-iscii-ma"
[331] "x-iscii-or"              "x-iscii-pa"              "x-iscii-ta"              "x-iscii-te"              "x-mac-arabic"
[336] "x-mac-ce"                "x-mac-chinesesimp"       "x-mac-chinesetrad"       "x-mac-croatian"          "x-mac-cyrillic"
[341] "x-mac-greek"             "x-mac-hebrew"            "x-mac-icelandic"         "x-mac-japanese"          "x-mac-korean"
[346] "x-mac-romanian"          "x-mac-thai"              "x-mac-turkish"           "x-mac-ukrainian"         "x_Chinese-Eten"

我尝试了其中的许多,但无济于事...不幸的是,我也不知道 Excel 将我的文件保存在什么编码中...

I tried with many of these, to no avail... Unfortunately, I also don't know what encoding Excel saved my file in...

此外,R 中是否有任何简单的函数可以让我将所有希腊字母、beta、gamma 和 delta(原始编码)转换为alpha"、beta"、gamma"和delta"(即完整写出来)?或者反过来,即将完整写出的alpha"、beta"、gamma"等转换为单个希腊字符?

Also, is there any easy function in R that would allow me to convert all Greek alpha, beta, gamma and delta's (in original encoding) to "alpha", "beta", "gamma" and "delta" (ie written out in full)?Or to do the reverse, ie convert "alpha", "beta", "gamma" etc written out in full to single Greek characters?

关于我试过的最后一个问题

regarding my last question I tried

togreek=function(compname) {
  n=as.character(compname,encoding="UTF-8")
  n=gsub("alpha","u03B1",n)
  n=gsub("beta","u03B2",n)
  n=gsub("gamma","u03B3",n)
  n=gsub("delta","u03B4",n)
  n=gsub("epsilon","u03B5",n)
  n
}

tolatin=function(compname) {
  n=as.character(compname,encoding="UTF-8")
  n=gsub("u03B1","alpha",n)
  n=gsub("u03B2","beta",n)
  n=gsub("u03B3","gamma",n)
  n=gsub("u03B4","delta",n)
  n=gsub("u03B5","epsilon",n)
  n
}

函数 tolatin 似乎工作:

The function tolatin seems to work:

library("xlsx")
test=read.xlsx("test.xlsx",1,header=F,colClasses=c("character"),encoding="UTF-8")
tolatin(test$X1)
 [1] "alpha-cadinol"                                   "alpha-calacorene"                                "alpha-caryophyllene alcohol"
 [4] "alpha-curcumene"                                 "alpha-elemol"                                    "alpha-muurolene"
 [7] "alpha-terpineol acetate"                         "beta-4-dimethyl-3-cyclohexane-1-ethanol acetate" "beta-bisabolene"
[10] "beta-bisabolol"                                  "beta-bourbonene"                                 "beta-caryophyllene alcohol"
[13] "beta-cyclocitral"                                "beta-farnesol"                                   "beta-selinene"
[16] "beta-sesquiphellandrene"                         "gamma-cadinene"                                  "gamma-Carboethoxy-gamma-butyrolactone"
[19] "gamma-ethyl-gamma-butyrolactone"                 "gamma-eudesmol"                                  "gamma-muurolene"
[22] "gamma-nonalactone"                               "gamma-octalactone"                               "gamma-selinene"
[25] "gamma-undecalactone"                             "delta-cadinene"                                  "delta-cadinol"
[28] "delta-muurolene"                                 "delta-undecalactone"

但是如果我再转换回希腊字符,我又会遇到问题:

But if I then convert back to Greek characters I again run into problems:

togreek(tolatin(test$X1))

 [1] "α-cadinol"                                   "α-calacorene"                                "α-caryophyllene alcohol"
 [4] "α-curcumene"                                 "α-elemol"                                    "α-muurolene"
 [7] "α-terpineol acetate"                         "ß-4-dimethyl-3-cyclohexane-1-ethanol acetate" "ß-bisabolene"
[10] "ß-bisabolol"                                  "ß-bourbonene"                                 "ß-caryophyllene alcohol"
[13] "ß-cyclocitral"                                "ß-farnesol"                                   "ß-selinene"
[16] "ß-sesquiphellandrene"                         "<U+03B3>-cadinene"                            "<U+03B3>-Carboethoxy-<U+03B3>-butyrolactone"
[19] "<U+03B3>-ethyl-<U+03B3>-butyrolactone"        "<U+03B3>-eudesmol"                            "<U+03B3>-muurolene"
[22] "<U+03B3>-nonalactone"                         "<U+03B3>-octalactone"                         "<U+03B3>-selinene"
[25] "<U+03B3>-undecalactone"                       "d-cadinene"                                   "d-cadinol"
[28] "d-muurolene"                                  "d-undecalactone"

有什么想法我做错了吗?

Any thoughts what I am doing wrong?

推荐答案

试试这个:
Sys.setlocale(category = "LC_ALL", locale = "Greek")

这篇关于以正确的编码将带有希腊字符的 Excel 文件导入 R的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

07-25 19:17