Problem description
I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses utf-8 by default.
Here is an example of some offending text:
Cancer Res; 71(3); 1-11. ©2011 AACR.
That Copyright code expanded looks like this:
Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.
Ruby tells me that string is encoded as ASCII-8BIT and feeding into my Rails app gets me this:
incompatible character encodings: ASCII-8BIT and UTF-8
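A minimal sketch of how this error can arise, assuming the remote text arrives tagged as ASCII-8BIT and is then joined with a UTF-8 string that also contains non-ASCII characters (the variable names are made up for illustration):

```ruby
# Bytes for "©2011 AACR" in UTF-8, but tagged as binary (ASCII-8BIT),
# the way text read from a socket often arrives.
binary = "\xC2\xA92011 AACR".dup.force_encoding(Encoding::ASCII_8BIT)

# A UTF-8 string from the app side that also contains non-ASCII characters.
utf8 = "r\u00E9sum\u00E9: "

# Joining the two fails because Ruby cannot pick a common encoding.
error = begin
  utf8 + binary
  nil
rescue Encoding::CompatibilityError => e
  e
end

puts error.message if error
```

If either string contained only 7-bit ASCII characters, the concatenation would succeed, which is why the problem only shows up on text with symbols such as ©.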
I can strip the copyright code out using this regex
str.gsub(/[^\x00-\x7F]/n,'?')
which produces this:
Cancer Res; 71(3); 1-11. ??2011 AACR.
But how can I get a copyright symbol (and various other symbols such as Greek letters) converted into the same symbols in UTF-8? Surely it is possible...
I see references to using force_encoding but this does not work:
str.force_encoding('utf-8').encode
I realize there are many other people with similar issues but I've yet to see a solution that works.
Recommended answer
This works for me:
#encoding: ASCII-8BIT
str = "\xC2\xA92011 AACR"
p str, str.encoding
#=> "\xC2\xA92011 AACR"
#=> #<Encoding:ASCII-8BIT>
str.force_encoding('UTF-8')
p str, str.encoding
#=> "©2011 AACR"
#=> #<Encoding:UTF-8>
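
The snippet above works because the bytes are already valid UTF-8, so force_encoding only has to relabel them. As a rough sketch of a more defensive variant, the hypothetical helper below (to_utf8 is not part of the original answer) relabels a copy, checks it with valid_encoding?, and otherwise transcodes from a fallback encoding. The ISO-8859-1 fallback is an assumption about where the remote text comes from, not something the question confirms.

```ruby
# Hypothetical helper: return a UTF-8 copy of raw without modifying raw.
# Assumption: bytes that are not valid UTF-8 are in `fallback` (ISO-8859-1 here).
def to_utf8(raw, fallback = Encoding::ISO_8859_1)
  # force_encoding only changes the label; the bytes stay the same.
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # encode actually transcodes the bytes from the fallback encoding to UTF-8.
  raw.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

# "©" as two UTF-8 bytes, tagged ASCII-8BIT as in the question.
utf8_bytes = "\xC2\xA92011 AACR".dup.force_encoding(Encoding::ASCII_8BIT)

# "©" as a single ISO-8859-1 byte, also tagged ASCII-8BIT.
latin1_bytes = "\xA92011 AACR".dup.force_encoding(Encoding::ASCII_8BIT)

p to_utf8(utf8_bytes).encoding
p to_utf8(latin1_bytes).encoding
```

Both calls should yield the same UTF-8 string even though the input bytes differ. Working on a copy leaves the original string tagged as ASCII-8BIT, which may matter if other code still expects raw bytes.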