Converting non-ASCII characters from ASCII-8BIT to UTF-8

Question

I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses UTF-8 by default.

Here is an example of some offending text:

Cancer Res; 71(3); 1-11. ©2011 AACR.

With the copyright symbol written out as escaped bytes, it looks like this:

Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.

Ruby tells me the string is encoded as ASCII-8BIT, and feeding it into my Rails app gets me this:

incompatible character encodings: ASCII-8BIT and UTF-8
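For context, here is a minimal sketch of how this error can arise. It assumes the failure comes from combining a binary-tagged string with a UTF-8 string when both contain non-ASCII bytes; the variable names are illustrative and not taken from the app in question.

```ruby
# Bytes as they might arrive from a remote site, tagged as binary (ASCII-8BIT).
raw = "\xC2\xA92011 AACR".dup.force_encoding(Encoding::BINARY)

# A UTF-8 string that also contains non-ASCII characters (illustrative).
utf8 = "R\u00E9sum\u00E9: "

message = nil
begin
  # Ruby refuses to join two non-ASCII strings with different encodings.
  utf8 + raw
rescue Encoding::CompatibilityError => e
  message = e.message
end

puts message
```

If either string held only ASCII characters, the concatenation would succeed, which is why the problem tends to show up only for text containing symbols such as ©.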

I can strip the copyright bytes out using this regex:

str.gsub(/[^\x00-\x7F]/n,'?')

which produces this:

Cancer Res; 71(3); 1-11. ??2011 AACR.

But how can I get a copyright symbol (and various other symbols, such as Greek letters) converted into the same symbols in UTF-8? Surely it is possible...

I see references to using force_encoding, but this does not work:

str.force_encoding('utf-8').encode
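As background for the call above, the sketch below illustrates the general difference between the two methods: force_encoding only changes the encoding label on the same bytes, while encode with a target encoding rewrites the bytes. The variable names are illustrative, and the sketch does not claim to explain why the call failed in the original app.

```ruby
# Two bytes that form the UTF-8 encoding of the copyright sign, tagged as binary.
raw = "\xC2\xA9".dup.force_encoding(Encoding::BINARY)

# force_encoding relabels a copy; the bytes themselves are untouched.
relabeled = raw.dup.force_encoding(Encoding::UTF_8)

puts raw.length        # counted as two single-byte characters
puts relabeled.length  # counted as one two-byte character

# encode transcodes: the same character gets a different byte sequence.
latin1 = relabeled.encode(Encoding::ISO_8859_1)
p latin1.bytes
```

Note that force_encoding modifies its receiver in place, which is why the sketch works on copies made with dup.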

I realize many other people have similar issues, but I've yet to see a solution that works.

Recommended answer

This works for me:

#encoding: ASCII-8BIT
str = "\xC2\xA92011 AACR"
p str, str.encoding
#=> "\xC2\xA92011 AACR"
#=> #<Encoding:ASCII-8BIT>

str.force_encoding('UTF-8')
p str, str.encoding
#=> "©2011 AACR"
#=> #<Encoding:UTF-8>
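Relabeling only gives a usable string when the bytes really are UTF-8. A minimal sketch of a more defensive variant follows, assuming that some remote sites may send a single-byte encoding instead. The helper name to_utf8 and the ISO-8859-1 fallback are hypothetical choices for illustration, not part of the answer above.

```ruby
# Hypothetical helper: reinterpret binary bytes as UTF-8, with a fallback.
# The ISO-8859-1 default is an assumption; use whatever the source really sends.
def to_utf8(bytes, fallback = Encoding::ISO_8859_1)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # The bytes are not valid UTF-8: assume the fallback encoding and transcode.
  bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

utf8_bytes   = "\xC2\xA92011 AACR".dup.force_encoding(Encoding::BINARY)
latin1_bytes = "\xA92011 AACR".dup.force_encoding(Encoding::BINARY)

a = to_utf8(utf8_bytes)
b = to_utf8(latin1_bytes)

p a.encoding, a.valid_encoding?
p b.encoding, b.valid_encoding?
```

The valid_encoding? check matters because force_encoding never raises on bad input; it will happily label invalid bytes as UTF-8, and the error only surfaces later.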

