Converting non-ASCII characters from ASCII-8BIT to UTF-8


Problem description

I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses UTF-8 by default.

Here is an example of some offending text:

Cancer Res; 71(3); 1-11. ©2011 AACR.

With the copyright symbol expanded into its bytes, it looks like this:

Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.
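As a rough illustration of where those two bytes come from, the sketch below prints the UTF-8 byte sequence of U+00A9 (the copyright sign). The variable names are only for illustration and are not part of the original question.

```ruby
# U+00A9 COPYRIGHT SIGN written with a \u escape, so the string is UTF-8.
copyright = "\u00A9"

# Each byte formatted as two upper-case hex digits.
hex_bytes = copyright.bytes.map { |b| format('%02X', b) }
puts hex_bytes.join(' ')   # C2 A9

# The same bytes relabelled as ASCII-8BIT are shown as raw escapes.
puts copyright.dup.force_encoding(Encoding::ASCII_8BIT).inspect
```

So the two escaped bytes in the expanded text are simply the UTF-8 encoding of the one copyright character.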

Ruby tells me that the string is encoded as ASCII-8BIT, and feeding it into my Rails app gives me this:

incompatible character encodings: ASCII-8BIT and UTF-8
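A minimal sketch of how this error can arise, assuming the remote text arrives tagged as ASCII-8BIT and is joined with UTF-8 text that also contains a non-ASCII character. The `local` string is a hypothetical stand-in for whatever the Rails app concatenates it with. The exact wording of the message may differ between Ruby versions.

```ruby
# Bytes as they might arrive from a remote site, tagged ASCII-8BIT (binary).
remote = "Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.".dup.force_encoding(Encoding::ASCII_8BIT)

# Hypothetical UTF-8 text containing a non-ASCII character (e with acute accent).
local = "R\u00E9sum\u00E9: "

# Joining two strings that both hold non-ASCII bytes but carry different
# encoding labels raises Encoding::CompatibilityError.
error_message = begin
  local + remote
  nil
rescue Encoding::CompatibilityError => e
  e.message
end

puts remote.encoding
puts error_message
```

If either string were pure ASCII, Ruby would treat the two as compatible and the concatenation would succeed, which is why the problem only shows up for text containing symbols such as the copyright sign.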

I can strip the copyright symbol out using this regex:

str.gsub(/[^\x00-\x7F]/n, '?')

which produces this:

Cancer Res; 71(3); 1-11. ??2011 AACR.
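The substitution can be sketched as follows, assuming the pattern is meant to match every byte outside the ASCII range, that is `/[^\x00-\x7F]/n`. The sample string is the one from the question, and the variable names are illustrative.

```ruby
# The sample text, tagged as ASCII-8BIT as described in the question.
raw = "Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.".dup.force_encoding(Encoding::ASCII_8BIT)

# The /n flag makes the pattern byte-oriented, so each byte outside
# 0x00-0x7F is replaced on its own.
stripped = raw.gsub(/[^\x00-\x7F]/n, '?')

puts stripped   # Cancer Res; 71(3); 1-11. ??2011 AACR.
```

Two question marks appear because the copyright sign occupies two bytes in UTF-8, and each byte is replaced separately.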

But how can I get a copyright symbol (and various other symbols, such as Greek letters) converted into the same symbols in UTF-8? Surely it is possible...

I see references to using force_encoding, but this does not work:

str.force_encoding('utf-8').encode
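For reference, here is a small sketch of what `force_encoding` and `encode` each do, using only the bytes quoted above. It is not a diagnosis of the failure described here: the real remote data may be in some other encoding, and the ISO-8859-1 case below is purely a hypothetical contrast.

```ruby
raw = "Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.".dup.force_encoding(Encoding::ASCII_8BIT)

# force_encoding only changes the label; the bytes stay exactly the same.
relabelled = raw.dup.force_encoding(Encoding::UTF_8)
puts relabelled.bytes.to_a == raw.bytes.to_a   # true
puts relabelled.valid_encoding?                # true, since C2 A9 is valid UTF-8
puts relabelled.include?("\u00A9")             # true, one copyright character

# Hypothetical contrast: if the source were ISO-8859-1, the copyright sign
# would be the single byte A9, and the bytes must be transcoded, not relabelled.
latin1 = "\xA92011 AACR.".dup.force_encoding(Encoding::ISO_8859_1)
utf8 = latin1.encode(Encoding::UTF_8)
puts utf8.bytesize - latin1.bytesize           # 1, the sign grew from one byte to two
```

Checking `valid_encoding?` after relabelling is a cheap way to tell whether the bytes really are UTF-8 or whether the text needs transcoding from its true source encoding.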

I realize there are many other people with similar issues, but I've yet to see a solution that works.


Recommended answer