Ruby: reading a CSV file as UTF-8 and/or converting ASCII-8BIT encoding to UTF-8


Problem description

I'm using Ruby 1.9.2.

I'm trying to parse a CSV file that contains some French words (e.g. spécifié) and place the contents in a MySQL database.

When I read the lines from the CSV file,

file_contents = CSV.read("csvfile.csv", col_sep: "$")

the elements come back as Strings that are ASCII-8BIT encoded (spécifié becomes sp\xE9cifi\xE9), and strings like "spécifié" are then NOT properly saved into my MySQL database.

Yehuda Katz says that ASCII-8BIT is really "binary" data, meaning that CSV has no idea how to read the appropriate encoding.

So, if I try to make CSV force the encoding like this:

file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "UTF-8")

I get the following error:

ArgumentError: invalid byte sequence in UTF-8:

If I go back to my original ASCII-8BIT encoded Strings and examine the String that my CSV read as ASCII-8BIT, it looks like this: "Non sp\xE9cifi\xE9" instead of "Non spécifié".

I can't convert "Non sp\xE9cifi\xE9" to "Non spécifié" by doing this: "Non sp\xE9cifi\xE9".encode("UTF-8")

because I get this error:

Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8,

which Katz indicated would happen because ASCII-8BIT isn't really a proper String "encoding".

Questions:

Can I get CSV to read my file in the appropriate encoding? If so, how?
How do I convert an ASCII-8BIT string to UTF-8 for proper storage in MySQL?

Recommended answer

deceze is right, that is ISO8859-1 (AKA Latin-1) encoded text. Try this:

file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")
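If the goal is to end up with UTF-8 strings directly, one option that may help is the "external:internal" form of the encoding option, which asks Ruby to transcode while reading. The following is a minimal sketch, not the answerer's own code: it assumes a Ruby recent enough to have String#b and Tempfile.create, and the sample file contents and variable names are made up for illustration.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample data: "\xE9" is the single ISO-8859-1 byte for "é".
# String#b returns a binary (ASCII-8BIT) copy so the raw bytes are written as-is.
latin1_bytes = "id$label\n1$Non sp\xE9cifi\xE9\n".b

rows = Tempfile.create(['sample', '.csv']) do |file|
  file.binmode
  file.write(latin1_bytes)
  file.flush
  # "ISO-8859-1:UTF-8" means: the file on disk is Latin-1,
  # and the parsed fields should be transcoded to UTF-8.
  CSV.read(file.path, col_sep: '$', encoding: 'ISO-8859-1:UTF-8')
end

label = rows[1][1]
puts label           # prints "Non spécifié"
puts label.encoding  # prints "UTF-8"
```

With only encoding: "ISO8859-1", the fields come back tagged as ISO-8859-1 and still need an explicit .encode("UTF-8") before they are stored as UTF-8.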

And if that doesn't work, you can use Iconv to fix up the individual strings with something like this:

require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first

If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:

utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)

With newer Rubies, you can do things like this:

utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')

where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
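Since Iconv was removed from the standard library in Ruby 2.0, the force_encoding approach is likely the one to reach for on current Rubies. Here is a rough sketch of applying it to a whole array, as a counterpart to the Iconv array call above. The helper name latin1_to_utf8 and the sample strings are hypothetical, and the sketch assumes the bytes really are ISO-8859-1.

```ruby
# Hypothetical helper: relabel the bytes as Latin-1, then transcode to UTF-8.
def latin1_to_utf8(string)
  # force_encoding mutates its receiver, so work on a copy
  # to leave the caller's string untouched.
  string.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Sample binary (ASCII-8BIT) strings, as CSV returned them in the question.
binary_strings = ["Non sp\xE9cifi\xE9".b, "caf\xE9".b]

utf8_strings = binary_strings.map { |s| latin1_to_utf8(s) }

utf8_strings.each { |s| puts "#{s} (#{s.encoding})" }
# prints:
#   Non spécifié (UTF-8)
#   café (UTF-8)
```

force_encoding only changes the label on the bytes, while encode actually rewrites them. That is why calling encode alone on an ASCII-8BIT string fails: Ruby has no mapping from "binary" to UTF-8 for bytes above 0x7F.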
