Problem description
I'm using Ruby 1.9.2.
I'm trying to parse a CSV file that contains some French words (e.g. spécifié) and place the contents in a MySQL database.
When I read the lines from the CSV file,
file_contents = CSV.read("csvfile.csv", col_sep: "$")
the elements come back as Strings that are ASCII-8BIT encoded (spécifié becomes sp\xE9cifi\xE9), and strings like "spécifié" are then not saved properly into my MySQL database.
Yehuda Katz says that ASCII-8BIT is really "binary" data, meaning that CSV has no idea which encoding to use when reading the file.
So, if I try to make CSV force the encoding like this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "UTF-8")
I get the following error:
ArgumentError: invalid byte sequence in UTF-8
If I go back to my original ASCII-8BIT encoded Strings and examine a String that CSV read as ASCII-8BIT, it looks like "Non sp\xE9cifi\xE9" instead of "Non spécifié".
I can't convert "Non sp\xE9cifi\xE9" to "Non spécifié" by doing this:
"Non sp\xE9cifi\xE9".encode("UTF-8")
because I get this error:
Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8
which Katz indicated would happen because ASCII-8BIT isn't really a proper String "encoding".
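The failure can be reproduced without the CSV file. The following is a minimal sketch, assuming a Ruby version that has String#b (2.0 or later); the variable names are illustrative, and the byte \xE9 stands in for what CSV returned.

```ruby
# Build a binary (ASCII-8BIT) string containing the Latin-1 byte for "é".
binary_string = "Non sp\xE9cifi\xE9".b

# Transcoding straight from ASCII-8BIT to UTF-8 fails, because Ruby has no
# mapping for bytes above 0x7F in a "binary" string.
error_class = begin
  binary_string.encode("UTF-8")
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end

puts binary_string.encoding
puts error_class.inspect
```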
Questions:
- Can I get CSV to read my file in the appropriate encoding? If so, how?
- How do I convert an ASCII-8BIT string to UTF-8 so that it is stored properly in MySQL?
Recommended answer
deceze is right, that is ISO8859-1 (AKA Latin-1) encoded text. Try this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")
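If the goal is to end up with UTF-8 strings directly, one option is to give CSV both the external and the internal encoding in the form "external:internal". The sketch below is not part of the original answer: it assumes a reasonably recent Ruby, and it writes a small temporary file with made-up contents so that it runs on its own.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample data: two "$"-separated rows stored as Latin-1 bytes.
latin1_bytes = "id$label\n1$Non sp\xE9cifi\xE9\n".b

rows = Tempfile.create(['sample', '.csv']) do |file|
  file.binmode
  file.write(latin1_bytes)
  file.flush
  # "ISO-8859-1:UTF-8" reads the file as ISO-8859-1 and transcodes to UTF-8.
  CSV.read(file.path, col_sep: "$", encoding: "ISO-8859-1:UTF-8")
end

label = rows[1][1]
puts label
puts label.encoding
```

With this form the fields should already be UTF-8 when they reach the database layer, so no per-string conversion is needed afterwards.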
And if that doesn't work, you can use Iconv to fix up the individual strings with something like this:
require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first
If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:
utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)
With newer Rubies, you can do things like this:
utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')
where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
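The two steps do different things: force_encoding only relabels the existing bytes, while encode actually rewrites them. Here is a small sketch of that, assuming Ruby 2.0 or later for String#b; the dup call is an addition so that the original string is left untouched.

```ruby
# Simulate a string that Ruby has tagged as ASCII-8BIT (binary) although its
# bytes are really ISO-8859-1.
latin1_string = "Non sp\xE9cifi\xE9".b

# force_encoding changes the receiver in place, so work on a copy.
relabelled = latin1_string.dup.force_encoding('ISO-8859-1')

# encode transcodes the bytes: each \xE9 becomes the two-byte UTF-8 sequence.
utf8_string = relabelled.encode('UTF-8')

puts utf8_string
puts utf8_string.encoding
puts latin1_string.bytesize
puts utf8_string.bytesize
```

The byte size grows from 12 to 14 because each "é" occupies one byte in ISO-8859-1 and two bytes in UTF-8.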