SmarterCSV and file encoding issues in Ruby


Problem description

I'm working with a file that appears to have UTF-16LE encoding. If I run

File.read(file, :encoding => 'utf-16le')

the first line of the file is:

"<U+FEFF>=\"25/09/2013\"\t18:39:17\t=\"Unknown\"\t=\"+15168608203\"\t\"Message.\"\r\n"

If I use

csv_text = File.read(file, :encoding => 'utf-16le')

I get an error stating

ASCII incompatible encoding needs binmode (ArgumentError)

If I switch the encoding in the above to

csv_text = File.read(file, :encoding => 'utf-8')

I make it to the SmarterCSV section of the code, but get an error that states

`=~': invalid byte sequence in UTF-8 (ArgumentError)

The full code is below. If I run this in the Rails console, it works just fine, but if I run it using ruby test.rb, it gives me the first error:

require 'smarter_csv'
headers = ["date_of_message", "timestamp_of_message", "sender", "phone_number", "message"]
path = '/path/'
Dir.glob("#{path}*.CSV").each do |file|
  csv_text = File.read(file, :encoding => 'utf-16le')
  File.open('/tmp/tmp_file', 'w') { |tmp_file| tmp_file.write(csv_text) }
  puts 'made it here'
  SmarterCSV.process('/tmp/tmp_file', {
    :col_sep => "\t",
    :force_simple_split => true,
    :headers_in_file => false,
    :user_provided_headers => headers
   }).each do |row|
    converted_row = {}
    converted_row[:date_of_message] = row[:date_of_message][2..-2].to_date
    converted_row[:timestamp] = row[:timestamp]
    converted_row[:sender] = row[:sender][2..-2]
    converted_row[:phone_number] = row[:phone_number][2..-2]
    converted_row[:message] = row[:message][1..-2]
    converted_row[:room] = file.gsub(path, '')
  end
end

Update - 15/5/13

Ultimately, I decided to encode the file string as UTF-8 rather than diving deeper into the SmarterCSV code. The first problem in the SmarterCSV code is that it does not allow a user to specify binary mode when reading in a file. After I adjusted the source to handle that, a myriad of other encoding-related issues popped up, many of which related to the handling of various parameters on files that were not UTF-8 encoded. It may have been the easy way out, but encoding everything as UTF-8 before feeding it into SmarterCSV solved my issue.

Recommended answer

Add binmode to the File.read call.

File.read(file, :encoding => 'utf-16le', mode: "rb")

ref: http://ruby-doc.org/core-2.0.0/IO.html#method-c-read
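To illustrate what binmode changes, here is a minimal sketch. The helper names read_utf16le and text_mode_error, the file name sample.CSV, and the sample line are all made up for the demonstration. As far as I understand, a text-mode read rejects an ASCII-incompatible external encoding such as UTF-16LE unless an internal encoding is in effect, which may be why the Rails console (where Encoding.default_internal is typically UTF-8) behaves differently from a plain ruby run.

```ruby
require 'tmpdir'

# Binary-mode read: no newline or encoding conversion is applied on the way in,
# so an ASCII-incompatible external encoding such as UTF-16LE is accepted.
# (read_utf16le is a hypothetical helper name.)
def read_utf16le(path)
  File.read(path, mode: 'rb', encoding: 'utf-16le')
end

# Text-mode read, for comparison. Returns the error message, or nil if the read
# succeeds (assumption: that can happen when Encoding.default_internal is set).
def text_mode_error(path)
  File.read(path, encoding: 'utf-16le')
  nil
rescue ArgumentError => e
  e.message
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'sample.CSV') # hypothetical file name
  # Write a small UTF-16LE sample with a leading BOM, byte for byte.
  File.binwrite(path, "\u{FEFF}=\"25/09/2013\"\t18:39:17\r\n".encode('UTF-16LE'))

  puts text_mode_error(path).inspect   # error message, or nil
  text = read_utf16le(path)
  puts text.encoding                   # encoding tag of the string read
  puts text.encode('UTF-8').inspect    # same content, transcoded for display
end
```

Note that the string read this way still starts with the byte order mark and is still UTF-16LE; nothing has been converted yet.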

Now pass the correct encoding to SmarterCSV

SmarterCSV.process('/tmp/tmp_file', {
:file_encoding => "utf-16le", ...

Update

It was found that SmarterCSV does not support binary mode. After the OP attempted to modify the code with no success, it was decided that the simple solution was to convert the input to UTF-8, which SmarterCSV supports.
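The conversion step could look roughly like the sketch below. It is not the OP's actual code: utf16le_to_utf8 and convert_to_utf8_file are hypothetical helper names, and the paths in the usage comment are placeholders. It assumes the input really is UTF-16LE and strips a leading byte order mark if one is present.

```ruby
# Read a UTF-16LE file as raw bytes and return its contents as a UTF-8 string.
# (Hypothetical helper, not part of SmarterCSV.)
def utf16le_to_utf8(path)
  raw  = File.binread(path)                       # raw bytes, tagged ASCII-8BIT
  text = raw.force_encoding(Encoding::UTF_16LE)   # relabel only, bytes unchanged
            .encode(Encoding::UTF_8)              # actual transcoding to UTF-8
  text.sub(/\A\u{FEFF}/, '')                      # drop the leading BOM, if any
end

# Write a UTF-8 copy of source_path to dest_path and return dest_path.
def convert_to_utf8_file(source_path, dest_path)
  File.binwrite(dest_path, utf16le_to_utf8(source_path))
  dest_path
end

# Usage (placeholder paths; SmarterCSV options taken from the question):
#   convert_to_utf8_file('/path/input.CSV', '/tmp/tmp_file')
#   SmarterCSV.process('/tmp/tmp_file', :col_sep => "\t",
#                      :headers_in_file => false,
#                      :user_provided_headers => headers)
```

Because the temporary file is plain UTF-8 without a BOM, no encoding option needs to be passed to the CSV reader afterwards.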
