mysql - invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd

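I have been importing some data from MySQL into Postgres. The plan should have been simple: manually recreate the tables with equivalent data types, work out a way to export to CSV, transfer the data, and COPY it into Postgres. Done.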

mysql -u whatever -p whatever -d the_database

SELECT * INTO OUTFILE '/tmp/the_table.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' FROM the_table;

Send it over and import it into Postgres

psql -etcetc -d other_database

COPY the_table FROM '/csv/file/location/the_table.csv' WITH( FORMAT CSV, DELIMITER ',', QUOTE '"', ESCAPE '\', NULL '\N' );

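It has been so long that I had forgotten '0000-00-00' is a thing...
So first I had to come up with a way to deal with the odd data types, preferably on the MySQL end. I wrote this script for the 20 or so tables I plan to import; it works around the incompatibilities and lists out the columns accordingly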

with a as (
    select
        'the_table'::text as tblname,
        'public'::text as schname
), b as (
    select array_to_string( array_agg( x.column_name ), ',' ) as the_cols from (
        select
            case
                when udt_name = 'timestamp'
                then 'NULLIF('|| column_name::text || ',''0000-00-00 00:00:00'')'
                when udt_name = 'date'
                then 'NULLIF('|| column_name::text || ',''0000-00-00'')'
                else column_name::text
            end as column_name
        from information_schema.columns, a
        where table_schema = a.schname
        and table_name = a.tblname
        order by ordinal_position
    ) x
)
select 'SELECT '|| b.the_cols ||' INTO OUTFILE ''/tmp/'|| a.tblname ||'.csv'' FIELDS TERMINATED BY '','' OPTIONALLY ENCLOSED BY ''"'' ESCAPED BY ''\\'' FROM '|| a.tblname ||';' from a,b;

Generate the CSV, fine. Transfer it, fine. Then on the other end...

BEGIN;
ALTER TABLE the_table SET( autovacuum_enabled = false, toast.autovacuum_enabled = false );
COPY the_table FROM '/csv/file/location/the_table.csv' WITH( FORMAT CSV, DELIMITER ',', QUOTE '"', ESCAPE '\', NULL '\N' ); -- '
ALTER TABLE the_table SET( autovacuum_enabled = true, toast.autovacuum_enabled = true );
COMMIT;

Everything went well until I ran into this message:

ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd
CONTEXT:  COPY new_table, line 12345678

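A second table hit the same error, but all the other tables imported successfully.
Now, every column and table in the MySQL database is set to utf8. The first offending table that produced the message is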

CREATE TABLE whatever(
col1 int(11) NOT NULL AUTO_INCREMENT,
col2 date,
col3 int(11),
col4 int(11),
col5 int(11),
col6 int(11),
col7 varchar(64),
PRIMARY KEY(col1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

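So presumably the data should be UTF-8... right? To make sure there were no major mistakes, I edited my.cnf to set the encoding for everything I could think of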

[character sets]
default-character-set=utf8
default-character-set=utf8
character-set-server = utf8
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'

To do the conversion, I changed the case statement in my original "query-generating query" to convert the columns

        case
            when udt_name = 'timestamp'
            then 'NULLIF('|| column_name::text || ',''0000-00-00 00:00:00'')'
            when udt_name = 'date'
            then 'NULLIF('|| column_name::text || ',''0000-00-00'')'
            when udt_name = 'text'
            then 'CONVERT('|| column_name::text || ' USING utf8)'
            else column_name::text
        end as column_name

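Still no luck. After googling "0xed 0xa0 0xbd" I am none the wiser; character sets are not really my thing.
I even opened the 3 GB CSV file at the line it mentions and nothing looked out of place. Looking with a hex editor I could not see those byte values (edit: maybe I did not look hard enough), so I am starting to run out of ideas. Am I missing something really simple? And, more worryingly, could some of the other tables have been corrupted more "silently" too?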
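Rather than hunting by eye in a hex editor, the file can be scanned line by line for bytes that do not decode as strict UTF-8. The following is only a minimal sketch: the function name `find_invalid_utf8` and the sample data are made up for illustration, and the line numbers it reports may not match the ones COPY prints if quoted fields contain embedded newlines.

```python
import io

def find_invalid_utf8(stream, max_hits=20):
    """Return (line_number, byte_offset_in_line, offending_bytes) tuples.

    `stream` must yield raw bytes lines, e.g. a file opened with mode "rb".
    """
    hits = []
    for lineno, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")  # strict decoding rejects surrogate bytes
        except UnicodeDecodeError as exc:
            # Keep the first three bytes from where decoding failed.
            hits.append((lineno, exc.start, raw[exc.start:exc.start + 3]))
            if len(hits) >= max_hits:
                break
    return hits

# Hypothetical sample: line 2 contains the byte sequence from the error message.
sample = io.BytesIO(b'1,"ok"\n2,"bad \xed\xa0\xbd\xed\xb8\x80"\n3,"fine"\n')
hits = find_invalid_utf8(sample)
for lineno, offset, chunk in hits:
    print(lineno, offset, chunk.hex())
```

For a real export, the same function could be called as `find_invalid_utf8(open("/tmp/the_table.csv", "rb"))`, which reads one line at a time and so should cope with a multi-gigabyte file.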

The OS is Ubuntu 14.04, the MySQL version is 5.5.44, and Postgres is 9.4.

Best answer

With nothing else left to try, I went for the simplest solution and just converted the file

iconv -f utf-8 -t utf-8 -c the_file.csv > the_file_iconv.csv

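The new file and the original differed by about 100 bytes, so there must have been invalid bytes in there that I could not see. The data then imported "properly", so I suppose that is good, but it would be nice to know whether there is a way to enforce the correct encoding when the file is created, rather than only finding out at import time.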
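One observation that may explain the error: the bytes 0xed 0xa0 0xbd are what you get when the UTF-16 high surrogate U+D83D is written out as if it were a normal three-byte UTF-8 character. Postgres rejects surrogates in UTF8 text. If (and this is an assumption about the data, not something confirmed above) the affected values are characters outside the Basic Multilingual Plane, such as emoji, stored as surrogate pairs, then `iconv -c` silently drops them, whereas they could instead be re-encoded as proper four-byte UTF-8. A minimal sketch, with a hypothetical function name and sample bytes:

```python
def repair_surrogate_pairs(raw: bytes) -> bytes:
    """Re-encode surrogate pairs that were written as two 3-byte sequences.

    Raises UnicodeDecodeError if the input holds an unpaired surrogate
    or any other byte sequence that is not UTF-8.
    """
    # "surrogatepass" lets the surrogate code points through as-is.
    text = raw.decode("utf-8", errors="surrogatepass")
    # A round trip through UTF-16 joins each high/low pair into one character.
    fixed = text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")
    return fixed.encode("utf-8")

# Hypothetical sample: U+D83D U+DE00 written as 3 bytes each (6 bytes total).
broken = b'bad \xed\xa0\xbd\xed\xb8\x80'
repaired = repair_surrogate_pairs(broken)
print(repaired)
```

Whether repairing is preferable to dropping depends on whether those characters matter to you. Either way, the file should be checked again after the conversion before running COPY.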