Problem description
I am having a problem reading Unicode characters from a CSV file using PHP.
Below is a screenshot of the Unicode CSV file.
The PHP code I use is below.
    $delimiter = ",";
    $row = 1;
    $handle = fopen($filePath, "r");
    while (($data = fgetcsv($handle, 1000, $delimiter)) !== FALSE) {
        $num = count($data);
        $row++;
        for ($c = 0; $c < $num; $c++) {
            echo $data[$c];
        }
    }
    fclose($handle);
With the code above I get the output below in the Chrome browser. It contains junk characters.
But if I append a newline character in the echo statement, as below, it gives the correct output.
echo $data[$c]."\n";
Why does it behave like this? I do not want to append a newline like this.
Recommended answer
The encoding that Windows calls "Unicode" (misleadingly; Unicode is not an encoding) is actually UTF-16LE. This is a two-byte-per-code-unit encoding, so ASCII characters come out as the ASCII byte followed by a zero byte.
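To make the byte layout concrete, here is a small sketch that encodes a short ASCII string as UTF-16LE and dumps the bytes. It assumes the mbstring extension is available; the sample string is made up for illustration.

```php
<?php
// Assumption: the mbstring extension is loaded.
$utf8  = "A,B\n";
$utf16 = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');

// Every ASCII byte is followed by a 0x00 byte, including the comma (2c)
// and the newline (0a).
$hex = implode(' ', str_split(bin2hex($utf16), 2));
echo $hex, "\n"; // 41 00 2c 00 42 00 0a 00
```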
PHP's fgetcsv function doesn't support UTF-16 CSV; it only supports encodings that are ASCII-compatible. It splits on each 0x0A (newline) and 0x2C (comma) byte, but in UTF-16LE both the newline and the comma are two-byte sequences, 0x0A 0x00 and 0x2C 0x00 respectively. That means you get a stray leading 0x00 byte on the front of every field but the first, and you get wrong splits when a value contains a 0x0A or 0x2C byte that is not part of a UTF-16-encoded newline or comma.
When you print this out to UTF-16LE-encoded output, the extra 0x00 byte puts each field out of two-byte-alignment with the last, which means that the browser viewing it sees alternating fields as being out of alignment and prints nonsense characters formed of the lead byte of one character with the trail byte of the one before it.
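The misalignment, and the reason the extra "\n" appears to fix it, can be seen at the byte level. This is only a sketch: explode() stands in for the byte-wise comma split, and the sample line is written out by hand rather than read from a file.

```php
<?php
// "ab,cd" as UTF-16LE bytes, spelled out by hand (no extension needed).
$line = "a\x00" . "b\x00" . ",\x00" . "c\x00" . "d\x00";

// Stand-in for the byte-level split: cut at every 0x2C byte.
$fields = explode(",", $line);
// $fields[0] is 61 00 62 00     (aligned)
// $fields[1] is 00 63 00 64 00  (starts with the orphaned 0x00)

// What `echo $data[$c];` emits: the stray 0x00 shifts every later byte.
$joined = implode("", $fields);

// What `echo $data[$c]."\n";` emits: each 0x0A pairs up with the stray
// 0x00 that follows it and forms a valid UTF-16LE newline (0a 00).
$withNewlines = "";
foreach ($fields as $field) {
    $withNewlines .= $field . "\n";
}

echo bin2hex($joined), "\n";       // 610062000063006400
echo bin2hex($withNewlines), "\n"; // 610062000a00630064000a
```

Read as two-byte units, the first dump is 6100 6200 0063 0064 plus a dangling 00, so "c" and "d" turn into unrelated characters. In the second dump the units are 6100 6200 0a00 6300 6400, which is "ab", a newline, then "cd". So the newline does not repair anything; it just happens to supply the missing half of a code unit.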
So there are two possible things you can do:
- If you have any choice in the matter, avoid UTF-16. Because it's not ASCII-compatible, it breaks lots of tools that expect that. Generally the best encoding is UTF-8, which can include all characters and is still an ASCII superset... unfortunately, older versions of Excel refuse to save CSV files directly in UTF-8.
- Use some other CSV parser that understands UTF-16. It's a good idea to avoid PHP's CSV functions anyway, because they do weird things that don't match standard CSV (inasmuch as there is a standard... at least they don't match RFC 4180 or what Excel produces).
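If the file already exists as UTF-16LE and you cannot change how it is produced, one way to act on the first suggestion is to transcode it to UTF-8 before parsing. The following is a minimal sketch, not a definitive implementation: readUtf16LeCsv is a hypothetical helper name, it assumes the mbstring extension is loaded and that the file fits in memory, and it still relies on fgetcsv, so the caveats about PHP's CSV functions in the second point apply.

```php
<?php
// Hypothetical helper: read a UTF-16LE CSV by converting it to UTF-8 first.
// Assumptions: mbstring is loaded and the whole file fits in memory.
function readUtf16LeCsv(string $filePath, string $delimiter = ","): array
{
    $raw = file_get_contents($filePath);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $filePath");
    }

    // Drop the UTF-16LE byte order mark (FF FE) if the file starts with one.
    if (substr($raw, 0, 2) === "\xFF\xFE") {
        $raw = substr($raw, 2);
    }

    // After this step commas and newlines are single bytes again.
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

    // Feed the converted text to fgetcsv through a temporary stream.
    $handle = fopen('php://temp', 'r+');
    fwrite($handle, $utf8);
    rewind($handle);

    $rows = [];
    while (($data = fgetcsv($handle, 0, $delimiter, '"', '\\')) !== false) {
        $rows[] = $data;
    }
    fclose($handle);

    return $rows;
}
```

Usage would look like `$rows = readUtf16LeCsv($filePath);`, after which each `$rows[$i][$c]` is a UTF-8 string. Remember to serve the page as UTF-8 as well, for example with `header('Content-Type: text/html; charset=utf-8');`.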