Question
There are invalid byte sequences that can't be converted to Unicode strings. How do I detect that when converting []byte to string in Go?
Recommended answer
You can, as Tim Cooper noted, test UTF-8 validity with utf8.Valid.
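As a minimal sketch of that check, the validation can be wrapped around the conversion so that invalid input is reported instead of silently accepted. The names bytesToString and errInvalidUTF8 are hypothetical, introduced only for this illustration; utf8.Valid is the real function from unicode/utf8.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error for this sketch.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// bytesToString is a hypothetical helper: it converts b to a string
// only if b consists entirely of valid UTF-8.
func bytesToString(b []byte) (string, error) {
	if !utf8.Valid(b) {
		return "", errInvalidUTF8
	}
	return string(b), nil
}

func main() {
	inputs := [][]byte{[]byte("héllo"), {0xff, 'a'}}
	for _, b := range inputs {
		s, err := bytesToString(b)
		if err != nil {
			fmt.Printf("%v: %v\n", b, err)
			continue
		}
		fmt.Println(s)
	}
}
```

If the data is already a string, utf8.ValidString performs the same check without a conversion.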
But! You might be thinking that converting non-UTF-8 bytes to a Go string is impossible. In fact, "In Go, a string is in effect a read-only slice of bytes"; it can contain bytes that aren't valid UTF-8, which you can print, access via indexing, pass to WriteString methods, or even round-trip back to a []byte (to Write, say).
There are two places in the language where Go does do UTF-8 decoding of strings for you.
- when you do for i, r := range s, the r is a Unicode code point as a value of type rune
- when you do the conversion []rune(s), Go decodes the whole string to runes.
(Note that rune is an alias for int32, not a completely different type.)
In both these instances invalid UTF-8 is replaced with U+FFFD, the replacement character reserved for uses like this. More is in the spec sections on for statements and conversions between strings and other types. These conversions never crash, so you only need to actively check for UTF-8 validity if it's relevant to your application, like if you can't accept the U+FFFD replacement and need to return an error on mis-encoded input.
Since that behavior's baked into the language, you can expect it from libraries, too. U+FFFD is available as utf8.RuneError and is returned by functions in the utf8 package.
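One way to use that, sketched here under the assumption that you want to know where the bad input starts: utf8.DecodeRune returns utf8.RuneError with a width of 1 for an invalid sequence, while a correctly encoded U+FFFD decodes with a width of 3. The helper name firstInvalid is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid is a hypothetical helper that returns the byte offset of
// the first invalid UTF-8 sequence in b, or -1 if b is valid UTF-8.
func firstInvalid(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		// Invalid input decodes as (utf8.RuneError, 1). A genuine,
		// correctly encoded U+FFFD has size 3, so it is not flagged.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalid([]byte("ok\xffbad"))) // 2
	fmt.Println(firstInvalid([]byte("ok\uFFFD")))  // -1
	fmt.Println(firstInvalid(nil))                 // -1
}
```

Checking the width matters because the rune value alone cannot distinguish a substituted U+FFFD from one that was really present in the input.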
Here's a sample program showing what Go does with a []byte holding invalid UTF-8:
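package main

import "fmt"

func main() {
	a := []byte{0xff} // 0xff can never appear in valid UTF-8
	s := string(a)    // the conversion keeps the byte as-is
	fmt.Println(s)

	// Ranging over the string decodes it; the invalid byte yields U+FFFD.
	for _, r := range s {
		fmt.Println(r)
	}

	// Converting to []rune decodes the whole string the same way.
	rs := []rune(s)
	fmt.Println(rs)
}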
Output will look different in different environments, but in the Playground it looks like:
�
65533
[65533]