Question
Usually we can get a string from a byte[] using something like:
var result = Encoding.UTF8.GetString(bytes);
However, my input is an IEnumerable<byte[]> bytes (the implementation can be any structure of my choice). A character is not guaranteed to lie within a single byte[] (for example, a 2-byte UTF-8 character can have its first byte in bytes[1][length - 1] and its second byte in bytes[2][0]).
Is there any way to decode them without merging/copying all the arrays together? UTF-8 is the main focus, but it would be better if other encodings could be supported too. If there is no other solution, I think implementing my own UTF-8 reader would be the way to go.
I planned to stream them using a MemoryStream; however, Encoding cannot work on a Stream, only on a byte[]. If merged together, the resulting array may be very large (the List<byte[]> already holds up to 4GB).
I am using .NET Standard 2.0. I wish I could use 2.1 (it is not released yet) with Span<byte[]>, which would be perfect for my case!
Recommended answer
The Encoding class can't deal with that directly, but the Decoder returned from Encoding.GetDecoder() can (indeed, that's its entire reason for existing). StreamReader uses a Decoder internally.
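To make the stateful behaviour concrete, here is a minimal sketch in which a 2-byte UTF-8 character is split across two arrays. The class and method names (SplitCharDemo, DecodeSplit) are made up for illustration; the Decoder calls are the standard System.Text ones.

```csharp
using System;
using System.Text;

public static class SplitCharDemo
{
    // Hypothetical demo: decodes two arrays where "é" straddles the boundary.
    public static string DecodeSplit()
    {
        // "é" is U+00E9, encoded in UTF-8 as the two bytes 0xC3 0xA9.
        byte[] first = { (byte)'a', 0xC3 };   // ends in the middle of "é"
        byte[] second = { 0xA9, (byte)'b' };  // starts with the trailing byte of "é"

        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[8];           // comfortably large for this demo
        var result = new StringBuilder();

        // The dangling 0xC3 is not emitted here; the decoder keeps it as state.
        int count = decoder.GetChars(first, 0, first.Length, chars, 0);
        result.Append(chars, 0, count);

        // The stored 0xC3 is combined with 0xA9 from the second array.
        count = decoder.GetChars(second, 0, second.Length, chars, 0);
        result.Append(chars, 0, count);

        return result.ToString();
    }
}
```

Calling Encoding.UTF8.GetString() on each array separately would instead produce replacement characters at the boundary, because each call starts with no memory of the previous one.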
It's slightly fiddly to work with, though, because it populates a char[] rather than returning a string (Encoding.GetString() and StreamReader normally handle the business of populating the char[] for you).
The problem with using a MemoryStream is that you're copying all of the bytes from one array to another, for no gain. If all of your buffers are the same length, you can do this:
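// Needs: using System.Diagnostics; using System.Text;
// bufferSize is the common length of every byte[] in bytes.
var decoder = Encoding.UTF8.GetDecoder();
// +1 in case it includes a work-in-progress char from the previous buffer
char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize) + 1];
foreach (var byteSegment in bytes)
{
    // The decoder remembers any incomplete trailing bytes until the next call.
    int numChars = decoder.GetChars(byteSegment, 0, byteSegment.Length, chars, 0);
    Debug.WriteLine(new string(chars, 0, numChars));
}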
If the buffers have different lengths:
var decoder = Encoding.UTF8.GetDecoder();
char[] chars = Array.Empty<char>();
foreach (var byteSegment in bytes)
{
    // +1 in case it includes a work-in-progress char from the previous buffer
    int charsMinSize = Encoding.UTF8.GetMaxCharCount(byteSegment.Length) + 1;
    // Grow the shared char buffer only when this segment needs more room.
    if (chars.Length < charsMinSize)
        chars = new char[charsMinSize];
    int numChars = decoder.GetChars(byteSegment, 0, byteSegment.Length, chars, 0);
    Debug.WriteLine(new string(chars, 0, numChars));
}
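Since the question also asks about encodings other than UTF-8, the same loop can be wrapped in a helper that takes any Encoding. The following is only a sketch: SegmentDecoder and DecodeSegments are hypothetical names, and the final flush step is an addition that is not part of the answer above. It is there so that incomplete bytes at the very end of the input are not silently dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SegmentDecoder
{
    // Hypothetical helper: yields the text that becomes complete with each segment.
    public static IEnumerable<string> DecodeSegments(
        IEnumerable<byte[]> segments, Encoding encoding)
    {
        Decoder decoder = encoding.GetDecoder();
        char[] chars = Array.Empty<char>();

        foreach (byte[] segment in segments)
        {
            // +1 in case a work-in-progress char from the previous segment completes here.
            int needed = encoding.GetMaxCharCount(segment.Length) + 1;
            if (chars.Length < needed)
                chars = new char[needed];

            // flush: false keeps incomplete trailing bytes for the next segment.
            int count = decoder.GetChars(segment, 0, segment.Length, chars, 0, flush: false);
            if (count > 0)
                yield return new string(chars, 0, count);
        }

        // Flush whatever is still buffered. With the default fallback, dangling
        // bytes typically surface as replacement characters.
        byte[] none = Array.Empty<byte>();
        int tailSize = decoder.GetCharCount(none, 0, 0, flush: true);
        if (tailSize > 0)
        {
            char[] tail = new char[tailSize];
            int count = decoder.GetChars(none, 0, 0, tail, 0, flush: true);
            if (count > 0)
                yield return new string(tail, 0, count);
        }
    }
}
```

A caller could write each yielded string to a TextWriter or append it to a StringBuilder, for example with foreach (var part in SegmentDecoder.DecodeSegments(bytes, Encoding.UTF8)). Nothing larger than one segment's worth of characters is held at a time.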