Problem description

I am reading data from a file that has, unfortunately, two types of character encoding.

There is a header and a body. The header is always in ASCII and defines the character set that the body is encoded in.

The header is not fixed length and must be run through a parser to determine its content/length.

The file may also be quite large, so I need to avoid bringing the entire content into memory.

So I started off with a single InputStream. I wrap it initially with an InputStreamReader with ASCII and decode the header and extract the character set for the body. All good.

Then I create a new InputStreamReader with the correct character set, drop it over the same InputStream and start trying to read the body.

Unfortunately, it appears (and the Javadoc confirms this) that InputStreamReader may choose to read ahead for efficiency purposes. So reading the header chews up some or all of the body.
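
The read-ahead can be observed with a small experiment. The sketch below is only an illustration: the class name `ReadAheadDemo`, the helper `bytesLeftAfterOneChar`, and the sample header text are all made up for this example. It reads a single char through an InputStreamReader and then asks the underlying stream how many bytes are still available. On common JDKs the result is far smaller than "total minus one", but the exact amount consumed is an implementation detail.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
final class ReadAheadDemo
{
    // Reads one char through an InputStreamReader and reports how many
    // bytes the underlying stream still has available afterwards.
    static int bytesLeftAfterOneChar( byte[] data ) throws IOException
    {
        InputStream in = new ByteArrayInputStream( data );
        Reader reader = new InputStreamReader( in, StandardCharsets.US_ASCII );
        reader.read();
        return in.available();
    }

    public static void main( String[] args ) throws IOException
    {
        // Made-up sample: an ASCII header line followed by a body.
        byte[] data = "charset=UTF-8\nbody text".getBytes( StandardCharsets.UTF_8 );
        System.out.println( "total bytes: " + data.length );
        System.out.println( "left after one char: " + bytesLeftAfterOneChar( data ) );
    }
}
```

If the reader consumed only what it needed, the second number would be one less than the first.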

Does anyone have any suggestions for working around this issue? Would creating a CharsetDecoder manually and feeding in one byte at a time be a good idea (possibly wrapped in a custom Reader implementation)?

Thanks in advance.

Edit: My final solution was to write an InputStreamReader that has no buffering, to ensure I can parse the header without chewing up part of the body. Although this is not terribly efficient, I wrap the raw InputStream with a BufferedInputStream, so it won't be an issue.

import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// An InputStreamReader that only consumes as many bytes as is necessary.
// It does not do any read-ahead.
public class InputStreamReaderUnbuffered extends Reader
{
    private final CharsetDecoder charsetDecoder;
    private final InputStream inputStream;
    // Holds the bytes of the character currently being decoded.
    private final ByteBuffer byteBuffer = ByteBuffer.allocate( 16 );
    // Room for two chars so that a surrogate pair fits.
    private final CharBuffer charBuffer = CharBuffer.allocate( 2 );

    public InputStreamReaderUnbuffered( InputStream inputStream, Charset charset )
    {
        this.inputStream = inputStream;
        charsetDecoder = charset.newDecoder();
        charBuffer.flip(); // start with no decoded chars pending
    }

    @Override
    public int read() throws IOException
    {
        // Hand out the second half of a surrogate pair decoded earlier.
        if ( charBuffer.hasRemaining() )
            return charBuffer.get();

        while ( true )
        {
            int b = inputStream.read();

            if ( b == -1 )
            {
                if ( byteBuffer.position() > 0 )
                    throw new IOException( "Unexpected end of stream, character truncated" );

                return -1;
            }

            byteBuffer.put( (byte)b );
            byteBuffer.flip();
            charBuffer.clear();

            // endOfInput is false, so an incomplete multi-byte sequence is kept
            // in byteBuffer instead of being reported as malformed input.
            CoderResult result = charsetDecoder.decode( byteBuffer, charBuffer, false );

            byteBuffer.compact();
            charBuffer.flip();

            if ( result.isError() )
                result.throwException();

            if ( charBuffer.hasRemaining() )
                return charBuffer.get();
        }
    }

    @Override
    public int read( char[] cbuf, int off, int len ) throws IOException
    {
        for ( int i = 0; i < len; i++ )
        {
            int ch = read();

            if ( ch == -1 )
                return i == 0 ? -1 : i;

            cbuf[ off + i ] = (char)ch; // honor the caller's offset
        }

        return len;
    }

    @Override
    public void close() throws IOException
    {
        inputStream.close();
    }
}


Recommended answer

Why don't you use 2 InputStreams? One for reading the header and another for the body.

The second InputStream should skip the header bytes.
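
A minimal sketch of this two-stream idea follows. Several things in it are assumptions for illustration only: the header is taken to be a single ASCII line of the form `charset=<name>` terminated by `\n`, and the names `TwoStreamRead`, `readBody`, and `skipFully` are hypothetical. The body is collected into a String purely to keep the demo short; for a large file the body reader would be consumed incrementally instead.

```java
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class illustrating the two-stream approach.
final class TwoStreamRead
{
    // Skips exactly n bytes; InputStream.skip may skip fewer than requested.
    static void skipFully( InputStream in, long n ) throws IOException
    {
        while ( n > 0 )
        {
            long skipped = in.skip( n );
            if ( skipped <= 0 )
            {
                if ( in.read() == -1 )
                    throw new EOFException( "Stream ended inside the header" );
                skipped = 1;
            }
            n -= skipped;
        }
    }

    static String readBody( Path file ) throws IOException
    {
        // First stream: parse the header. Read-ahead is harmless here
        // because this stream is discarded afterwards.
        String headerLine;
        try ( BufferedReader header = new BufferedReader( new InputStreamReader(
                Files.newInputStream( file ), StandardCharsets.US_ASCII ) ) )
        {
            headerLine = header.readLine();
        }
        if ( headerLine == null || !headerLine.startsWith( "charset=" ) )
            throw new IOException( "Missing charset header" );

        Charset bodyCharset = Charset.forName( headerLine.substring( "charset=".length() ) );
        // ASCII is one byte per char; add one for the assumed '\n' terminator.
        long headerBytes = headerLine.length() + 1;

        // Second stream: skip the header bytes, then decode the body.
        try ( InputStream in = Files.newInputStream( file ) )
        {
            skipFully( in, headerBytes );
            Reader body = new InputStreamReader( in, bodyCharset );
            StringBuilder text = new StringBuilder();
            char[] buf = new char[ 4096 ];
            int n;
            while ( ( n = body.read( buf ) ) != -1 )
                text.append( buf, 0, n );
            return text.toString();
        }
    }

    public static void main( String[] args ) throws IOException
    {
        Path file = Files.createTempFile( "two-stream", ".dat" );
        try
        {
            Files.write( file, "charset=UTF-8\nh\u00e9llo".getBytes( StandardCharsets.UTF_8 ) );
            System.out.println( readBody( file ) );
        }
        finally
        {
            Files.delete( file );
        }
    }
}
```

This only works when the source can be opened twice, such as a file on disk. For a one-shot stream such as a socket, an unbuffered reader like the one in the question is still needed.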

09-13 21:17