Why does Apache Orc RecordReader.searchArgument() not filter correctly?

Question

Here is a simple program that:

  1. Writes records into an Orc file
  2. Then tries to read the file using predicate pushdown (searchArgument)

Questions:

  1. Is this the right way to use predicate pushdown in Orc?
  2. The read(..) method seems to return all the records, completely ignoring the searchArgument. Why is that?

Notes:

I have not been able to find any useful unit test that demonstrates how predicate pushdown works in Orc (Orc on GitHub), nor any clear documentation on this feature. I tried looking at the Spark and Presto code, but could not find anything useful there either.

The code below is available at https://github.com/melanio/codecheese-blog-examples/tree/master/orc-examples/src/main/java/codecheese/blog/examples/orc

public class TestRoundTrip {
public static void main(String[] args) throws IOException {
    final String file = "tmp/test-round-trip.orc";
    new File(file).delete();

    final long highestX = 10000L;
    final Configuration conf = new Configuration();

    write(file, highestX, conf);
    read(file, highestX, conf);
}

private static void read(String file, long highestX, Configuration conf) throws IOException {
    Reader reader = OrcFile.createReader(
            new Path(file),
            OrcFile.readerOptions(conf)
    );

    //Retrieve x that is "highestX - 1000". So, only 1 value should've been retrieved.
    Options readerOptions = new Options(conf)
            .searchArgument(
                    SearchArgumentFactory
                            .newBuilder()
                            .equals("x", Type.LONG, highestX - 1000)
                            .build(),
                    new String[]{"x"}
            );
    RecordReader rows = reader.rows(readerOptions);
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();

    while (rows.nextBatch(batch)) {
        LongColumnVector x = (LongColumnVector) batch.cols[0];
        LongColumnVector y = (LongColumnVector) batch.cols[1];

        for (int r = 0; r < batch.size; r++) {
            long xValue = x.vector[r];
            long yValue = y.vector[r];

            System.out.println(xValue + ", " + yValue);
        }
    }
    rows.close();
}

private static void write(String file, long highestX, Configuration conf) throws IOException {
    TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
    Writer writer = OrcFile.createWriter(
            new Path(file),
            OrcFile.writerOptions(conf).setSchema(schema)
    );

    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector x = (LongColumnVector) batch.cols[0];
    LongColumnVector y = (LongColumnVector) batch.cols[1];
    for (int r = 0; r < highestX; ++r) {
        int row = batch.size++;
        x.vector[row] = r;
        y.vector[row] = r * 3;
        // If the batch is full, write it out and start over.
        if (batch.size == batch.getMaxSize()) {
            writer.addRowBatch(batch);
            batch.reset();
        }
    }
    if (batch.size != 0) {
        writer.addRowBatch(batch);
        batch.reset();
    }
    writer.close();
}

}

Recommended answer

I encountered the same issue, and I think it was fixed by changing

.equals("x", Type.LONG,

to

.equals("x", PredicateLeaf.Type.LONG,

With this change, the reader seems to return only the batch that contains the relevant rows, rather than only the single row we asked for.
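The observation that a whole batch comes back, rather than a single row, is consistent with how min/max-based skipping generally behaves: as far as I understand it, a search argument is checked against statistics kept for groups of rows, so groups that cannot match are skipped and groups that might match are returned in full. The sketch below is a simplified model of that idea in plain Java, not Orc's implementation. The class name, the method names, and the group size of 1,000 rows are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified model of min/max-based row-group skipping.
 * This is NOT Orc's code; every name here is hypothetical.
 */
public class RowGroupPruningSketch {

    /** A run of consecutive rows plus the min/max statistics kept for it. */
    static final class RowGroup {
        final long[] values;
        final long min;
        final long max;

        RowGroup(long[] values) {
            this.values = values;
            long lo = Long.MAX_VALUE;
            long hi = Long.MIN_VALUE;
            for (long v : values) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
            this.min = lo;
            this.max = hi;
        }

        /** True if the statistics cannot rule out a row equal to target. */
        boolean mightContain(long target) {
            return min <= target && target <= max;
        }
    }

    /** Splits the values 0..rowCount-1 into groups of at most groupSize rows. */
    static List<RowGroup> buildGroups(int rowCount, int groupSize) {
        List<RowGroup> groups = new ArrayList<>();
        for (int start = 0; start < rowCount; start += groupSize) {
            int length = Math.min(groupSize, rowCount - start);
            long[] values = new long[length];
            for (int i = 0; i < length; i++) {
                values[i] = start + i;
            }
            groups.add(new RowGroup(values));
        }
        return groups;
    }

    /** Coarse step: returns every row of each group whose statistics admit target. */
    static List<Long> readWithPushdown(List<RowGroup> groups, long target) {
        List<Long> rows = new ArrayList<>();
        for (RowGroup group : groups) {
            if (group.mightContain(target)) {
                for (long v : group.values) {
                    rows.add(v);
                }
            }
        }
        return rows;
    }

    /** Exact step: the caller re-applies the predicate to each returned row. */
    static List<Long> filterExactly(List<Long> rows, long target) {
        List<Long> matches = new ArrayList<>();
        for (long v : rows) {
            if (v == target) {
                matches.add(v);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Same numbers as the question: 10000 rows, looking for x == 9000.
        List<RowGroup> groups = buildGroups(10_000, 1_000);
        List<Long> coarse = readWithPushdown(groups, 9_000L);
        List<Long> exact = filterExactly(coarse, 9_000L);
        System.out.println("rows after group skipping: " + coarse.size());
        System.out.println("rows after exact filter: " + exact.size());
    }
}
```

Under this model, the loop in read(..) would still need its own check on xValue before printing if exactly one row is wanted, since the pushdown step only narrows down which groups are read.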
