Problem Description
Here is a simple program that:
- Writes records into an Orc file
- Then tries to read the file using predicate pushdown (searchArgument)
Questions:
- Is this the right way to use predicate pushdown in Orc?
- The read(..) method seems to return all the records, completely ignoring the searchArgument. Why is that?
Note:
I have not been able to find any useful unit test that demonstrates how predicate pushdown works in Orc (Orc on GitHub). Nor have I been able to find any clear documentation on this feature. I tried looking at the Spark and Presto code, but I could not find anything useful.
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf.Type;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.Reader.Options;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class TestRoundTrip {
public static void main(String[] args) throws IOException {
final String file = "tmp/test-round-trip.orc";
new File(file).delete();
final long highestX = 10000L;
final Configuration conf = new Configuration();
write(file, highestX, conf);
read(file, highestX, conf);
}
private static void read(String file, long highestX, Configuration conf) throws IOException {
Reader reader = OrcFile.createReader(
new Path(file),
OrcFile.readerOptions(conf)
);
//Retrieve x that is "highestX - 1000". So, only 1 value should've been retrieved.
Options readerOptions = new Options(conf)
.searchArgument(
SearchArgumentFactory
.newBuilder()
.equals("x", Type.LONG, highestX - 1000)
.build(),
new String[]{"x"}
);
RecordReader rows = reader.rows(readerOptions);
VectorizedRowBatch batch = reader.getSchema().createRowBatch();
while (rows.nextBatch(batch)) {
LongColumnVector x = (LongColumnVector) batch.cols[0];
LongColumnVector y = (LongColumnVector) batch.cols[1];
for (int r = 0; r < batch.size; r++) {
long xValue = x.vector[r];
long yValue = y.vector[r];
System.out.println(xValue + ", " + yValue);
}
}
rows.close();
}
private static void write(String file, long highestX, Configuration conf) throws IOException {
TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
Writer writer = OrcFile.createWriter(
new Path(file),
OrcFile.writerOptions(conf).setSchema(schema)
);
VectorizedRowBatch batch = schema.createRowBatch();
LongColumnVector x = (LongColumnVector) batch.cols[0];
LongColumnVector y = (LongColumnVector) batch.cols[1];
for (int r = 0; r < highestX; ++r) {
int row = batch.size++;
x.vector[row] = r;
y.vector[row] = r * 3;
// If the batch is full, write it out and start over.
if (batch.size == batch.getMaxSize()) {
writer.addRowBatch(batch);
batch.reset();
}
}
if (batch.size != 0) {
writer.addRowBatch(batch);
batch.reset();
}
writer.close();
}
}
Answer
I encountered the same issue, and I think it was rectified by changing

.equals("x", Type.LONG,

to

.equals("x", PredicateLeaf.Type.LONG,
On using this, the reader seems to return only the batch containing the relevant rows, not just the one row we asked for.
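That batch-level behavior is consistent with how ORC applies search arguments: the statistics-based pruning works per row group (10,000 rows by default, per the ORC row-index design), so a whole row group is either skipped or read in full, and the caller must still filter individual rows within the surviving groups. The following is a minimal, self-contained sketch of that pruning idea, with no ORC dependency; the row-group size, data range, and class name are illustrative assumptions, not the library's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupPruningSketch {
    // Illustrative row-group size, matching ORC's default row-index stride.
    static final long ROW_GROUP_SIZE = 10_000L;
    static final long TOTAL_ROWS = 50_000L;

    public static void main(String[] args) {
        // Column x holds 0..49,999, split into row groups of 10,000 rows.
        long target = 12_345L;
        List<Long> returned = new ArrayList<>();
        for (long groupStart = 0; groupStart < TOTAL_ROWS; groupStart += ROW_GROUP_SIZE) {
            long min = groupStart;                      // per-group min statistic
            long max = groupStart + ROW_GROUP_SIZE - 1; // per-group max statistic
            // The pushdown decision: skip the group only when the statistics
            // prove that x == target is impossible within it.
            if (target < min || target > max) {
                continue; // whole row group skipped; its rows are never returned
            }
            // A surviving group is returned in full; no per-row filtering here.
            for (long x = groupStart; x <= max; x++) {
                returned.add(x);
            }
        }
        // One full row group (10,000 rows) survives even though only one row matches.
        System.out.println("rows returned: " + returned.size());
        System.out.println("contains target: " + returned.contains(target));
    }
}
```

This is why the original program still has to check x.vector[r] inside the read loop: the search argument narrows which batches are read, not which rows appear in them.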
This concludes the article on why Apache Orc RecordReader.searchArgument() does not filter correctly; we hope the answer above is helpful.