Problem description
I am trying to read parquet files using the dask read_parquet method and the filters kwarg. However, it sometimes doesn't filter according to the given condition.
Example: creating and saving a data frame with a dates column
import pandas as pd
import numpy as np
import dask.dataframe as dd

nums = range(1, 6)
dates = pd.date_range('2018-07-01', periods=5, freq='1d')
df = pd.DataFrame({'dates': dates, 'nums': nums})

# Write the frame as a parquet dataset with three partitions (row groups).
# Note: to_parquet returns None, so there is nothing useful to assign here.
dd.from_pandas(df, npartitions=3).to_parquet('test_par', engine='fastparquet')
When I read and filter on the dates column from the 'test_par' folder, it doesn't seem to work:
filters=[('dates', '>', np.datetime64('2018-07-04'))]
df = dd.read_parquet('test_par', engine='fastparquet', filters=filters).compute()
As you can see in the output, 2018-07-03 and 2018-07-04 are present.
            dates  nums
index
2      2018-07-03     3
3      2018-07-04     4
4      2018-07-05     5
Am I doing something wrong? Or should I report this on GitHub?
Recommended answer
The filters keyword is a row-group-wise action (a row group is the parquet term for a set of data rows, like a partition of a data-frame). It does not do any filtering within partitions.
When you use filters, you exclude only those partitions in which, according to the max/min statistics in the file, no row can possibly match the given filter. For example, if you specify x > 5, a partition with min=2, max=4 will be excluded, but one with min=2, max=6 will not, even though the latter may contain only some rows that meet the filter.
To filter the data, you should still use the usual syntax
df[df.dates > np.datetime64('2018-07-04')]
in addition to the filters kwarg, and view the use of filters as an optional optimisation. Without it, Dask would have to read even the partitions containing no good data and then apply the condition, producing no results for those partitions. Better not to load them at all, if possible.
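A minimal sketch of the row-level mask, shown here with plain pandas so it runs without a parquet dataset on disk (the dask dataframe syntax is identical, with a trailing .compute()); the data and cutoff mirror the question:

```python
import numpy as np
import pandas as pd

# Same five-row frame the question builds.
df = pd.DataFrame({
    'dates': pd.date_range('2018-07-01', periods=5, freq='1d'),
    'nums': range(1, 6),
})

# filters=... would only prune whole row groups; the exact filtering
# still has to be done with an ordinary boolean mask:
cutoff = np.datetime64('2018-07-04')
result = df[df.dates > cutoff]
print(result)  # only the 2018-07-05 row survives
```

In dask, applying this mask after read_parquet gives the exact answer, and passing filters on top merely lets Dask skip row groups that cannot contribute.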