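A typical ML pipeline involves processing a pandas or Dask DataFrame into a form that can be passed to an ML model. However, many dask-ml models cannot accept a Dask DataFrame, because Dask DataFrames do not track the number of rows in each partition. Calling the fit method raises the error "Cannot fit on dask.dataframe due to unknown partition lengths." What should I do to pass a Dask DataFrame to a dask-ml model?
Here is an example: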
import dask.dataframe as dd
import pandas as pd
from dask_ml.cluster import KMeans
df = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3, 4, 5],
                                  'B': [6, 7, 8, 9, 10]}),
                    npartitions=2)
kmeans = KMeans()
kmeans.fit(df)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-53-6c1545864b12> in <module>()
6
7 kmeans = KMeans()
----> 8 kmeans.fit(df)
~/anaconda3/envs/pds/lib/python3.6/site-packages/dask_ml/cluster/k_means.py in fit(self, X, y)
187
188 def fit(self, X, y=None):
--> 189 X = self._check_array(X)
190 labels, centroids, inertia, n_iter = k_means(
191 X,
~/anaconda3/envs/pds/lib/python3.6/site-packages/dask_ml/utils.py in wraps(*args, **kwargs)
298 def wraps(*args, **kwargs):
299 with _timer(f.__name__, _logger=logger, level=level):
--> 300 results = f(*args, **kwargs)
301 return results
302
~/anaconda3/envs/pds/lib/python3.6/site-packages/dask_ml/cluster/k_means.py in _check_array(self, X)
159 elif isinstance(X, dd.DataFrame):
160 raise TypeError(
--> 161 "Cannot fit on dask.dataframe due to unknown " "partition lengths."
162 )
163
TypeError: Cannot fit on dask.dataframe due to unknown partition lengths.
Best answer
This is now supported in dask-ml master as of https://github.com/dask/dask-ml/pull/393.
It will be included in the Dask-ML 0.10 release.
Source: Stack Overflow, "python - How to pass a Dask DataFrame as input to a dask-ml model?": https://stackoverflow.com/questions/52583316/