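I am trying to transform several columns with QuantileTransformer, but the results do not seem consistent. They also depend on the column order, even for a small dataset.
I know I could create a separate transformer for each feature, but according to the documentation, the transformer should accept an array of shape (n_samples, n_features).
Here is a Google Colab notebook that reproduces the result.
Is there a way to apply QuantileTransformer and get consistent results (so that the same raw value always maps to the same transformed value, rather than one-to-many)?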
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

def unique_values(x):
    return x.unique().tolist()

df = pd.read_csv('https://storage.googleapis.com/ml_universities/california_housing_train.csv', usecols=[0, 1])
columns = ['latitude', 'longitude']
qt = QuantileTransformer()
q_features = qt.fit_transform(df)
suffix = '__qt'
qdf = df.join(pd.DataFrame(q_features, columns=columns), rsuffix=suffix)
for col in columns:
    q_col = f'{col}{suffix}'
    print({col: qdf[col].nunique(), q_col: qdf[q_col].nunique()})
    gdf = qdf.groupby(col)[q_col].agg([pd.Series.nunique, unique_values])
    print(gdf.sort_values('nunique', ascending=False).head())
Result:
{'latitude': 840, 'latitude__qt': 827}
          nunique                                      unique_values
latitude
34.07       102.0  [0.9865865865865866, 0.9719719719719734, 0.963...
34.08       101.0  [0.980980980980981, 0.9474474474474475, 0.9214...
34.06        94.0  [0.9846403596403596, 0.932932932932933, 0.9294...
34.10        88.0  [0.9891329870516945, 0.9882813721745806, 0.987...
34.05        87.0  [0.9719719719719734, 0.9269269269269284, 0.923...
{'longitude': 827, 'longitude__qt': 842}
           nunique                                      unique_values
longitude
-118.31       50.0  [0.6276276276276276, 0.5721203907954981, 0.511...
-118.32       49.0  [0.5369214480068981, 0.504004004004004, 0.4804...
-118.12       49.0  [0.5418393378488674, 0.5415415415415415, 0.540...
-117.25       48.0  [0.5335335335335335, 0.5327261051927988, 0.452...
-118.15       47.0  [0.5495495495495496, 0.5418393378488674, 0.541...
With a different column order:
df = pd.read_csv('https://storage.googleapis.com/ml_universities/california_housing_train.csv', usecols=[0, 1])
columns = ['longitude', 'latitude']
qt = QuantileTransformer()
q_features = qt.fit_transform(df)
suffix = '__qt'
qdf = df.join(pd.DataFrame(q_features, columns=columns), rsuffix=suffix)
for col in columns:
    q_col = f'{col}{suffix}'
    print({col: qdf[col].nunique(), q_col: qdf[q_col].nunique()})
    gdf = qdf.groupby(col)[q_col].agg([pd.Series.nunique, unique_values])
    print(gdf.sort_values('nunique', ascending=False).head())
Result:
{'longitude': 827, 'longitude__qt': 827}
           nunique            unique_values
longitude
-124.35        1.0  [9.999999977795539e-08]
-118.31        1.0     [0.5900900900900901]
-118.41        1.0      [0.531031031031031]
-118.40        1.0     [0.5355355355355356]
-118.39        1.0      [0.542542542542544]
{'latitude': 840, 'latitude__qt': 842}
          nunique                             unique_values
latitude
37.74         2.0  [0.7602602602602603, 0.7577577577577578]
37.37         2.0  [0.6806806806806807, 0.6816816816816816]
32.54         1.0                   [9.999999977795539e-08]
38.34         1.0                      [0.8848848848848849]
38.36         1.0                      [0.8873873873873874]
Best answer
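The problem is that you never changed the column order; you only renamed the columns. fit_transform returns a plain array whose columns follow the order of the input DataFrame, so the names you pass to pd.DataFrame have to match that order. If you actually reorder the columns, you get the correct results. I also set the random_state parameter for reproducibility.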
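To see the effect of renaming in isolation, here is a minimal sketch on synthetic data rather than the housing CSV. The grid values, n_quantiles=10, and the helper max_outputs_per_input are assumptions made for illustration; only QuantileTransformer and the pandas calls come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in for the housing data (hypothetical values on a coarse grid,
# so that many rows share the same raw value).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "longitude": -124 + rng.integers(0, 50, size=400) * 0.25,
    "latitude": 32 + rng.integers(0, 50, size=400) * 0.25,
})

qt = QuantileTransformer(n_quantiles=10, random_state=0)
q = qt.fit_transform(df)  # plain ndarray; columns are in df's own order


def max_outputs_per_input(raw, transformed):
    """Largest number of distinct transformed values seen for one raw value."""
    return int(transformed.groupby(raw).nunique().max())


# Labels in the frame's real order: each raw value maps to one transformed value.
right = pd.DataFrame(q, columns=["longitude", "latitude"])
# Labels swapped, as in the question: 'latitude' now holds transformed longitude.
wrong = pd.DataFrame(q, columns=["latitude", "longitude"])

print(max_outputs_per_input(df["latitude"], right["latitude"]))
print(max_outputs_per_input(df["latitude"], wrong["latitude"]))
```

The first count stays at one, while the second is larger, because grouping by latitude is then really grouping unrelated longitude values. The full corrected script for the original data follows.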
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

def unique_values(x):
    return x.unique().tolist()

df = pd.read_csv('https://storage.googleapis.com/ml_universities/california_housing_train.csv', usecols=[0, 1])
columns = ['latitude', 'longitude']
# Change the column order
df = df[columns]
qt = QuantileTransformer(random_state=0)
q_features = qt.fit_transform(df)
suffix = '__qt'
qdf = df.join(pd.DataFrame(q_features, columns=columns), rsuffix=suffix)
for col in columns:
    q_col = f'{col}{suffix}'
    print({col: qdf[col].nunique(), q_col: qdf[q_col].nunique()})
    gdf = qdf.groupby(col)[q_col].agg([pd.Series.nunique, unique_values])
    print(gdf.sort_values('nunique', ascending=False).head())
This produces the same output, only in a different order (which is what you want, since the column order was switched), as:
df = pd.read_csv('https://storage.googleapis.com/ml_universities/california_housing_train.csv', usecols=[0, 1])
columns = ['longitude', 'latitude']
df = df[columns] # Changing the column order
qt = QuantileTransformer()
q_features = qt.fit_transform(df)
suffix = '__qt'
qdf = df.join(pd.DataFrame(q_features, columns=columns), rsuffix=suffix)
for col in columns:
    q_col = f'{col}{suffix}'
    print({col: qdf[col].nunique(), q_col: qdf[q_col].nunique()})
    gdf = qdf.groupby(col)[q_col].agg([pd.Series.nunique, unique_values])
    print(gdf.sort_values('nunique', ascending=False).head())
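One way to rule out this kind of mismatch is to take the output names from the DataFrame itself instead of from a separately maintained list. The sketch below assumes synthetic data and a hypothetical helper named quantile_columns; it is not part of scikit-learn or pandas. It also checks that reordering the input columns leaves each column's transformed values unchanged, since the transformer handles every feature independently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer


def quantile_columns(df, suffix="__qt", **qt_kwargs):
    """Return df plus one transformed column per input column (hypothetical helper)."""
    qt = QuantileTransformer(**qt_kwargs)
    q = qt.fit_transform(df)
    # Names come from df itself, so they cannot drift from the array's column order.
    q_df = pd.DataFrame(q, columns=df.columns, index=df.index).add_suffix(suffix)
    return df.join(q_df)


# Synthetic stand-in for the housing data (hypothetical values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "longitude": -124 + rng.integers(0, 50, size=400) * 0.25,
    "latitude": 32 + rng.integers(0, 50, size=400) * 0.25,
})

a = quantile_columns(df, n_quantiles=10, random_state=0)
b = quantile_columns(df[["latitude", "longitude"]], n_quantiles=10, random_state=0)

# 400 rows is below the default subsample size, so no random subsampling occurs
# and each column's quantiles depend only on that column's values.
print(np.allclose(a["latitude__qt"], b["latitude__qt"]))
print(np.allclose(a["longitude__qt"], b["longitude__qt"]))
```

With larger inputs, where the transformer subsamples rows before estimating quantiles, the values may depend on random_state, so the equality check above is only claimed for this small case.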