Updated
I've uploaded a dummy data set, link here. The df.head():
It has 4 classes in total, and df.object.value_counts():
human 23
car 13
cat 5
dog 3
I want to do proper K-Fold validation splits over a multi-class object detection data set.
Initial Approach
To achieve proper k-fold validation splits, I took the object counts and the number of bounding boxes into account. As I understand it, the K-fold splitting strategy mostly depends on the data set (meta information). But for now, with this data set, I've tried something like the following:
import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=101)

# main_df has one row per bounding box -> count boxes per image.
df_folds = main_df[['image_id']].copy()
df_folds.loc[:, 'bbox_count'] = 1
df_folds = df_folds.groupby('image_id').count()
# Number of distinct classes per image.
df_folds.loc[:, 'object_count'] = main_df.groupby('image_id')['object'].nunique()
# Stratification key: distinct-class count combined with a binned bbox count.
df_folds.loc[:, 'stratify_group'] = np.char.add(
    df_folds['object_count'].values.astype(str),
    df_folds['bbox_count'].apply(lambda x: f'_{x // 15}').values.astype(str)
)
# Assign each image_id to a fold based on the stratification key.
df_folds.loc[:, 'fold'] = 0
for fold_number, (train_index, val_index) in enumerate(
        skf.split(X=df_folds.index, y=df_folds['stratify_group'])):
    df_folds.loc[df_folds.iloc[val_index].index, 'fold'] = fold_number
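For reference, a minimal sketch of how the resulting fold column could then be used to pull out train/validation splits, assuming main_df holds one row per bounding box as in the snippet above:

# Select train/validation image_ids (and their boxes) for one fold.
fold_number = 0
val_ids = df_folds[df_folds['fold'] == fold_number].index
train_ids = df_folds[df_folds['fold'] != fold_number].index
train_df = main_df[main_df['image_id'].isin(train_ids)]
val_df = main_df[main_df['image_id'].isin(val_ids)]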
After the splitting, I checked to ensure it works, and it seems OK so far.
All the folds contain stratified samples, checked with len(df_folds[df_folds['fold'] == fold_number].index), and have no intersection with each other, checked with set(A).intersection(B), where A and B are the index values (image_id) of two folds. But the issue seems to be:
Fold 0 has total: 18 + 2 + 3 = 23 bbox
Fold 1 has total: 2 + 11 = 13 bbox
Fold 2 has total: 5 + 3 = 8 bbox
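To make this imbalance concrete, a small check along these lines (a sketch, reusing df_folds and main_df from above) can print the per-fold image and bbox totals together with the class distribution:

# Per-fold sanity check: image count, bbox count, and class distribution.
for fold_number in sorted(df_folds['fold'].unique()):
    fold_ids = df_folds[df_folds['fold'] == fold_number].index
    fold_boxes = main_df[main_df['image_id'].isin(fold_ids)]
    print(f"Fold {fold_number}: {len(fold_ids)} images, {len(fold_boxes)} bboxes")
    print(fold_boxes['object'].value_counts())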
Concern
However, I couldn't tell whether this is the proper way to do it for this type of task in general. I'd like some advice: is the above approach OK, does it have any issues, or is there a better approach? Any sort of suggestion would be appreciated. Thanks.
When creating a cross-validation split, we care about creating folds which have a good distribution of the various "cases" encountered in the data.
In your case, you decided to base your folds on the per-image class counts and the number of bounding boxes, which is a good but limited choice. So, if you can identify specific cases using your data/metadata, you might try to create smarter folds with it.
The most obvious choice is to balance object types (classes) in your folds, but you could go further.
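As an illustration, one possible way to bring class identity into the stratification key, in the spirit of the snippet from the question (a sketch only, not a prescribed method; the class_signature name is made up here, and rare class combinations may need to be merged so that every group has at least n_splits images, otherwise StratifiedKFold will complain):

from sklearn.model_selection import StratifiedKFold

# Per-image summary: bbox count plus a signature of the classes present.
df_folds = main_df.groupby('image_id').agg(
    bbox_count=('object', 'size'),
    class_signature=('object', lambda s: '_'.join(sorted(s.unique()))),
)
# Stratify on the classes present plus a binned bbox count.
df_folds['stratify_group'] = (
    df_folds['class_signature'] + '_' + (df_folds['bbox_count'] // 15).astype(str)
)

df_folds['fold'] = 0
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=101)
for fold_number, (_, val_index) in enumerate(
        skf.split(X=df_folds.index, y=df_folds['stratify_group'])):
    df_folds.loc[df_folds.iloc[val_index].index, 'fold'] = fold_number

For a more principled multi-label balancing, the MultilabelStratifiedKFold class from the iterative-stratification package is another option worth considering.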
Here is the main idea: let's say you have some images where cars are encountered mostly in France and others where cars are encountered mostly in the US; that could be used to build folds with a balanced number of French and US cars in each fold. The same could be done with weather conditions, etc. Thus, each fold will contain representative data to learn from, so your network won't be biased for your task. As a result, your model will be more robust to such potential real-life variations in the data.
So, can you add some metadata to your cross-validation strategy to create a better CV? If that's not possible, can you get information about potential corner cases using the x, y, w, h columns of your dataset?
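If no extra metadata exists, the box geometry itself can be binned into coarse buckets and appended to the stratification key. A rough sketch, assuming the dataset has x, y, w, h columns and the df_folds table built earlier (the names bbox_area, median_area, area_bin, etc. are made up for illustration):

import pandas as pd

# Derive simple geometric signals per image from the bbox columns.
main_df['bbox_area'] = main_df['w'] * main_df['h']
main_df['aspect_ratio'] = main_df['w'] / main_df['h']
geo = main_df.groupby('image_id').agg(
    median_area=('bbox_area', 'median'),
    max_aspect=('aspect_ratio', 'max'),
)
# Coarse terciles so each stratification group keeps enough images.
geo['area_bin'] = pd.qcut(geo['median_area'], q=3, labels=False, duplicates='drop')
geo['aspect_bin'] = pd.qcut(geo['max_aspect'], q=3, labels=False, duplicates='drop')

# Append the geometry bins to the existing stratification key (indexes align on image_id).
df_folds['stratify_group'] = (
    df_folds['stratify_group'] + '_'
    + geo['area_bin'].astype(str) + '_'
    + geo['aspect_bin'].astype(str)
)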
Then, you should try to have folds that are balanced in terms of sample count, so that your scores are evaluated on the same sample size; this will reduce variance and give a better evaluation in the end.