Problem description
I have a model with four fields. How do I remove duplicate objects from my database?
Daniel Roseman's answer to this question seems appropriate, but I'm not sure how to extend it to a situation where four fields have to be compared per object.
Thanks,
W.
Recommended answer
from django.db import models


def remove_duplicated_records(model, fields):
    """
    Removes records from `model` duplicated on `fields`
    while leaving the most recent one (biggest `id`).
    """
    duplicates = model.objects.values(*fields)

    # override any model specific ordering (for `.annotate()`)
    duplicates = duplicates.order_by()

    # group by same values of `fields`; count how many rows are the same
    duplicates = duplicates.annotate(
        max_id=models.Max("id"), count_id=models.Count("id")
    )

    # keep only the groups which are actually duplicated
    duplicates = duplicates.filter(count_id__gt=1)

    for duplicate in duplicates:
        to_delete = model.objects.filter(**{x: duplicate[x] for x in fields})

        # spare the latest duplicated record
        # (use `Min` above if you wish to spare the first record instead)
        to_delete = to_delete.exclude(id=duplicate["max_id"])

        to_delete.delete()
You shouldn't have to do this often. Use unique_together constraints on the database instead.
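To illustrate why a constraint is the better long-term fix, here is a minimal sketch of what a multi-column unique constraint does at the database level. It uses Python's built-in sqlite3 module instead of Django so that it runs on its own; the table name app_mymodel is borrowed from the SQL in this answer, and the four column names and the insert_row helper are made up for illustration. In Django, listing the four fields in unique_together (or in a UniqueConstraint, which the Django documentation recommends for new code) in the model's Meta should make migrations create an equivalent constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE app_mymodel (
        id INTEGER PRIMARY KEY,
        field_1 TEXT, field_2 TEXT, field_3 TEXT, field_4 TEXT,
        UNIQUE (field_1, field_2, field_3, field_4)
    )
    """
)


def insert_row(values):
    """Hypothetical helper: insert one row and return True, or return
    False if the unique constraint rejects it as a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO app_mymodel (field_1, field_2, field_3, field_4) "
                "VALUES (?, ?, ?, ?)",
                values,
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = insert_row(("a", "b", "c", "d"))
duplicate = insert_row(("a", "b", "c", "d"))  # same four values again
different = insert_row(("a", "b", "c", "x"))  # differs in one field
row_count = conn.execute("SELECT COUNT(*) FROM app_mymodel").fetchone()[0]
print(first, duplicate, different, row_count)
```

With the constraint in place the duplicate never reaches the table, so there is nothing to clean up afterwards.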
This leaves the record with the biggest id in the DB. If you want to keep the original record (the first one), modify the code a bit to use models.Min. You can also use a completely different field, such as a creation date.
Underlying SQL
When annotating, the Django ORM uses a GROUP BY statement on all model fields used in the query; hence the use of the .values() method. GROUP BY groups all records that have identical values in those fields. The duplicated groups (more than one id for the same values of fields) are then selected by the HAVING clause that .filter() generates on the annotated QuerySet.
SELECT
    field_1,
    …
    field_n,
    MAX(id) as max_id,
    COUNT(id) as count_id
FROM
    app_mymodel
GROUP BY
    field_1,
    …
    field_n
HAVING
    count_id > 1
The duplicated records are then deleted in the for loop, except for the most recent one (the one with the biggest id) in each group.
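The query and the clean-up loop can be tried outside Django. The following is only a sketch using the standard-library sqlite3 module; the two-column table and its sample rows are invented for illustration, and the query spells out COUNT(id) in HAVING rather than relying on the alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE app_mymodel (id INTEGER PRIMARY KEY, field_1 TEXT, field_2 TEXT)"
)
conn.executemany(
    "INSERT INTO app_mymodel (id, field_1, field_2) VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "a", "x"), (3, "b", "y"), (4, "a", "x"), (5, "b", "z")],
)

# One row per group of identical values, restricted to groups that
# contain more than one record.
groups = conn.execute(
    """
    SELECT field_1, field_2, MAX(id) AS max_id, COUNT(id) AS count_id
    FROM app_mymodel
    GROUP BY field_1, field_2
    HAVING COUNT(id) > 1
    """
).fetchall()

# Mirror of the for loop: delete every member of the group except max_id.
for field_1, field_2, max_id, _count in groups:
    conn.execute(
        "DELETE FROM app_mymodel WHERE field_1 = ? AND field_2 = ? AND id != ?",
        (field_1, field_2, max_id),
    )
conn.commit()

remaining = conn.execute(
    "SELECT id, field_1, field_2 FROM app_mymodel ORDER BY id"
).fetchall()
print(groups)
print(remaining)
```

Only the ("a", "x") group has more than one member, so rows 1 and 2 are removed and row 4, the one with the biggest id, survives.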
Empty .order_by()
Just to be sure, it's always wise to add an empty .order_by() call before aggregating a QuerySet.
The fields used for ordering the QuerySet are also included in the GROUP BY statement. An empty .order_by() overrides the columns declared in the model's Meta, so they are not included in the SQL query. Without it, a default ordering by date, for example, can ruin the results.
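To see why an extra column in GROUP BY matters, here is a small sketch with the standard-library sqlite3 module. The table, the created column, and the sample rows are all invented for illustration; the second query imitates what happens when a default ordering column ends up in GROUP BY. Whether the ORM actually adds it depends on the Django version (as far as I know, releases from 3.1 on no longer apply Meta.ordering to GROUP BY queries), so treat this as a demonstration of the failure mode rather than of any particular release.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE app_mymodel "
    "(id INTEGER PRIMARY KEY, field_1 TEXT, created TEXT)"
)
conn.executemany(
    "INSERT INTO app_mymodel (field_1, created) VALUES (?, ?)",
    [("a", "2020-01-01"), ("a", "2020-01-02"), ("b", "2020-01-03")],
)

HAVING_DUPLICATES = "HAVING COUNT(id) > 1"

# Grouping on the compared field only: the two "a" rows form one group.
clean = conn.execute(
    "SELECT field_1, COUNT(id) FROM app_mymodel "
    f"GROUP BY field_1 {HAVING_DUPLICATES}"
).fetchall()

# Ordering column leaked into GROUP BY: every row becomes its own group,
# so no group has more than one member and no duplicate is reported.
leaked = conn.execute(
    "SELECT field_1, COUNT(id) FROM app_mymodel "
    f"GROUP BY field_1, created {HAVING_DUPLICATES}"
).fetchall()

print(clean)
print(leaked)
```

The second query silently reports no duplicates at all, which is exactly the kind of failure that is easy to miss.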
You might not need to override it right now, but someone might add a default ordering later and thereby break your precious delete-duplicates code without even knowing it. Yes, I'm sure you have 100% test coverage…
Just add an empty .order_by() to be safe. ;-)
Transactions
Of course, you should consider doing all of this in a single transaction.
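Running the clean-up inside a single transaction means that a failure halfway through leaves the table untouched. As a rough illustration, the sketch below uses the standard-library sqlite3 module with an invented table and a deliberately raised error. In a Django project the counterpart would be wrapping the call to remove_duplicated_records in django.db.transaction.atomic().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_mymodel (id INTEGER PRIMARY KEY, field_1 TEXT)")
conn.executemany(
    "INSERT INTO app_mymodel (field_1) VALUES (?)", [("a",), ("a",), ("b",)]
)
conn.commit()


def count_rows():
    """Hypothetical helper: number of rows currently visible."""
    return conn.execute("SELECT COUNT(*) FROM app_mymodel").fetchone()[0]


before = count_rows()

try:
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute("DELETE FROM app_mymodel WHERE id = 1")
        deleted_inside = before - count_rows()
        raise RuntimeError("simulated failure halfway through the clean-up")
except RuntimeError:
    pass

after = count_rows()
print(before, deleted_inside, after)
```

The delete is visible inside the block but is undone by the rollback, so the row count afterwards matches the count before.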