本文介绍了MySQL重复行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一张桌子,上面有一些重复的信息:ID,姓名,姓氏,出生,PersonalKey,个人信息,方向,来源.

I have a table with some repeated information: Id, Name, Lastname, Birth, PersonalKey, Personal Info, Direction, Source.

消息来源告诉我信息的来源.

Where source tells me where the information came from.

重复的信息具有唯一的ID,我需要擦除重复的信息.但是,我优先于某些源信息,而我需要保留这些源信息,而其他信息则要删除.

The repeated information has unique id, and I need to erase the duped information.But, I have priority over some Source information that i need to be the one that stays and the other erased.

另一件事是,另一个源信息包含一些我想保留的信息,因此我需要将PersonalKey重新填充到要保留的信息中,并删除重复的信息.

Other thing is that another Source information have some information that the one that i want to stay doesnt have so i need to refill the PersonalKey to the one thats going to stay and erase the repeated ones.

名为Pruebas的表

Table named Pruebas

---Id, Name, Firstname, Lastname, Birth, RFC, Source, PersonalKey---
---2,Juan,Garcia,Escobeddo,1983-08-04,GAED87393, DRV484930, 34233--
---3,Juan,Garcia,Escobedo,1987-08-04,GAED87393, FIN484930, --
---4,Juan,Garcia,Escobedo,1987-08-04,GAED87393, SA484930, --

如您所见:

  • ID是唯一的
  • 姓名,名字和姓氏重复
  • id 2具有PersonalKey值,但3和4没有
    • 我希望保留带有'FIN%'源的那个,而另一个删除,但是首先我需要确保剩下的行获得PersonalKey值(IOW,我不想丢失PersonalKey值).
    • The IDs are unique
    • The name, firstname and lastname are repeated
    • The id 2 has a PersonalKey value, but 3 and 4 don't
      • I want the one with the 'FIN%' source to stay and the other ones erased, but first I need to make sure the row that remains gets the PersonalKey value (IOW, I don't want to lose the PersonalKey value).

      谢谢.

      推荐答案

      我能想到的最直接的解决方案是将PersonalKey复制到其他重复的行,然后删除所有与'FIN%'不匹配的行

      The most straightforward solution I can think of is to copy the PersonalKey to other duplicate rows, and then delete all rows that don't match 'FIN%'.

      UPDATE Pruebas p1 JOIN Pruebas p2
       ON (SOUNDEX(CONCAT(p1.Name, p2.Firstname, p3.Lastname))
         = SOUNDEX(CONCAT(p2.Name, p2.Firstname, p2.Lastname)))
      SET p1.PersonalKey = p2.PersonalKey
      WHERE p2.PersonalKey IS NOT NULL;
      
      DELETE FROM Pruebas WHERE Source NOT LIKE 'FIN%';
      

      我正在使用 SOUNDEX() .

      I'm showing an approximate match expression for the join, using SOUNDEX().

      从您留下的其他评论中可以看出,您有很多变化和不确定性.在这种情况下,无法自动执行清除和重复数据删除-否则,至少自动清除会比仅手动执行更为复杂且更难解决.

      I see from other comments you have left that you have a lot of variation and uncertainty. In this case, there's no way to automate the cleanup and de-duplication -- or at least automatic cleanup will be more complex and harder to get right than just doing it manually.

      关于您的查询需要花费多个小时的评论:是的,实际上并不期望它有效. JOIN表达式不是 sargable -也就是说,它不能利用索引.您可以通过添加一个额外的列来物理存储名称,名字,姓氏的SOUNDEX()值,从而提高效率.然后在该列上创建索引.

      Re your comment that the query takes many hours: yes, it's really not expected to be efficient. The JOIN expression is not sargable -- that is, it cannot take advantage of an index. You could make it more efficient by adding an extra column to physically store the SOUNDEX() value of the name,firstname,lastname. Then create an index on that column.

      但是SOUNDEX()不能保证找到所有可能的拼写错误.您面临着无法完全自动化的数据清除任务. 任何数据清理解决方案都需要手动操作.

      But SOUNDEX() isn't guaranteed to find all possible misspellings anyway. You are facing a data cleanup task that cannot be automated fully. Any solution to data cleanup requires manual work.

      这篇关于MySQL重复行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-13 12:18