python - Hash table collision in a Python set?

How can a Python set contain 2 identical elements? Is this a Python bug?

type(data_chunks)
<class 'set'>

len(data_chunks)
43130

same = [x for x in data_chunks if x.md5==chunk.md5]
[<Model.Chunk.Chunk o...x0DB40870>, <Model.Chunk.Chunk o...x0DB40870>]

len(same)
2

same[0] is same[1]
True

same[0] == same[1]
True

len(set(same))
1

But when I build a dictionary from it, the duplicate is removed!

len({k:k.product_id for k in data_chunks})
43129

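Why does it work for a dict but not for a set? I thought this was a collision in the hash table, but the duplicated entry is actually the very same object, so apparently the set lookup did not find it when it was added a second time (?)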

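Additional info:

Chunk defines both __hash__ and __eq__ methods

python 3.7.2

I have realized that Chunk has some properties that would raise errors - it should not matter, since they are not called

The line of code is: data_chunks = data_chunks | another_set

The interactive prompt above is from a debug session in vscode

It happens only sometimes when the code is run

But during this debug session, the length of a new set created from data_chunks is always the same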

Edit

Chunk implementation

class Chunk(object):
    def __init__(self,
                 md5,
                 size=None,
                 compressedMd5=None,
                 # ... (more elements)
                 product_id=None):
        self.md5 = md5
        self.product_id = product_id
        # (etc.)

    def __eq__(self, other):
        if self.compressedMd5:
            return self.compressedMd5 == other.compressedMd5 and self.product_id == other.product_id
        return self.md5 == other.md5 and self.product_id == other.product_id

    def __hash__(self):
        return self.name.__hash__()

    @property
    def name(self):
        return self.compressedMd5 if self.compressedMd5 is not None else self.md5

==================

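Edit
OK, here is the code:
repository - a json descriptor
chunking_strategy = ... - mostly a class that stores settings, e.g. whether chunks will be compressed.
result_handler = Strategy.DefaultResultHandler(repository) - generates unique hashes of the chunk objects in the repository: the chunks and the corresponding file mappings. Afterwards it calls the compression jobs, which then set compressedMd5 and other attributes of the existing chunks.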

generation_strategy = Strategy.CachingGenerationStrategy(
            result_handler,
            Settings().extra_io_threads,
        )

data_chunks = Strategy.DepotChunker(repository, chunking_strategy, generation_strategy)()
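On DepotChunker init: the todo chunking jobs are prepared according to the chunking_strategy settings. Then the generation_strategy.__call__ method processes all jobs: it cuts the files into pieces according to the previously defined Chunk objects. This is done in a multiprocessing.Pool. After a physical chunk is created, its md5 is checked and the chunk object is updated with compressedMd5, compressedSize and product_id.
Then (only after the Chunk object has been changed) the chunk object is added to the set.
This set is returned from DepotChunker.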

Then the compressed chunks are saved in a cache.

Then all data_chunks are searched for small objects, from which physical chunks consisting of merged small files are created (in a memory buffer). Let's call them smallFilesChunks. They are added to data_chunks:
sfChunk = Chunk(
    sfCompressedContentMD5,  # yes I see that this is compressed md5 - it was intended for some reason I don't know
    size=sfSize,
    compressedMd5=sfCompressedContentMD5,
    compressedSize=sfCompressedSize,
    product_id=productId
)
if sfChunk not in data_chunks:  # purely a sanity check
    data_chunks.add(sfChunk)

Finally, the meta files are created and dumped; they are chunked as well and added to data_chunks:
for depot in manifest_depots:
    data_chunks = data_chunks | simpleChunker(depot)

This is the point at which the debugger session shown at the beginning was recorded.

Best answer

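One problem is that __eq__ is not commutative for a pair of objects where one has a compressedMd5 and the other does not (i.e. its compressedMd5 is set to None). This means it is possible to construct two objects a and b such that a == b and b != a at the same time.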
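The asymmetry can be reproduced with a cut-down version of the class. This is only a sketch: it assumes that __init__ stores compressedMd5 (the question elides that part), and the md5 strings and product_id are made up.

```python
class Chunk(object):
    """Cut-down Chunk; assumes __init__ stores compressedMd5 (elided in the question)."""

    def __init__(self, md5, compressedMd5=None, product_id=None):
        self.md5 = md5
        self.compressedMd5 = compressedMd5
        self.product_id = product_id

    def __eq__(self, other):
        # Same logic as in the question: which field is compared depends on self only.
        if self.compressedMd5:
            return self.compressedMd5 == other.compressedMd5 and self.product_id == other.product_id
        return self.md5 == other.md5 and self.product_id == other.product_id

    def __hash__(self):
        return hash(self.name)

    @property
    def name(self):
        return self.compressedMd5 if self.compressedMd5 is not None else self.md5


a = Chunk('aaa', product_id=1)                       # not compressed yet
b = Chunk('aaa', compressedMd5='bbb', product_id=1)  # same md5, already compressed

a_eq_b = (a == b)  # a has no compressedMd5, so the md5 values are compared
b_eq_a = (b == a)  # b has a compressedMd5, so 'bbb' is compared with None
print(a_eq_b, b_eq_a)  # True False
```

Which of the two results a set lookup sees depends on which object ends up on the left-hand side of the comparison.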

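A related problem is that __eq__ and __hash__ are not consistent with each other in similar circumstances (__eq__ refuses to look at other.compressedMd5 if self.compressedMd5 is None, whereas __hash__ is computed from each object's own name). Two objects can therefore compare equal and still have different hashes.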
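A small sketch of that inconsistency, again assuming that __init__ stores compressedMd5 and using made-up values. The two objects compare equal in one direction, yet a set keeps both, because their hashes differ and they are never compared at all.

```python
class Chunk(object):
    """Cut-down Chunk; assumes __init__ stores compressedMd5 (elided in the question)."""

    def __init__(self, md5, compressedMd5=None, product_id=None):
        self.md5 = md5
        self.compressedMd5 = compressedMd5
        self.product_id = product_id

    def __eq__(self, other):
        if self.compressedMd5:
            return self.compressedMd5 == other.compressedMd5 and self.product_id == other.product_id
        return self.md5 == other.md5 and self.product_id == other.product_id

    def __hash__(self):
        return hash(self.name)

    @property
    def name(self):
        return self.compressedMd5 if self.compressedMd5 is not None else self.md5


a = Chunk('aaa', product_id=1)                       # hashes like 'aaa'
b = Chunk('aaa', compressedMd5='bbb', product_id=1)  # hashes like 'bbb'

equal = (a == b)                     # True: a compares md5 values
same_hash = (hash(a) == hash(b))     # False: different strings are hashed
both = {a, b}                        # two entries although a == b
print(equal, same_hash, len(both))   # True False 2
```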

Mutability can also be a problem, as the following example shows:
class Chunk(object):
    def __init__(self, md5):
        self.md5 = md5

    def __hash__(self):
        return hash(self.md5)

s = set()
chunk = Chunk('42')
s.add(chunk)
chunk.md5 = '123'
s.add(chunk)
print(s)

On my computer this produces set([<__main__.Chunk object at 0x106d03390>, <__main__.Chunk object at 0x106d03390>]), i.e. the same object appears in the set twice.
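The same example also suggests why the dict in the question came out one element shorter than the set. The sketch below relies on CPython behavior as I understand it: a set stores the hash each element had when it was inserted, while building a dict or a set from a list calls hash() again on every element.

```python
class Chunk(object):
    def __init__(self, md5):
        self.md5 = md5

    def __hash__(self):
        return hash(self.md5)


s = set()
chunk = Chunk('42')
s.add(chunk)        # stored under hash('42')
chunk.md5 = '123'   # hash(chunk) changes, but the set still remembers the old hash
s.add(chunk)        # looked up under hash('123'), not found, stored a second time

entries = list(s)
same_object = entries[0] is entries[1]

# Going through a list or a dict hashes every element again with its
# current value, so both entries now hash alike and collapse into one.
rebuilt_set = set(entries)
as_dict = {k: k.md5 for k in s}

print(len(s), same_object, len(rebuilt_set), len(as_dict))  # 2 True 1 1
```

Building a set directly from another set appears to reuse the stored hashes in CPython, which would match the observation that a new set created from data_chunks keeps the same length. That is an implementation detail and should not be relied on.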

If you change md5, or set/unset/change compressedMd5, after an object has been added to a set, something similar may be happening in your code.
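One possible way to avoid this, shown only as a sketch and not as the asker's actual code: derive __eq__ and __hash__ from the same key, and take an object out of the set before changing anything that key depends on. The _key method and the update_in_set helper are hypothetical names introduced for this illustration.

```python
class Chunk(object):
    """Sketch only: equality and hash are both derived from the same key."""

    def __init__(self, md5, compressedMd5=None, product_id=None):
        self.md5 = md5
        self.compressedMd5 = compressedMd5
        self.product_id = product_id

    @property
    def name(self):
        return self.compressedMd5 if self.compressedMd5 is not None else self.md5

    def _key(self):
        # Hypothetical helper: the single source of truth for == and hash().
        return (self.name, self.product_id)

    def __eq__(self, other):
        if not isinstance(other, Chunk):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


def update_in_set(chunks, chunk, **changes):
    """Hypothetical helper: change hash-relevant attributes of a chunk that may be in a set."""
    was_member = chunk in chunks
    chunks.discard(chunk)              # remove while the old hash is still valid
    for attr, value in changes.items():
        setattr(chunk, attr, value)
    if was_member:
        chunks.add(chunk)              # re-insert under the new hash


data_chunks = set()
chunk = Chunk('aaa', product_id=1)
data_chunks.add(chunk)
update_in_set(data_chunks, chunk, compressedMd5='bbb')
data_chunks.add(chunk)  # found this time, so no second entry
print(len(data_chunks), chunk in data_chunks)  # 1 True
```

With this key, an uncompressed chunk and its compressed counterpart compare unequal in both directions. Whether that is the desired behavior depends on the application.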
Regarding python - Hash table collision in a Python set?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57162495/