Comparing two large text arrays in PowerShell


Question

I have two arrays that I would like to take the difference between. I had some success with Compare-Object, but it is too slow for larger arrays. In this example, $ALLVALUES and $ODD are my two arrays.

I used to be able to do this efficiently using FINDSTR, for example:

FINDSTR /V /G:ODD.txt ALLVALUES.txt > EVEN.txt

FINDSTR finished this in under 2 seconds for 110,000 elements, even though it had to read from and write to disk.

I'm trying to get back to the FINDSTR performance, where it would give me everything in ALLVALUES.txt that did NOT match ODD.txt (giving me the EVEN values in this case).

NOTE: This question is not about ODD or EVEN; they are only a practical example whose result can be verified quickly and visually.

Here is the code that I have been playing with. Using Compare-Object, 100,000 elements took about 200 seconds, versus 2 seconds for FINDSTR on my computer. I'm thinking there is a much more elegant way to do this in PowerShell. Thanks for your help.

# -------  Build the MAIN array
$MIN = 1
$MAX = 100000
$PREFIX = "AA"

$ALLVALUES = while ($MIN -le $MAX) 
{
   "$PREFIX{0:D6}" -f $MIN++
}


# -------  Build the ODD values from the MAIN array
$MIN = 1
$MAX = 100000
$PREFIX = "AA"

$ODD = while ($MIN -le $MAX)
{
   if ($MIN % 2) {
      "$PREFIX{0:D6}" -f $MIN++
   }
   else {
      $MIN++
   }
}

Measure-Command { $EVEN = Compare-Object -DifferenceObject $ODD -ReferenceObject $ALLVALUES -PassThru }

Recommended answer

The arrays are objects, not just simple blobs of text like the ones findstr processes.
The fastest diff of string arrays is HashSet.SymmetricExceptWith, available in .NET 3.5+.

$diff = [Collections.Generic.HashSet[string]]$a
$diff.SymmetricExceptWith([Collections.Generic.HashSet[string]]$b)
$diffArray = [string[]]$diff

46 ms for 100k elements on an i7 CPU using your data.
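
SymmetricExceptWith returns the items that appear in either array but not in both. When the second array is a subset of the first, as in the question, that is the same as a one-way difference. If only "everything in the first array that is not in the second" is wanted (the FINDSTR /V behavior described in the question), HashSet.ExceptWith may be a closer fit. A minimal sketch, using made-up sample values in place of $ALLVALUES and $ODD:

```powershell
# Small stand-ins for $ALLVALUES and $ODD (sample values are illustrative).
$all = [string[]]('AA000001', 'AA000002', 'AA000003', 'AA000004')
$odd = [string[]]('AA000001', 'AA000003')

# Copy the first array into a set, then remove everything found in the second.
$even = [Collections.Generic.HashSet[string]]$all
$even.ExceptWith($odd)

# HashSet does not guarantee enumeration order, so sort for stable output.
$evenArray = [string[]]($even | Sort-Object)
$evenArray
```

Like the symmetric version, this drops duplicate values, because the data passes through a set.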

The above code omits duplicate values, so if those are needed in the output, I think we'll have to use a much, much slower manual enumeration.

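function Diff-Array($a, $b, [switch]$unique) {
    if ($unique.IsPresent) {
        # Fast path: set-based symmetric difference; duplicates are dropped.
        $diff = [Collections.Generic.HashSet[string]]$a
        $diff.SymmetricExceptWith([Collections.Generic.HashSet[string]]$b)
        return [string[]]$diff
    }
    # Slow path: count occurrences so that duplicates survive.
    # Each item in $a adds 1 to its counter, each item in $b subtracts 1.
    $occurrences = @{}
    foreach ($_ in $a) { $occurrences[$_]++ }
    foreach ($_ in $b) { $occurrences[$_]-- }
    # Emit each key once per unit of difference between the two counts.
    foreach ($_ in $occurrences.GetEnumerator()) {
        $cnt = [Math]::Abs($_.value)
        while ($cnt--) { $_.key }
    }
}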

Usage:

$diffArray = Diff-Array $ALLVALUES $ODD

340 ms: 8x slower than the HashSet version, but 110x faster than Compare-Object!
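
To make the duplicate handling concrete, here is a small worked trace of the counting idea on toy input. The arrays and variable names are made up for illustration, and the loop is an inline restatement of the technique rather than a replacement for the function above:

```powershell
# Toy input: 'x' appears twice in $a, 'z' appears only in $b.
$a = [string[]]('x', 'x', 'y')
$b = [string[]]('y', 'z')

# A set collapses the duplicate 'x', which is why the HashSet path loses it.
$setCount = ([Collections.Generic.HashSet[string]]$a).Count

# Counting keeps it: +1 per item in $a, -1 per item in $b.
$occurrences = @{}
foreach ($item in $a) { $occurrences[$item]++ }
foreach ($item in $b) { $occurrences[$item]-- }

# Counters are now x = 2, y = 0, z = -1; emit each key |count| times.
$result = foreach ($entry in $occurrences.GetEnumerator()) {
    $n = [Math]::Abs($entry.Value)
    for ($i = 0; $i -lt $n; $i++) { $entry.Key }
}

# Hashtable enumeration order is not guaranteed, so sort before displaying.
$sorted = @($result | Sort-Object)
$sorted
```

Items whose counter ends at zero (here 'y') are present equally often in both arrays and therefore produce no output.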

And lastly, we can make a faster Compare-Object for arrays of strings/numbers:

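function Compare-StringArray($a, $b, [switch]$unsorted) {
    # A hashtable is faster; a SortedDictionary yields output ordered by key.
    $occurrences = if ($unsorted.IsPresent) { @{} }
                   else { [Collections.Generic.SortedDictionary[string,int]]::new() }
    # +1 for each item in $a, -1 for each item in $b.
    foreach ($_ in $a) { $occurrences[$_]++ }
    foreach ($_ in $b) { $occurrences[$_]-- }
    foreach ($_ in $occurrences.GetEnumerator()) {
        $cnt = $_.value
        # A zero count means both arrays contain the item equally often.
        if ($cnt) {
            # Mimic Compare-Object output: '<=' only in $a, '=>' only in $b.
            $diff = [PSCustomObject]@{
                InputObject = $_.key
                SideIndicator = if ($cnt -lt 0) { '=>' } else { '<=' }
            }
            # Emit the object once per surplus occurrence.
            $cnt = [Math]::Abs($cnt)
            while ($cnt--) {
                $diff
            }
        }
    }
}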

100k elements: 20-28x faster than Compare-Object, completes in 2100 ms (sorted) / 1460 ms (unsorted)
10k elements: 2-3x faster, completes in 210 ms (sorted) / 162 ms (unsorted)
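
Timings like these depend heavily on hardware and PowerShell version, so it can be worth measuring locally. The following is a rough sketch of how the HashSet approach could be timed against the question's test data. It uses a Stopwatch instead of Measure-Command purely as an illustrative choice, and the variable names are assumptions:

```powershell
# Rebuild the question's data: AA000001..AA100000, plus the odd-numbered subset.
$all = [string[]](1..100000 | ForEach-Object { 'AA{0:D6}' -f $_ })
$odd = [string[]](1..100000 | Where-Object { $_ % 2 } | ForEach-Object { 'AA{0:D6}' -f $_ })

# Time only the diff itself, not the data generation.
$sw = [Diagnostics.Stopwatch]::StartNew()
$diff = [Collections.Generic.HashSet[string]]$all
$diff.SymmetricExceptWith([Collections.Generic.HashSet[string]]$odd)
$even = [string[]]$diff
$sw.Stop()

"{0} items in {1} ms" -f $even.Count, $sw.ElapsedMilliseconds
```

The same pattern can wrap Compare-Object or either of the functions above to compare them on equal terms.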
