我正在尝试使用 Haskell 查找文件中字符的频率。我希望能够处理〜500MB大小的文件。

到现在为止我一直在尝试

  • 它可以完成工作,但是由于将文件解析256次而有点慢
    calculateFrequency :: L.ByteString -> [(Word8, Int64)]
    calculateFrequency f = foldl (\acc x -> (x, L.count x f):acc) [] [255, 254.. 0]
    
  • 我也尝试过使用Data.Map,但是程序内存不足(在ghc解释器中)。
    import qualified Data.ByteString.Lazy as L
    import qualified Data.Map as M
    
    calculateFrequency' :: L.ByteString -> [(Word8, Int64)]
    calculateFrequency' xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) (M.empty) xs
    
  • 最佳答案

    @Alex的答案很好,但是只有256个值(索引),数组应该更好

    import qualified Data.ByteString.Lazy as L
    import qualified Data.Array.Unboxed as A
    import qualified Data.ByteString as B
    import Data.Int
    import Data.Word
    
    fq :: L.ByteString -> A.UArray Word8 Int64
    fq = A.accumArray (+) 0 (0, 255) . map (\c -> (c, 1)) . concat . map B.unpack . L.toChunks
    
    main = L.getContents >>= print . fq
    

    @alex代码占用(对于我的示例文件)24.81段,使用数组占用7.77段。

    更新:

    尽管Snoyman解决方案更好,但可以避免unpack的改进
    fq :: L.ByteString -> A.UArray Word8 Int64
    fq = A.accumArray (+) 0 (0, 255) . toCounterC . L.toChunks
         where toCounterC [] = []
               toCounterC (x:xs) = toCounter x (B.length x) xs
               toCounter  _ 0 xs = toCounterC xs
               toCounter  x i xs = (B.index x i', 1): toCounter x i' xs
                                   where i' = i - 1
    

    加速约50%。

    更新:

    IOVector用作Snoyman就是Conduit版本(确实快一点,但这是原始代码,最好使用Conduit)
    import           Data.Int
    import           Data.Word
    import           Control.Monad.IO.Class
    import qualified Data.ByteString.Lazy          as L
    import qualified Data.Array.Unboxed            as A
    import qualified Data.ByteString               as B
    import qualified Data.Vector.Unboxed.Mutable   as V
    
    fq :: L.ByteString -> IO (V.IOVector Int64)
    fq xs =
         do
           v <- V.replicate 256 0 :: IO (V.IOVector Int64)
           g v $ L.toChunks xs
           return v
         where g v = toCounterC
                     where toCounterC [] = return ()
                           toCounterC (x:xs) = toCounter x (B.length x) xs
                           toCounter  _ 0 xs = toCounterC xs
                           toCounter  x i xs = do
                                                 let i' = i - 1
                                                     w  = fromIntegral $ B.index x i'
                                                 c <- V.read v w
                                                 V.write v w (c + 1)
                                                 toCounter x i' xs
    
    main = do
              v <- L.getContents >>= fq
              mapM_ (\i -> V.read v i >>= liftIO . putStr . (++", ") . show) [0..255]
    

    关于haskell - 字符频率,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/21132026/

    10-13 09:09
    查看更多