本文介绍了如何将大型F#记录数组保存到文件?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

限时删除!!

我想将大的f#记录数组(> 10,000,000个元素)保存到磁盘,以便稍后将数组重新加载到内存中很容易.我使用Visual F#2010中的以下简单函数进行技术计算:

I want to save a large f# array of records (> 10,000,000 elements) to disk so that it is easy to reload the array into memory later. I used the following simple function from Visual F# 2010 for technical computing:

let save filename x =
    use stream = new FileStream(filename, FileMode.Create)
    BinaryFormatter().Serialize(stream, x)

type Test = { a : int; b : int}

let x = [| for i in 1..6 do
            let a=i
            let b=i*i
            yield {a=a;b=b}|]

save "file.dat" x

当我这样做(使用真实数据)时,我得到了错误:

When I do this (with the real data) I get the error:

System.Runtime.Serialization.SerializationException: The internal array cannot expand to greater than Int32.MaxValue elements.

现在,我的解决方案是转换为Deedle,然后另存为csv,但我认为存在一种用于保存/重新加载的计算效率更高的选项,不需要从csv重建数组.

Right now, my solution is to convert to Deedle and then save as a csv, but I presume that there is a more computationally efficient option for saving/reloading that does not require rebuilding the array from csv.

let x2 = x |> Frame.ofRecords
x2.SaveCsv("file.csv")

推荐答案

将10,000,000行写入文本文件不是问题.这是一个简单的演示:

Writing 10,000,000 lines to a text file isn't a problem. Here's a simple demo:

> let lines = Seq.initInfinite (fun i -> sprintf "%i, %i, -%i" i (i * 2) i);;

val lines : seq<string>

> open System.IO;;
> #time;;

--> Timing now on

> File.WriteAllLines(@"test.csv", lines |> Seq.take 10000000);;
Real: 00:00:20.420, CPU: 00:00:20.343, GC gen0: 3528, gen1: 3, gen2: 1
val it : unit = ()

如您所见,仅需20秒.

As you can see, that takes a mere 20 seconds.

读回行也不错:

> let roundTripped = File.ReadLines @"test.csv";;
Real: 00:00:00.000, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0

val roundTripped : System.Collections.Generic.IEnumerable<string>

如您所见,这是瞬时发生的,因为roundTripped是作为延迟计算的序列加载的.

As you can see, this happens instantaneously, because roundTripped is loaded as a lazily evaluated sequence.

仍然可以枚举值:

> roundTripped |> Seq.iter (printfn "%s")

(为清晰起见,打印输出被截断;实际上有1000万行.)

(printout truncated for clarity; there are literally 10 million lines.)

...
9999997, 19999994, -9999997
9999998, 19999996, -9999998
9999999, 19999998, -9999999
Real: 00:03:43.995, CPU: 00:01:15.390, GC gen0: 594, gen1: 23, gen2: 3
val it : unit = ()

这花费了更长的时间,但是我怀疑这主要是因为在控制台上打印往往会花费一些时间.

This takes a lot longer, but I suspect it's mainly because printing to the console tends to take time.

这些实验是在我3岁的联想X1 Carbon(一种相当主流的硬件)上进行的.

These experiments were made on my 3-year old Lenovo X1 Carbon - a fairly mainstream piece of hardware.

因此,编写或读取数百万个文本行没有问题,但是请注意,我避免使用数组,而倾向于延迟计算的序列.

Thus, there's no problem with writing or reading millions of text lines, but do notice that I've avoided arrays in favour of lazily evaluated sequences.

使用记录不会改变以上结论.我不敢在.NET序列化上设计任何种类的持久性解决方案(由于潜在的版本问题),因此我仍然会为此目的转换成其他格式.

Using records doesn't change the above conclusions. I wouldn't dare to design any sort of long-lasting persistence solution on .NET serialization (due to potential versioning issues), so I'd still convert to some other format for that purpose.

要坚持使用CSV:

type Test = { A : int; B : int }

let records = Seq.initInfinite (fun i -> { A = i; B = -i })
let csvs = records |> Seq.map (fun x -> sprintf "%i, %i" x.A x.B)

记录的写入和读取时间与上述报告大致相同.

Records can be written and read in approximately the same times as reported above.

这篇关于如何将大型F#记录数组保存到文件?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

1403页,肝出来的..

09-07 02:11