

I have a string that is ~10 GB in size (huge RAM usage, of course). The thing is, I need to perform string operations like gsub and split on it. I noticed that Ruby will just "stop working" at some point (without raising any errors, though).



# I will try to split the string using .split,
# but Ruby will instead just return an array with
# the full, unsplit string itself...

# let's break this down:
# each of those attempts doesn't cause problems and
# returns arrays with thousands or even millions of items (lines)

# starting from here, problems will occur

I'm using Ruby MRI 1.8.7. What is wrong here? Why is Ruby not able to perform string operations on huge strings? And what is a solution here?


The only solution I came up with is to "loop" through the string using [0..9], [10..19], ... and to perform the string operations part by part. However, this seems unreliable: for example, what if my split delimiter is very long and falls between two "parts"?
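The "delimiter falls between two parts" problem can be handled by keeping the unfinished tail of each chunk and prepending it to the next one. Here is a minimal sketch of that idea, assuming the data can be read from an IO object (a file or socket) rather than held as one string, and assuming a modern Ruby (2.0+) rather than 1.8.7. The name each_record and the chunk size are made up for illustration.

```ruby
require 'stringio'

# Hypothetical helper: yields every record in +io+, where records are
# separated by +delimiter+. Only one chunk plus any unfinished tail is
# held in memory at a time.
def each_record(io, delimiter, chunk_size = 1 << 20)
  delimiter = delimiter.b   # compare as raw bytes
  buffer = String.new       # holds data not yet emitted
  while (chunk = io.read(chunk_size))
    buffer << chunk
    # Emit every complete record. Whatever follows the last delimiter
    # (possibly half a delimiter) stays in the buffer for the next chunk.
    while (idx = buffer.index(delimiter))
      yield buffer.slice!(0, idx + delimiter.size)[0, idx]
    end
  end
  yield buffer unless buffer.empty?
end

# Usage: a chunk size of 4 forces the 7-byte delimiter to span chunks.
io = StringIO.new("a--SEP--bb--SEP--ccc")
records = []
each_record(io, "--SEP--", 4) { |record| records << record }
```

Because the tail is carried over, a delimiter that straddles two reads is still found once the second half arrives.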

Another solution that actually works fine is to iterate over the string with something like str.each_line {..}. However, this only covers splitting on newline delimiters.
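For what it's worth, each_line also accepts a record separator argument, so it is not limited to newlines. A small sketch with a made-up '<END>' delimiter and sample data (I believe the separator argument is available on 1.8.7 as well, but I have only checked this on current Ruby):

```ruby
# '<END>' and the sample data are placeholders for illustration.
data = "alpha<END>beta<END>gamma"

records = []
data.each_line('<END>') do |record|
  # Each yielded piece still ends with the separator, so strip it.
  records << record.chomp('<END>')
end
```

Each piece is handed to the block as it is found, so no array holding every piece has to be built unless you collect them yourself as this example does.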

Thanks for all those answers. In my case, the "HUGE 10 GB STRING" is actually a download from the internet. It contains data that is delimited by a specific sequence (in most cases a simple newline). In my scenario I compare EACH ELEMENT of the 10 GB file to another (smaller) data set that I already have in my script. I appreciate all suggestions.
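For that kind of comparison, one option is to save the download to disk and stream it, checking each element against the smaller data set held in a Set. This is only a sketch: the sample data and names are invented, a temporary file stands in for the download, and it assumes Ruby 2.1+ for Tempfile.create.

```ruby
require 'set'
require 'tempfile'

# Hypothetical smaller data set; a Set gives fast membership checks.
known = Set.new(%w[apple cherry])

matches = []
Tempfile.create('download') do |f|
  # Stand-in for the downloaded data, one element per line.
  f.write("apple\nbanana\ncherry\n")
  f.flush

  # File.foreach reads one line at a time instead of the whole file.
  File.foreach(f.path) do |line|
    item = line.chomp
    matches << item if known.include?(item)
  end
end
```

File.foreach also takes a separator as its second argument if the elements are not newline-delimited.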



Here's a benchmark against a real-life log file. Of the methods used to read the file, only the one using foreach is scalable because it avoids slurping the file.


Using lazy adds overhead, resulting in slower times than map alone.

Notice that foreach is right in there as far as processing speed goes, and results in a scalable solution. Ruby won't care if the file is a zillion lines or a zillion TB, it's still only seeing a single line at a time. See "Why is "slurping" a file not a good practice?" for some related information about reading files.


People often gravitate to using something that pulls in an entire file at once, then splitting it into parts. That ignores the job Ruby then has to do to rebuild the array based on line ends using split or something similar. That adds up, and is why I think foreach pulls ahead.

Also notice that the results shift a little between the two benchmark runs. This is probably due to system tasks running on my Mac Pro while the jobs are running. The important thing is that it shows the difference is a wash, confirming to me that using foreach is the right way to process big files, because it's not going to kill the machine if the input file exceeds available memory.

require 'benchmark'

REGEX = /\bfoo\z/
LOG = 'debug.log'
N = 1

# readlines reads the entire file into an array of lines, and lazy.map then
# wraps that array in a lazy enumerator. Because the whole file is in memory
# before any line is processed, this isn't scalable. It will work if Ruby has
# enough memory to hold the array plus all other variables and its overhead.
def lazy_map(filename)
  File.open("lazy_map.out", 'w') do |fo|
    fo.puts File.readlines(filename).lazy.map { |li|
      li.gsub(REGEX, 'bar')
    }.force
  end
end

# readlines reads the entire file into an array of lines, and map builds a
# second array holding the results. Because both arrays are in memory at
# once, this isn't scalable. It will work if Ruby has enough memory to hold
# the arrays plus all other variables and its overhead.
def map(filename)
  File.open("map.out", 'w') do |fo|
    fo.puts File.readlines(filename).map { |li|
      li.gsub(REGEX, 'bar')
    }
  end
end

# "Reads the entire file specified by name as individual lines, and returns
# those lines in an array."
# As a result of returning all the lines in an array this isn't scalable. It
# will work if Ruby has enough memory to hold the array plus all other
# variables and its overhead.
def readlines(filename)
  File.open("readlines.out", 'w') do |fo|
    File.readlines(filename).each do |li|
      fo.puts li.gsub(REGEX, 'bar')
    end
  end
end

# This is completely scalable because no file slurping is involved.
# "Executes the block for every line in the named I/O port..."
# It's slower, but it works reliably.
def foreach(filename)
  File.open("foreach.out", 'w') do |fo|
    File.foreach(filename) do |li|
      fo.puts li.gsub(REGEX, 'bar')
    end
  end
end

puts "Ruby version: #{ RUBY_VERSION }"
puts "log bytes: #{ File.size(LOG) }"
puts "log lines: #{ `wc -l #{ LOG }`.to_i }"

2.times do
  Benchmark.bm(13) do |b|
    b.report('lazy_map')  { lazy_map(LOG)  }
    b.report('map')       { map(LOG)       }
    b.report('readlines') { readlines(LOG) }
    b.report('foreach')   { foreach(LOG)   }
  end
end

%w[lazy_map map readlines foreach].each do |s|
  puts `wc #{ s }.out`
end


Ruby version: 2.0.0
log bytes: 733978797
log lines: 5540058
                    user     system      total        real
lazy_map       35.010000   4.120000  39.130000 ( 43.688429)
map            29.510000   7.440000  36.950000 ( 43.544893)
readlines      28.750000   9.860000  38.610000 ( 43.578684)
foreach        25.380000   4.120000  29.500000 ( 35.414149)
                    user     system      total        real
lazy_map       32.350000   9.000000  41.350000 ( 51.567903)
map            24.740000   3.410000  28.150000 ( 32.540841)
readlines      24.490000   7.330000  31.820000 ( 37.873325)
foreach        26.460000   2.540000  29.000000 ( 33.599926)
5540058 83892946 733978797 lazy_map.out
5540058 83892946 733978797 map.out
5540058 83892946 733978797 readlines.out
5540058 83892946 733978797 foreach.out


The use of gsub is innocuous since every method uses it, but it's not needed and was added for a bit of frivolous resistive loading.

