Problem description
I have a string that is ~10 GB in size (huge RAM usage, of course). The thing is, I need to perform string operations like gsub and split on it. I noticed that Ruby will just "stop working" at some point (without raising any errors, though).
Example:
str = HUGE_STRING_10_GB
# I will try to split the string using .split:
str.split("\r\n")
# but Ruby will instead just return an array containing
# the full, unsplit string itself...
# let's break this down:
# each of those attempts doesn't cause problems and
# returns arrays with thousands or even millions of items (lines)
str[0..999].split("\r\n")
str[0..999_999].split("\r\n")
str[0..999_999_999].split("\r\n")
# starting from here, problems will occur
str[0..1_999_999_999].split("\r\n")
I'm using Ruby MRI 1.8.7. What is wrong here? Why is Ruby not able to perform string operations on huge strings? And what is a solution?
The only solution I came up with is to "loop" through the string using [0..9], [10..19], ... and to perform the string operations part by part. However, this seems unreliable: for example, what if my split delimiter is very long and falls between two "parts"?
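The straddling-delimiter problem is solvable by carrying the trailing, possibly incomplete record over into the next chunk. Here is a minimal sketch of that idea; each_record, DELIM, and CHUNK_SIZE are illustrative names, not anything from the original question:

```ruby
DELIM      = "\r\n"        # record separator (assumed)
CHUNK_SIZE = 1024 * 1024   # 1 MB per read; tune to taste

# Reads io in fixed-size chunks and yields complete records, keeping any
# partial trailing record in a buffer so a delimiter split across two
# chunks is never lost.
def each_record(io, delim = DELIM, chunk_size = CHUNK_SIZE)
  buffer = ""
  while (chunk = io.read(chunk_size))
    buffer << chunk
    records = buffer.split(delim, -1) # -1 keeps a trailing empty piece
    buffer  = records.pop             # last piece may be incomplete
    records.each { |r| yield r }
  end
  yield buffer unless buffer.empty?   # flush the final record
end
```

Because the incomplete tail is re-buffered rather than yielded, a delimiter that straddles a chunk boundary is reassembled on the next read.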
Another solution that actually works fine is to iterate the string with something like str.each_line {..}. However, this only handles newline delimiters.
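Note that each_line is not actually limited to newlines: both String#each_line and IO.foreach accept a record-separator argument, so any delimiter works. A small sketch with a made-up "--" separator:

```ruby
records = []
"alpha--beta--gamma".each_line("--") do |rec|
  records << rec.chomp("--") # each_line keeps the separator; strip it
end
records # => ["alpha", "beta", "gamma"]
```

The same separator argument works for files, e.g. File.foreach(path, "--") { |rec| ... }, which streams the file record by record without slurping it.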
Thanks for all those answers. In my case, the "HUGE 10 GB STRING" is actually a download from the internet. It contains data that is delimited by a specific sequence (in most cases a simple newline). In my scenario I compare EACH ELEMENT of the 10 GB file to another (smaller) data set that I already have in my script. I appreciate all suggestions.
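For that use case, the smaller data set can live in memory as a Set while the big file is streamed record by record; only the current line is ever held. A hypothetical sketch (the inline sample array stands in for File.foreach on the real download):

```ruby
require 'set'

known = Set.new(%w[alice bob carol]) # stand-in for the smaller data set

matches = []
# File.foreach("huge_download.txt") do |line|   # real usage
["alice\n", "dave\n", "bob\n"].each do |line|   # inline sample for illustration
  rec = line.chomp
  matches << rec if known.include?(rec)         # O(1) membership test
end
matches # => ["alice", "bob"]
```

Set lookups are constant time, so the total cost stays linear in the size of the download regardless of how large it is.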
Recommended answer
Here's a benchmark against a real-life log file. Of the methods used to read the file, only the one using foreach is scalable, because it avoids slurping the file.
Using lazy adds overhead, resulting in slower times than map alone.
Notice that foreach is right in there with the fastest, while still providing a scalable solution. Ruby doesn't care whether the file is a few thousand lines or a few terabytes; it only ever sees one line at a time. See "Why is 'slurping' a file not a good practice?" for some related information about reading files.

People often gravitate toward something that pulls in the whole file at once and then splits it into parts. That ignores the work Ruby then has to do to rebuild the array based on line endings using split or a similar method. That adds up, and it's why I think foreach comes out ahead.
require 'benchmark'
REGEX = /\bfoo\z/
LOG = 'debug.log'
N = 1
# File.readlines: "Reads the entire file specified by name as individual
# lines, and returns those lines in an array."
#
# Because the whole file is read into an array first, this isn't scalable. It
# will work if Ruby has enough memory to hold the array plus all other
# variables and its overhead.
def lazy_map(filename)
  File.open("lazy_map.out", 'w') do |fo|
    fo.puts File.readlines(filename).lazy.map { |li|
      li.gsub(REGEX, 'bar')
    }.force
  end
end
# File.readlines: "Reads the entire file specified by name as individual
# lines, and returns those lines in an array."
#
# Because the whole file is read into an array first, this isn't scalable. It
# will work if Ruby has enough memory to hold the array plus all other
# variables and its overhead.
def map(filename)
  File.open("map.out", 'w') do |fo|
    fo.puts File.readlines(filename).map { |li|
      li.gsub(REGEX, 'bar')
    }
  end
end
# "Reads the entire file specified by name as individual lines, and returns
# those lines in an array."
#
# As a result of returning all the lines in an array, this isn't scalable. It
# will work if Ruby has enough memory to hold the array plus all other
# variables and its overhead.
def readlines(filename)
  File.open("readlines.out", 'w') do |fo|
    File.readlines(filename).each do |li|
      fo.puts li.gsub(REGEX, 'bar')
    end
  end
end
# This is completely scalable because no file slurping is involved.
# "Executes the block for every line in the named I/O port..."
#
# It's slower, but it works reliably.
def foreach(filename)
  File.open("foreach.out", 'w') do |fo|
    File.foreach(filename) do |li|
      fo.puts li.gsub(REGEX, 'bar')
    end
  end
end
puts "Ruby version: #{ RUBY_VERSION }"
puts "log bytes: #{ File.size(LOG) }"
puts "log lines: #{ `wc -l #{ LOG }`.to_i }"
2.times do
  Benchmark.bm(13) do |b|
    b.report('lazy_map')  { lazy_map(LOG) }
    b.report('map')       { map(LOG) }
    b.report('readlines') { readlines(LOG) }
    b.report('foreach')   { foreach(LOG) }
  end
end
%w[lazy_map map readlines foreach].each do |s|
  puts `wc #{ s }.out`
end
Ruby version: 2.0.0
log bytes: 733978797
log lines: 5540058
user system total real
lazy_map 35.010000 4.120000 39.130000 ( 43.688429)
map 29.510000 7.440000 36.950000 ( 43.544893)
readlines 28.750000 9.860000 38.610000 ( 43.578684)
foreach 25.380000 4.120000 29.500000 ( 35.414149)
user system total real
lazy_map 32.350000 9.000000 41.350000 ( 51.567903)
map 24.740000 3.410000 28.150000 ( 32.540841)
readlines 24.490000 7.330000 31.820000 ( 37.873325)
foreach 26.460000 2.540000 29.000000 ( 33.599926)
5540058 83892946 733978797 lazy_map.out
5540058 83892946 733978797 map.out
5540058 83892946 733978797 readlines.out
5540058 83892946 733978797 foreach.out
The use of gsub is innocuous since every method applies it; it isn't required, but it was added to impose a bit of trivial processing load on each line.