Problem description
The MapReduce framework seems designed to work with many files, so when I get errors telling me I'm using too many files, I suspect I'm doing something wrong.
If I run the job with the inline runner and three directories, it works:
$ python mr_gps_quality.py /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/
But if I run it with the local runner (and the same three directories), it fails:
$ python mr_gps_quality.py /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r local --no-output --output-dir city1_results/gps_quality/2015/03/
[...output clipped...]
> /Users/andrewsturges/sturges/mr/env/bin/python mr_gps_quality.py --step-num=0 --mapper /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/input_part-00249 > /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/step-k0-mapper_part-00249
Traceback (most recent call last):
File "mr_gps_quality.py", line 53, in <module>
MRGPSQuality.run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
mr_job.execute()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
super(MRJob, self).execute()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
self.run_job()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
runner.run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
self._run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 182, in _run
self._invoke_step(step_num, 'mapper')
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 269, in _invoke_step
working_dir, env)
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 150, in _run_step
procs_args, output_path, working_dir, env)
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 253, in _invoke_processes
cwd=working_dir, env=env)
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 76, in _chain_procs
proc = Popen(args, **proc_kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1197, in _execute_child
errpipe_read, errpipe_write = self.pipe_cloexec()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1153, in pipe_cloexec
r, w = os.pipe()
OSError: [Errno 24] Too many open files
Furthermore, if I go back to the inline runner and include even more directories in my input (11 in total), I get a different error:
$ python mr_gps_quality.py /Volumes/Logs/gps/ByCityLogs/city1/*/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/
[...clipped...]
Traceback (most recent call last):
File "mr_gps_quality.py", line 53, in <module>
MRGPSQuality.run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
mr_job.execute()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
super(MRJob, self).execute()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
self.run_job()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
runner.run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
self._run()
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 191, in _run
self._invoke_sort(self._step_input_paths(), sort_output_path)
File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 1202, in _invoke_sort
check_call(args, stdout=output, stderr=err, env=env)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 537, in check_call
retcode = call(*popenargs, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 524, in call
return Popen(*popenargs, **kwargs).wait()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1308, in _execute_child
raise child_exception
OSError: [Errno 7] Argument list too long
The mrjob docs include a discussion of the differences between the inline and local runners, but I don't see how it would explain this behavior.
Lastly, I'll mention that the number of files in the directories I'm globbing isn't huge (acknowledgement):
$ find . -maxdepth 1 -mindepth 1 -type d | while read dir; do printf "%-25.25s : " "$dir"; find "$dir" -type f | wc -l; done | sort
./01 : 236
./02 : 169
./03 : 176
./04 : 185
./05 : 176
./06 : 235
./07 : 275
./08 : 265
./09 : 186
./10 : 171
./11 : 161
I don't think this has anything to do with the job itself, but here it is:
from mrjob.job import MRJob
import numpy as np
import geohash

class MRGPSQuality(MRJob):

    def mapper(self, _, line):
        try:
            lat = float(line.split(',')[1])
            lng = float(line.split(',')[2])
            horizontalAccuracy = float(line.split(',')[4])
            gh = geohash.encode(lat, lng, precision=7)
            yield gh, horizontalAccuracy
        except:
            pass

    def reducer(self, key, values):
        # Convert the generator straight back to array:
        vals = np.fromiter(values, float)
        count = len(vals)
        mean = np.mean(vals)
        if count > 50:
            yield key, [count, mean]

if __name__ == '__main__':
    MRGPSQuality.run()
Recommended answer
The "Argument list too long" problem is not caused by the job or by Python; it comes from the command line. The asterisk in the command that kicks off the job expands to every matching file, which produces a very long argument list and exceeds the system limit on the total size of the arguments that can be passed to a new process (ARG_MAX).
That error has nothing to do with ulimit. The "Too many open files" error, on the other hand, is governed by ulimit (the per-process limit on open file descriptors), so you would run into that limit as well once the command actually ran.
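As a rough illustration of the open-file limit mentioned above, the sketch below reads it from Python and tries to raise the soft limit. It assumes a Unix-like system where the standard `resource` module is available; the helper name `raise_open_file_limit` and the target value are hypothetical, and whether a higher limit is enough to get the local runner through this particular job has not been verified here.

```python
import resource


def raise_open_file_limit(target):
    """Raise the soft limit on open file descriptors to at most `target`.

    Hypothetical helper: returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process cannot exceed the hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Only raise the limit; never lower it.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
```

If this helps at all, the call would have to happen before MRGPSQuality.run(), for example raise_open_file_limit(4096), because the limit is inherited by the subprocesses the runner starts. Some systems cap the value below the reported hard limit, in which case setrlimit raises an error.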
You can check the argument-size limit like this, if you are interested: getconf ARG_MAX
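The same limit can be read from Python and compared with the size of an expanded glob. This is only a sketch, assuming Python 3 on a Unix-like system; `argv_bytes` is a hypothetical helper, and the glob pattern is the one from the question. The estimate is a lower bound, because the environment variables count toward the same limit.

```python
import glob
import os


def argv_bytes(args):
    """Approximate the bytes an argument list occupies when passed to a new process."""
    # Each argument is passed as a byte string plus a terminating NUL.
    return sum(len(os.fsencode(arg)) + 1 for arg in args)


# Same value that `getconf ARG_MAX` reports.
arg_max = os.sysconf("SC_ARG_MAX")

if __name__ == "__main__":
    # Pattern taken from the question; adjust to the real input location.
    paths = glob.glob("/Volumes/Logs/gps/ByCityLogs/city1/*/*.log")
    print("files matched:", len(paths))
    print("argument bytes:", argv_bytes(paths), "of", arg_max, "allowed")
```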
To get around the argument-size limit, you can concatenate all the files into one like this:
for f in *; do cat "$f" >> ../directory/bigfile.log; done
Then run your mrjob job pointed at the big file.
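The concatenation step can also be done from Python, which avoids the shell loop. A minimal sketch, assuming the destination file lives outside the directories matched by the pattern; `concatenate_logs` and the paths in the usage note are hypothetical.

```python
import glob
import shutil


def concatenate_logs(pattern, dest_path):
    """Write every file matching `pattern` into `dest_path`, in sorted order.

    Returns the number of files that were concatenated.
    """
    paths = sorted(glob.glob(pattern))
    # "wb" starts from an empty file, so re-running does not duplicate data.
    with open(dest_path, "wb") as dest:
        for path in paths:
            with open(path, "rb") as src:
                # Stream in chunks instead of loading whole files into memory.
                shutil.copyfileobj(src, dest)
    return len(paths)
```

For example, concatenate_logs("/Volumes/Logs/gps/ByCityLogs/city1/*/*.log", "bigfile.log") would produce a single input file, so the job receives one path instead of thousands. Like cat, this joins files byte for byte, so a file that does not end in a newline will have its last line merged with the first line of the next file.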
If there are a lot of files, you can concatenate them with several jobs at once using GNU parallel, because the command above runs in a single process and is slow.
ls | parallel -m -j 8 "cat {} >> ../files/bigfile.log"
*Change 8 to the degree of parallelism you want.