Question
I am getting a bunch of relatively small pages from a website and was wondering if I could somehow do it in parallel in Bash. Currently my code looks like this, but it takes a while to execute (I think what is slowing me down is the latency in the connection).
for i in {1..42}
do
wget "https://www.example.com/page$i.html"
done
I have heard of using xargs, but I don't know anything about that and the man page is very confusing. Any ideas? Is it even possible to do this in parallel? Is there another way I could go about attacking this?
Answer
Much preferable to pushing wget into the background using & or -b, you can use xargs to the same effect, and better.
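For contrast, a minimal sketch of the background-job approach being advised against (using the OP's URL pattern); it spawns all 42 wget processes at once with no concurrency limit, and a plain wait returns zero regardless of whether any download failed:

for i in {1..42}
do
    wget -q "https://www.example.com/page$i.html" &
done
wait    # waits for all background jobs; plain wait always exits with status 0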
The advantage is that xargs will synchronize properly with no extra work, which means that you are safe to access the downloaded files (assuming no error occurs). All downloads will have completed (or failed) once xargs exits, and you know by the exit code whether all went well. This is much preferable to busy-waiting with sleep and testing for completion manually.
Assuming that URL_LIST is a variable containing all the URLs (it can be constructed with a loop as in the OP's example, but could also be a manually generated list).
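One way to build that variable is with the OP's own loop (a sketch; the URL pattern is taken from the question):

URL_LIST=""
for i in {1..42}
do
    URL_LIST="$URL_LIST https://www.example.com/page$i.html"
done

Then, running this: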
echo $URL_LIST | xargs -n 1 -P 8 wget -q
will pass one argument at a time (-n 1) to wget, and execute at most 8 parallel wget processes at a time (-P 8). xargs returns after the last spawned process has finished, which is just what we wanted to know. No extra trickery needed.
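As a usage sketch, that exit status can be checked directly (URL_LIST as built above; the messages are illustrative):

if echo $URL_LIST | xargs -n 1 -P 8 wget -q
then
    echo "All downloads completed."
else
    # xargs exits non-zero if any wget invocation failed
    echo "At least one download failed." >&2
fi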
The "magic number" of 8 parallel downloads that I've chosen is not set in stone, but it is probably a good compromise. There are two factors in "maximising" a series of downloads:
One is filling "the cable", i.e. utilizing the available bandwidth. Assuming "normal" conditions (the server has more bandwidth than the client), this is already the case with one or at most two downloads. Throwing more connections at the problem will only result in packets being dropped and TCP congestion control kicking in, leaving N downloads with asymptotically 1/N bandwidth each, to the same net effect (minus the dropped packets, minus window size recovery). Packets being dropped is a normal thing to happen in an IP network; this is how congestion control is supposed to work (even with a single connection), and normally the impact is practically zero. However, an unreasonably large number of connections amplifies this effect, so it can become noticeable. In any case, it doesn't make anything faster.
The second factor is connection establishment and request processing. Here, having a few extra connections in flight really helps. The problem one faces is the latency of two round-trips (typically 20-40ms within the same geographic area, 200-300ms inter-continental) plus the odd 1-2 milliseconds that the server actually needs to process the request and push a reply to the socket. This is not a lot of time per se, but multiplied by a few hundred/thousand requests, it quickly adds up.
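As a rough worked example using the figures above: at two round-trips of ~30 ms each plus ~2 ms of server time, each request costs about 60 ms of pure waiting, so 1,000 sequential requests spend roughly a minute on latency alone; with 8 requests overlapping in flight, that drops to under 10 seconds.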
Having anything from half a dozen to a dozen requests in-flight hides most or all of this latency (it is still there, but since it overlaps, it does not sum up!). At the same time, having only a few concurrent connections does not have adverse effects, such as causing excessive congestion, or forcing a server into forking new processes.