Problem description
I am using the Spatie\Crawler crawler software in a fairly standard way, like so:
$client = new Client([
    RequestOptions::COOKIES => true,
    RequestOptions::CONNECT_TIMEOUT => 10,
    RequestOptions::TIMEOUT => 10,
    RequestOptions::ALLOW_REDIRECTS => true,
]);
$crawler = new Crawler($client, 1);
$crawler
    ->setCrawlProfile(new MyCrawlProfile($startUrl, $pathRegex))
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling($url);
I've omitted the definitions of the MyCrawlProfile and MyCrawlObserver classes for brevity, but anyway, this works as it stands.
I want to add some middleware in order to change some requests before they are made, so I added this demo code:
$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
$stack->push(
    Middleware::mapRequest(function (RequestInterface $request) {
        echo "Middleware running\n";
        return $request;
    })
);
$client = new Client([
    RequestOptions::COOKIES => true,
    RequestOptions::CONNECT_TIMEOUT => 10,
    RequestOptions::TIMEOUT => 10,
    RequestOptions::ALLOW_REDIRECTS => true,
    'handler' => $stack,
]);
// ... rest of crawler code here ...
However, it falls at the first hurdle: it scrapes the root of the site (/), which is actually a Location redirect, and then stops. It turns out that I am now missing the RedirectMiddleware despite not having removed it deliberately.
So, my problem is fixed by also adding this:
$stack->push(Middleware::redirect());
I wonder now what other things are set up by default in Guzzle that I have accidentally removed by creating a fresh HandlerStack. Cookies? Retry mechanisms? Other stuff? I don't need those things right now, but I'd be a bit more confident about my system's long-term reliability if my code merely modified the existing stack.
Is there a way to do that? As far as I can tell, I'm doing things as per the manual.
Recommended answer
Use
$stack = HandlerStack::create();
instead of
$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
It's important, because create() adds additional middlewares, especially for redirects.
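To make this concrete, here is a minimal sketch of the question's demo code rebuilt on top of the default stack. It assumes Guzzle 6 or 7 installed via Composer (the autoload path and the middleware name 'demo_map_request' are illustrative assumptions, not part of the original code). In those versions, HandlerStack::create() pushes four middlewares named http_errors, allow_redirects, cookies and prepare_body; a retry middleware exists (Middleware::retry()) but is not part of the defaults.

```php
<?php
// Assumption: Guzzle was installed with Composer and this file sits next to vendor/.
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use GuzzleHttp\RequestOptions;
use Psr\Http\Message\RequestInterface;

// create() chooses a default handler and pushes Guzzle's standard middleware,
// including the redirect middleware that went missing in the question.
$stack = HandlerStack::create();

// Add the custom middleware on top of the defaults rather than replacing them.
$stack->push(
    Middleware::mapRequest(function (RequestInterface $request) {
        echo "Middleware running\n";
        return $request;
    }),
    'demo_map_request' // hypothetical name, used only to identify this entry
);

// Casting the stack to a string dumps the handler and every middleware in order,
// which is a quick way to check what is actually installed.
echo (string) $stack;

$client = new Client([
    RequestOptions::COOKIES => true,
    RequestOptions::CONNECT_TIMEOUT => 10,
    RequestOptions::TIMEOUT => 10,
    RequestOptions::ALLOW_REDIRECTS => true,
    'handler' => $stack,
]);
```

With the stack built this way, the extra $stack->push(Middleware::redirect()) call from the question should no longer be needed, since the redirect middleware is already present under the name allow_redirects.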