Problem Description
I am going through the golang tour and working on the final exercise, changing a web crawler to crawl in parallel without repeating a crawl (http://tour.golang.org/#73). All I have changed is the Crawl function.
var used = make(map[string]bool)

func Crawl(url string, depth int, fetcher Fetcher) {
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n\n", url, body)
    for _, u := range urls {
        if used[u] == false {
            used[u] = true
            Crawl(u, depth-1, fetcher)
        }
    }
    return
}
In order to make it concurrent I added the go command in front of the call to the function Crawl, but instead of recursively calling the Crawl function, the program only finds the "http://golang.org/" page and no other pages.
Why doesn't the program work when I add the go command to the call of the function Crawl?
Solution
The problem seems to be that your process is exiting before all URLs can be followed by the crawler. Because of the concurrency, the main() procedure exits before the workers are finished.
To circumvent this, you could use sync.WaitGroup:
func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
    defer wg.Done()
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n\n", url, body)
    for _, u := range urls {
        if used[u] == false {
            used[u] = true
            wg.Add(1)
            go Crawl(u, depth-1, fetcher, wg)
        }
    }
    return
}
And call Crawl in main as follows:

func main() {
    wg := &sync.WaitGroup{}
    wg.Add(1) // account for the initial synchronous Crawl call, whose deferred wg.Done() would otherwise unbalance the counter
    Crawl("http://golang.org/", 4, fetcher, wg)
    wg.Wait()
}
Also, don't rely on the map being thread safe.
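To illustrate that last point, here is a minimal sketch (not part of the original answer) of one way to serialize access to the shared used map with a sync.Mutex; the markUsed helper is an invented name for this example:

var (
    mu   sync.Mutex
    used = make(map[string]bool)
)

// markUsed reports whether url is being seen for the first time,
// recording it under the mutex so that concurrent Crawl goroutines
// cannot both claim the same URL.
func markUsed(url string) bool {
    mu.Lock()
    defer mu.Unlock()
    if used[url] {
        return false
    }
    used[url] = true
    return true
}

The loop inside Crawl would then become:

for _, u := range urls {
    if markUsed(u) {
        wg.Add(1)
        go Crawl(u, depth-1, fetcher, wg)
    }
}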