National Day holiday, day 3 (2014-10-03 10:21:39): Nutz, WebCollector, jsoup

(1) Do it well or do it fast: you can only pick one.

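(2) Time passes quickly, and you cannot fit extra plans into a single holiday day. The whole holiday is yours to spend, but with somewhat longer sleep and newly added entertainment (videos or games), you are no more efficient than you are at work.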

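(3) Practice a little every day and write it down. Using the holiday to consolidate, polish, and reinforce what you already practiced is the best choice. Make progress every day.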

(4) Don't expect too much from a holiday.

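(5) Build a small application by following Lecture 1 of the Nutz introductory tutorial. The video was not very sharp, but I still watched it twice and typed the code out bit by bit. Video is more direct. The Nutz documentation is very detailed, yet I still find watching the video faster and reading the docs too slow; perhaps the way I read documentation needs to improve.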

(6) Java crawler WebCollector:

An example that reads the API documentation: MyParser.java, DocCrawler.java

// MyParser.java
package demo.hello;

import java.io.UnsupportedEncodingException;

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import cn.edu.hfut.dmic.webcollector.model.Link;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.parser.HtmlParser;
import cn.edu.hfut.dmic.webcollector.parser.ParseResult;

// Extends the default HTML parser so that <frame src="..."> targets
// are also added to the links found on a page.
public class MyParser extends HtmlParser {

    public MyParser(Integer topN) {
        super(topN);
    }

    @Override
    public ParseResult getParse(Page page) throws UnsupportedEncodingException {
        // Let the default parser collect the ordinary links first.
        ParseResult parseResult = super.getParse(page);
        // Select every <frame> element that has a src attribute.
        Elements frames = page.getDoc().select("frame[src]");
        for (Element frame : frames) {
            Link link = new Link();
            link.setAnchor("");
            // "abs:src" makes jsoup return src resolved to an absolute URL.
            link.setUrl(frame.attr("abs:src"));
            parseResult.getParsedata().getLinks().add(link);
        }
        return parseResult;
    }
}
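The `abs:src` lookup in MyParser is what turns a frame's relative `src` into a URL the crawler can fetch. As a rough illustration of that resolution step only, here is a sketch that uses `java.net.URI` from the standard library instead of jsoup. `FrameLinkResolver` and the frame file names are hypothetical, and jsoup's own handling of edge cases (empty or malformed values, for instance) may differ.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of WebCollector or jsoup.
class FrameLinkResolver {

    // Resolve one possibly relative src value against the URL of the page
    // it appeared on. Absolute src values are returned unchanged.
    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    // Resolve every frame src found on a page into an absolute URL.
    static List<String> resolveAll(String pageUrl, List<String> frameSrcs) {
        List<String> links = new ArrayList<>();
        for (String src : frameSrcs) {
            links.add(resolve(pageUrl, src));
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "http://crawlscript.github.io/WebCollectorDoc/";
        // Illustrative frame file names, assumed for this sketch.
        List<String> srcs = Arrays.asList("overview-frame.html", "allclasses-frame.html");
        for (String link : resolveAll(page, srcs)) {
            System.out.println(link);
        }
    }
}
```

Without this step a frame-based documentation site would yield only its frameset page, since the real content sits behind the `frame` elements rather than ordinary `a` links.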

// DocCrawler.java
package demo.hello;

import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.parser.Parser;
import cn.edu.hfut.dmic.webcollector.util.Config;

// Breadth-first crawler that parses HTML pages with MyParser.
public class DocCrawler extends BreadthCrawler {

    @Override
    public Parser createParser(String url, String contentType) throws Exception {
        // Only HTML responses get a parser; everything else is skipped.
        if (contentType == null) {
            return null;
        }
        if (!contentType.contains("text/html")) {
            return null;
        }
        return new MyParser(Config.topN);
    }

    public static void main(String[] args) throws Exception {
        DocCrawler crawler = new DocCrawler();
        // Start from the WebCollector API docs and stay within that site.
        crawler.addSeed("http://crawlscript.github.io/WebCollectorDoc/");
        crawler.addRegex("http://crawlscript.github.io/WebCollectorDoc.*");
        crawler.setRoot("pages");
        crawler.setThreads(20);
        crawler.start(10);
    }
}
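To make the roles of the seed, the URL regex, and the number passed to `start` more concrete, here is a minimal breadth-first sketch over an in-memory link graph. It is not WebCollector's implementation: `BreadthFirstSketch`, the map standing in for HTTP fetching, full-match regex semantics, and reading `10` as the number of levels to crawl are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch; the link graph replaces real page fetching.
class BreadthFirstSketch {

    // Visit pages level by level, following only links that fully match
    // the regex, and stop after maxDepth levels. Returns pages in visit order.
    static List<String> crawl(String seed, String regex, int maxDepth,
                              Map<String, List<String>> linkGraph) {
        Pattern allowed = Pattern.compile(regex);
        Set<String> visited = new LinkedHashSet<>();
        List<String> currentLevel = new ArrayList<>();
        currentLevel.add(seed);
        for (int depth = 0; depth < maxDepth && !currentLevel.isEmpty(); depth++) {
            List<String> nextLevel = new ArrayList<>();
            for (String url : currentLevel) {
                if (!visited.add(url)) {
                    continue; // already visited at this or an earlier level
                }
                List<String> links = linkGraph.getOrDefault(url, Collections.<String>emptyList());
                for (String link : links) {
                    // Queue only unseen links that pass the URL filter.
                    if (allowed.matcher(link).matches() && !visited.contains(link)) {
                        nextLevel.add(link);
                    }
                }
            }
            currentLevel = nextLevel;
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        String base = "http://crawlscript.github.io/WebCollectorDoc/";
        // Made-up pages and links, assumed for this sketch.
        Map<String, List<String>> graph = new HashMap<>();
        graph.put(base, Arrays.asList(base + "a.html", "http://example.com/other.html"));
        graph.put(base + "a.html", Arrays.asList(base + "b.html", base));
        String regex = "http://crawlscript.github.io/WebCollectorDoc.*";
        for (String url : crawl(base, regex, 10, graph)) {
            System.out.println(url);
        }
    }
}
```

In this sketch the regex keeps the crawl inside the documentation site, and the depth limit bounds how many levels of links are followed from the seed.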

(7) HTML parser jsoup
Official site, OSChina (osc) introduction