


I want to teach myself enough machine learning to start putting the available open source ML frameworks to use for tasks like:

  1. Go through the HTML source of pages from a certain site and "understand" which sections form the content, which the advertisements, and which the metadata (neither the content nor the ads, e.g. TOC, author bio, etc.).

  2. Go through the HTML source of pages from disparate sites and "classify" whether the site belongs to a predefined category or not (the list of categories will be supplied beforehand).

  3. ... similar classification tasks on text and pages.


As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.


As far as my limited understanding goes, taking the neural net approach will take a lot more training and maintenance than putting SVMs to use?


I understand that SVMs are well suited to (binary) classification tasks like mine, and that open source frameworks like libSVM are fairly mature?
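To make the binary case concrete, here is a minimal sketch of what a linear SVM does on bag-of-words text features. It is a pure-Python, Pegasos-style stochastic sub-gradient trainer on the hinge loss, not libSVM's API; the function names (`featurize`, `train_linear_svm`, `predict`), the hyperparameters, and the toy "sports vs. not sports" documents are all made up for illustration. A mature framework would replace the training loop entirely.

```python
import random
from collections import Counter


def featurize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def train_linear_svm(docs, labels, lam=0.01, epochs=200, seed=0):
    """Pegasos-style SGD on the hinge loss. Labels must be +1 or -1.

    Returns a sparse weight vector (token -> weight). No bias term is used.
    """
    rng = random.Random(seed)
    samples = [featurize(d) for d in docs]
    w = {}
    order = list(range(len(samples)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            x, y = samples[i], labels[i]
            margin = y * sum(w.get(tok, 0.0) * n for tok, n in x.items())
            # Regularization step: shrink every existing weight.
            shrink = 1.0 - eta * lam
            for tok in w:
                w[tok] *= shrink
            # Hinge-loss step: only samples inside the margin push the weights.
            if margin < 1.0:
                for tok, n in x.items():
                    w[tok] = w.get(tok, 0.0) + eta * y * n
    return w


def predict(w, text):
    """Return +1 or -1 depending on which side of the hyperplane the text falls."""
    score = sum(w.get(tok, 0.0) * n for tok, n in featurize(text).items())
    return 1 if score >= 0 else -1


# Toy training set (invented): +1 = sports page, -1 = anything else.
docs = [
    "football match score goal",
    "team wins league match",
    "player scores goal in final",
    "stock market shares fall",
    "bank raises interest rates",
    "market shares rally on earnings",
]
labels = [1, 1, 1, -1, -1, -1]

w = train_linear_svm(docs, labels)
sports_guess = predict(w, "goal scored in the league match")
finance_guess = predict(w, "interest rates and shares")
```

A "does this page belong to category X" task with several categories can be reduced to one such binary classifier per category (one-vs-rest).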


I would like to stay away from Java, if possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.


My intent is not to write code from scratch but, to begin with, to put the various available frameworks to use (I do not know enough to decide which, though), and I should be able to fix things should they go wrong.


Recommendations to learn specific portions of statistics and probability theory would not be unexpected, so do say so if that is required!


I will modify this question if needed, depending on all your suggestions and feedback.



Seems like a pretty complicated task to me; step 2, classification, is "easy" but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.
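As a rough sketch of that simplification, the snippet below walks an HTML page with Python's standard `html.parser`, records a couple of features per block-level element (text length and link density), and keeps only blocks that pass a crude heuristic. The tag set, the thresholds, the names `BlockCollector` and `candidate_blocks`, and the sample page are assumptions for illustration only; the surviving blocks are what you would then hand to a classifier.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags worth treating as candidate sections.
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article"}


class BlockCollector(HTMLParser):
    """Collects text length and link-text length for each block-level element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open blocks as [tag, text_chars, link_chars]
        self.blocks = []     # finished blocks as feature dicts
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append([tag, 0, 0])

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and self.stack and self.stack[-1][0] == tag:
            _, text_chars, link_chars = self.stack.pop()
            if text_chars:
                self.blocks.append({
                    "tag": tag,
                    "text_chars": text_chars,
                    "link_density": link_chars / text_chars,
                })

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            # Attribute text to the innermost open block only.
            self.stack[-1][1] += n
            if self.link_depth:
                self.stack[-1][2] += n


def candidate_blocks(html, min_chars=40, max_link_density=0.5):
    """Heuristic preselection: keep longer blocks that are not mostly link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks
            if b["text_chars"] >= min_chars
            and b["link_density"] <= max_link_density]


# Invented sample page: a nav bar, one content paragraph, one ad.
page = """
<div id="nav"><a href="/">Home</a> <a href="/about">About us</a></div>
<div id="main"><p>Support vector machines find a maximum-margin separating hyperplane between two classes.</p></div>
<div id="ad"><a href="http://ads.example/buy">Buy cheap widgets now, limited offer!</a></div>
"""

blocks = candidate_blocks(page)
```

Here the navigation and ad blocks are dropped because all of their text sits inside links, and only the paragraph survives. Real pages will need better features (position in the tree, class/id names, and so on), but the shape of the problem becomes per-block classification rather than learning the whole structure.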


08-21 17:02