Problem description
I want to teach myself enough machine learning that I can, to begin with, understand enough to put the available open source ML frameworks to use for things like:
- Go through the HTML source of pages from a certain site and "understand" which sections form the content, which the advertisements, and which the metadata (neither the content nor the ads, e.g. TOC, author bio, etc.)
- Go through the HTML source of pages from disparate sites and "classify" whether each site belongs to a predefined category or not (the list of categories will be supplied beforehand)
- ... similar classification tasks on text and pages.
As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.
As far as my limited understanding goes, taking the neural net approach will take a lot more training and maintenance than putting SVMs to use?
I understand that SVMs are well suited to (binary) classification tasks like mine, and that open source frameworks like libSVM are fairly mature?
I would like to stay away from Java, if possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.
My intent is not to write code from scratch but, to begin with, to put the various available frameworks to use (I do not know enough to decide which, though), and I should be able to fix things should they go wrong.
Recommendations from you on learning specific portions of statistics and probability theory would be nothing unexpected from my side, so say so if required!
I will modify this question if needed, depending on all your suggestions and feedback.
Recommended answer
Seems like a pretty complicated task to me; step 2, classification, is "easy" but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.
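To make the "classification on parts of HTML trees" idea a bit more concrete, here is a rough sketch of the preselection step, using only Python's standard html.parser module. The choice of block tags, the two features (text length and link density), and the min_text_len threshold are all assumptions made for illustration, not something the question or any particular framework prescribes.

```python
from html.parser import HTMLParser

# Tags treated as candidate blocks (an assumed, illustrative choice).
BLOCK_TAGS = {"div", "p", "ul", "ol", "table", "article", "section", "aside", "nav"}


class BlockFeatureExtractor(HTMLParser):
    """Collect simple per-block features from an HTML document."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks, each a dict of features
        self._stack = []      # currently open blocks, innermost last
        self._link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append({"tag": tag, "text_len": 0, "link_text_len": 0})
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
            return
        if tag not in BLOCK_TAGS or not any(b["tag"] == tag for b in self._stack):
            return
        # Close blocks down to the matching tag; tolerates unclosed inner tags.
        while self._stack:
            block = self._stack.pop()
            self._finish(block)
            if block["tag"] == tag:
                break

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0 or not self._stack:
            return
        # Text is attributed to the innermost open block only.
        block = self._stack[-1]
        block["text_len"] += n
        if self._link_depth:
            block["link_text_len"] += n

    def _finish(self, block):
        # Blocks without direct text are skipped.
        if block["text_len"] > 0:
            block["link_density"] = block["link_text_len"] / block["text_len"]
            self.blocks.append(block)


def extract_blocks(html):
    """Return a list of feature dicts, one per text-bearing block."""
    parser = BlockFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def preselect(blocks, min_text_len=20):
    """Heuristic filter: keep only blocks with enough text to be worth classifying."""
    return [b for b in blocks if b["text_len"] >= min_text_len]


def to_feature_vector(block):
    """Numeric features that a classifier could consume."""
    return [block["text_len"], block["link_density"]]


if __name__ == "__main__":
    page = "<div><p>Some article text goes here.</p></div><nav><a href='/'>Home</a></nav>"
    for b in extract_blocks(page):
        print(b["tag"], to_feature_vector(b))
```

The vectors from to_feature_vector, together with hand-assigned labels (content, advertisement, metadata), are the kind of input an SVM library such as libSVM works with. Real pages would likely need richer features (tag path, class names, word counts) than this sketch uses.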