Unsupervised automatic tagging algorithms?

Question

I want to build a web application that lets users upload documents, videos, images, and music, and then search them. Think of it as Dropbox plus semantic search.

When a user uploads a new file, e.g. Document1.docx, how can I automatically generate tags based on its content? In other words, no user input should be needed to determine what the file is about. Suppose Document1.docx is a research paper on data mining. When a user searches for "data mining", "research paper", or "document1", that file should appear in the search results, since "data mining" and "research paper" would most likely be among the auto-generated tags for that document.

1. Which algorithms would you recommend for this problem?

2. Is there a natural language library that could do this for me?

3. Which machine learning techniques should I look into to improve tagging precision?

4. How could I extend this to automatic tagging of videos and images?

Thanks in advance!

Recommended answer

The most common unsupervised machine learning model for this type of task is Latent Dirichlet Allocation (LDA). It automatically infers a set of topics over a corpus of documents from the words those documents contain. Running LDA on your documents gives each topic a probability distribution over words and each document a probability distribution over topics. When a user searches for a word, you can then retrieve the documents with the highest probability of being relevant to that word.
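As a rough illustration of the idea, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python, followed by the retrieval step described above. It is not taken from any of the libraries mentioned in this answer. The function names (`fit_lda`, `search`, `suggest_tags`), the hyperparameter values, and the toy corpus are all assumptions made for the example. A real system would also need tokenization, stop-word removal, and a much larger corpus, and would more likely rely on an existing implementation.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]       # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                     # tokens assigned to topic k
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, z = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Resample its topic given every other assignment.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    # Smoothed estimates: theta[d][k] = p(topic k | doc d), phi[k][w] = p(word w | topic k).
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    phi = [[(c + beta) / (topic_total[k] + V * beta) for c in topic_word[k]]
           for k in range(num_topics)]
    return vocab, theta, phi


def word_given_doc(wid, d, theta, phi):
    """p(word | doc) = sum over topics of p(word | topic) * p(topic | doc)."""
    return sum(theta[d][k] * phi[k][wid] for k in range(len(phi)))


def search(word, vocab, theta, phi):
    """Rank document indices by p(word | doc), highest first."""
    if word not in vocab:
        return []
    wid = vocab.index(word)
    return sorted(range(len(theta)),
                  key=lambda d: word_given_doc(wid, d, theta, phi), reverse=True)


def suggest_tags(d, vocab, theta, phi, top_n=3):
    """Propose the top_n most probable vocabulary words for document d as tags."""
    ranked = sorted(range(len(vocab)),
                    key=lambda wid: word_given_doc(wid, d, theta, phi), reverse=True)
    return [vocab[wid] for wid in ranked[:top_n]]


# Hypothetical toy corpus, already tokenized.
docs = [
    "data mining research paper clustering data".split(),
    "mining patterns data research classification".split(),
    "guitar music album song guitar".split(),
    "song music concert album band".split(),
]
vocab, theta, phi = fit_lda(docs, num_topics=2)
ranking = search("mining", vocab, theta, phi)
print(ranking)
print(suggest_tags(0, vocab, theta, phi))
```

Because documents are scored through their topic mixture, a document can rank well for a query word it never contains, as long as it shares a topic with documents that do. That is the "semantic" part of the search. The tags proposed here are simply the most probable words under the model, so they should be treated as candidates and not as final labels.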

There have also been extensions to images and music; see http://cseweb.ucsd.edu/~dhu/docs/research_exam09.pdf.

Efficient LDA implementations exist in several languages:

  • many implementations from the original researchers
  • MALLET (http://mallet.cs.umass.edu/), written in Java and recommended by others on Stack Overflow
  • PLDA: a fast, parallelized C++ implementation

