


We were set an algorithm problem in class today, with the promise that "if you figure out a solution, you don't have to do this subject". So of course, we all thought we would give it a go.


Basically, we were given a DB of 100 words and 10 categories. There is no existing mapping between the words and the categories, so it's just a list of 100 words and a list of 10 categories.

We have to "place" the words into the correct category. That is, we have to "figure out" how to put each word into the correct category. So we must "understand" the word, and then put it in the most appropriate category algorithmically.

For example, one of the words is "fishing" and one of the categories is "sport", so "fishing" would go into "sport". There is some overlap, such that some words could go into more than one category.


If we figure it out, we then have to increase the sample size, and the person with the "best" matching percentage wins.


Does anyone have any idea how to start something like this? Or any resources? Preferably in C#.


Even a keyword DB or something similar might be helpful. Does anyone know of any free ones?



First of all, you need sample text to analyze in order to get the relationships between words. A categorization with latent semantic analysis is described in "Latent Semantic Analysis approaches to categorization".
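To make the "relationship of words" idea concrete, here is a rough sketch. It is not full latent semantic analysis (which would also apply a singular value decomposition to the word/document matrix); it only counts which words share a sentence and compares those counts with cosine similarity. It is written in Python rather than C#, but the logic should port directly. The sample sentences and the function names (`cooccurrence_vectors`, `cosine`, `best_category`) are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences):
    """Map each word to a Counter of the words it shares a sentence with."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_category(word, categories, vectors):
    """Pick the category whose context vector is most similar to the word's."""
    word_vec = vectors.get(word, Counter())
    return max(
        categories,
        key=lambda c: cosine(word_vec, vectors.get(c, Counter())),
    )


# Hypothetical sample text; a real run needs a much larger corpus.
sample_text = [
    "fishing is a relaxing outdoor sport",
    "football is a popular team sport",
    "pasta is a simple food to cook",
    "bread is a basic food",
]
vectors = cooccurrence_vectors(sample_text)
print(best_category("fishing", ["sport", "food"], vectors))
```

With so little text, common words like "is" and "a" dominate the vectors, so a real attempt would probably want stop-word removal or some weighting such as tf-idf before comparing.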


A different approach would be naive Bayes text categorization. It needs sample texts with assigned categories. In a learning step, the program learns the different categories and the likelihood that a word occurs in a text assigned to each category; see Bayesian spam filtering. I don't know how well that works with single words.
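A minimal sketch of the naive Bayes idea, in Python rather than C# (the structure should carry over). It assumes whitespace tokenization and add-one (Laplace) smoothing, which are my choices and not something the assignment specifies. The class name `NaiveBayes` and the training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of texts
        self.vocabulary = set()

    def train(self, text, category):
        """Learning step: count the words seen in a text of this category."""
        tokens = text.lower().split()
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        """Return the category with the highest log-probability score."""
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, docs in self.doc_counts.items():
            # Prior: how often this category appears in the training data.
            score = math.log(docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            for token in tokens:
                count = self.word_counts[category][token]
                # Add-one smoothing so unseen words do not zero the score.
                score += math.log(
                    (count + 1) / (total_words + len(self.vocabulary))
                )
            if score > best_score:
                best, best_score = category, score
        return best


# Hypothetical training texts; real use needs far more per category.
nb = NaiveBayes()
nb.train("we went fishing on the lake and caught a trout", "sport")
nb.train("the team won the football match", "sport")
nb.train("she baked bread and cooked pasta for dinner", "food")
nb.train("the soup needs more salt and pepper", "food")
print(nb.classify("fishing"))
```

For a single word the score comes down to the category prior times one smoothed word likelihood, so a word that never appears in the training text is decided by the prior and the category sizes alone. That is probably why single words are the hard case here.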


08-13 18:33