Problem description
As part of a larger project, I need to read in text and represent each word as a number. For example, if the program reads in "Every good boy deserves fruit", then I would get a table that converts 'every' to '1742', 'good' to '977513', etc.
Now, obviously I can just use a hashing algorithm to get these numbers. However, it would be more useful if words with similar meanings had numerical values close to each other, so that 'good' becomes '6827' and 'great' becomes '6835', etc.
As another option, instead of a simple integer representing each word, it would be even better to have a vector made up of multiple numbers, e.g. (lexical_category, tense, classification, specific_word), where lexical_category is noun/verb/adjective/etc., tense is future/past/present, classification defines a wide set of general topics, and specific_word is much the same as described in the previous paragraph.
Does any such algorithm exist? If not, can you give me any tips on how to get started on developing one myself? I code in C++.
Recommended answer
To map a word to a number, you should probably just use an index. Using hashcodes is just asking for trouble, since completely unrelated words could end up using the same value.
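As a minimal sketch of the index idea in C++ (the `Vocabulary` class and its member names are made up for illustration, not taken from any library): each new word gets the next free integer, and a repeated word gets its old one back. `std::unordered_map` does hash internally, but it compares the full keys on a collision, so two different words never end up sharing an index.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Assigns each distinct word a dense integer index in order of first
// appearance. Hypothetical helper class for illustration only.
class Vocabulary {
public:
    // Returns the existing index for `word`, or assigns the next free one.
    std::size_t index_of(const std::string& word) {
        auto [it, inserted] = index_.try_emplace(word, words_.size());
        if (inserted) {
            words_.push_back(word);
        }
        return it->second;
    }

    // Reverse lookup: the word stored at a previously returned index.
    const std::string& word_at(std::size_t index) const {
        assert(index < words_.size());
        return words_[index];
    }

    std::size_t size() const { return words_.size(); }

private:
    std::unordered_map<std::string, std::size_t> index_;  // word -> index
    std::vector<std::string> words_;                       // index -> word
};
```

These indices carry no meaning by themselves; they are only stable identifiers that the semantic measures below can be attached to. This needs C++17 for `try_emplace` and structured bindings.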
There are a number of ways to get a numerical measure of how semantically related words are, such as latent semantic analysis (LSA) or using some measure of relatedness within a lexical resource like WordNet (e.g. Lin, Resnik, or Jiang-Conrath).
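With LSA, each word ends up as a vector of real numbers, and relatedness is usually read off as the cosine of the angle between two such vectors. The sketch below shows only that comparison step, assuming you already have the vectors from somewhere; it does not perform the LSA decomposition itself, and the function name is just an illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two equal-length vectors.
// Returns 0.0 when either vector has zero norm, since the angle is
// undefined in that case.
double cosine_similarity(const std::vector<double>& a,
                         const std::vector<double>& b) {
    assert(a.size() == b.size());
    double dot = 0.0;
    double norm_a = 0.0;
    double norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) {
        return 0.0;
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Note that this gives you a similarity between pairs of words rather than a single integer per word, which is probably the more realistic thing to aim for: "closeness" in meaning does not fit well on one number line.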
To get what you're calling lexical categories, you'll need to use a part-of-speech (POS) tagger. The POS tags will also give you tense information (e.g., in the Penn Treebank tag set, VBD means the word is a past tense verb).
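Once a tagger has labelled the words, turning its tags into the (lexical_category, tense) part of your vector is a simple lookup. Here is a rough sketch assuming Penn Treebank tags, where VBD is a past tense verb and VBP/VBZ are present tense forms; the enum, struct, and function names are invented for this example, and the tagging itself is assumed to happen elsewhere.

```cpp
#include <cassert>
#include <string>

enum class LexicalCategory { Noun, Verb, Adjective, Adverb, Other };
enum class Tense { Past, Present, Unspecified };

struct TagInfo {
    LexicalCategory category;
    Tense tense;
};

// Maps a Penn Treebank tag to a coarse category and tense.
// Hypothetical helper; only a handful of tag families are covered.
TagInfo describe_tag(const std::string& tag) {
    assert(!tag.empty());  // a tagger should never emit an empty tag
    if (tag == "VBD") {
        return {LexicalCategory::Verb, Tense::Past};
    }
    if (tag == "VBP" || tag == "VBZ") {
        return {LexicalCategory::Verb, Tense::Present};
    }
    // rfind(prefix, 0) == 0 is a "starts with" check.
    if (tag.rfind("VB", 0) == 0) {  // VB, VBG, VBN carry no tense alone
        return {LexicalCategory::Verb, Tense::Unspecified};
    }
    if (tag.rfind("NN", 0) == 0) {  // NN, NNS, NNP, NNPS
        return {LexicalCategory::Noun, Tense::Unspecified};
    }
    if (tag.rfind("JJ", 0) == 0) {  // JJ, JJR, JJS
        return {LexicalCategory::Adjective, Tense::Unspecified};
    }
    if (tag.rfind("RB", 0) == 0) {  // RB, RBR, RBS
        return {LexicalCategory::Adverb, Tense::Unspecified};
    }
    return {LexicalCategory::Other, Tense::Unspecified};
}
```

There is no future value here on purpose: English marks the future with a modal such as "will" (tagged MD) followed by a base form verb, so it cannot be read off a single word's tag.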
To assign words to topics, you could make use of hypernym information from WordNet. This will give you stuff like 'red' is a 'color'. Or, you could make use of Latent Dirichlet allocation (LDA), if you would like to have a softer assignment of words to topics such that each word can be assigned to numerous topics to varying degrees.
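To give a feel for the hypernym approach, here is a toy sketch that walks "is a" links upward from a word. The table is hand-built for the example and is not real WordNet data or the WordNet API; real WordNet works on word senses and a sense can have more than one hypernym, which this simplification ignores.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for a hypernym resource: each word maps to one parent.
using HypernymTable = std::unordered_map<std::string, std::string>;

// Follows parent links from `word` until no parent is found and returns
// the ancestors in order, nearest first. The word itself is not included.
std::vector<std::string> hypernym_chain(const HypernymTable& table,
                                        const std::string& word) {
    std::vector<std::string> chain;
    std::string current = word;
    for (auto it = table.find(current); it != table.end();
         it = table.find(current)) {
        current = it->second;
        chain.push_back(current);
        // In an acyclic table the chain can never outgrow the table.
        assert(chain.size() <= table.size());
    }
    return chain;
}
```

Picking an ancestor at a fixed depth from the top of the chain is one possible way to get the "classification" component of the vector.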