Splitting #tags into meaningful words

Problem description

I need to split #tags into meaningful words in an automated way.

Sample input:

  • iloveusa
  • mycrushlike
  • mydadhero

Sample output:

  • i love usa
  • my crush like
  • my dad hero

Is there any utility or open API that I can use to achieve this?

Recommended answer

Check the Word Segmentation Task from Norvig's work.

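from __future__ import division  # true division on Python 2; no effect on Python 3
from collections import Counter
import nltk

# The Brown corpus must be available locally: run nltk.download('brown') once.
WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)  # word -> number of occurrences in the corpus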

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x]/N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together.  (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:]) 
            for i in range(start, min(len(text), L)+1)]

def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text: 
        return []
    else:
        candidates = ([first] + segment(rest) 
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

print(segment('iloveusa'))     # ['i', 'love', 'us', 'a']
print(segment('mycrushlike'))  # ['my', 'crush', 'like']
print(segment('mydadhero'))    # ['my', 'dad', 'hero']
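
Two practical limits of the code above are worth noting: `segment` re-solves the same suffixes many times, and `P` returns 0 for any word missing from the corpus, so every candidate containing an unseen word ties at zero. The following is a minimal sketch of one way to address both, not the answer's own code. It assumes a tiny made-up count table (`TOY_COUNTS`, hypothetical, standing in for the Brown corpus counts) and an assumed penalty of a factor of 10 per character for unseen words; memoization comes from `functools.lru_cache`, and scores are summed in log space to avoid underflow.

```python
import math
from collections import Counter
from functools import lru_cache

# Hypothetical toy counts, standing in for counts taken from a real corpus.
TOY_COUNTS = Counter({
    "i": 50, "love": 20, "usa": 5, "us": 15, "a": 60,
    "my": 40, "crush": 3, "like": 25, "dad": 8, "hero": 4,
})
TOTAL = sum(TOY_COUNTS.values())

def log_prob(word):
    """Log-probability of a word under the toy unigram counts."""
    if word in TOY_COUNTS:
        return math.log(TOY_COUNTS[word] / TOTAL)
    # Assumption: an unseen word costs an extra factor of 10 per character,
    # so long unknown chunks are strongly discouraged but never impossible.
    return math.log(1.0 / TOTAL) - len(word) * math.log(10)

def score(words):
    """Sum of log-probabilities, i.e. the log of the product of probabilities."""
    return sum(log_prob(w) for w in words)

def splits(text, max_len=20):
    """All (first, rest) pairs with 1 <= len(first) <= max_len."""
    return [(text[:i], text[i:]) for i in range(1, min(len(text), max_len) + 1)]

@lru_cache(maxsize=None)
def segment(text):
    """Most probable segmentation of text, as a tuple of words (memoized)."""
    if not text:
        return ()
    candidates = ((first,) + segment(rest) for first, rest in splits(text))
    return max(candidates, key=score)

print(segment("iloveusa"))     # ('i', 'love', 'usa')
print(segment("mycrushlike"))  # ('my', 'crush', 'like')
print(segment("mydadhero"))    # ('my', 'dad', 'hero')
```

With these toy counts `usa` happens to outscore `us` + `a`; with real corpus counts the result depends on how often each word occurs, which is why the Brown-based output above differs.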

For a better solution than this, you can use a bigram/trigram model.
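
To illustrate what a bigram model adds, here is a small sketch under stated assumptions. The counts in `UNIGRAMS` and `BIGRAMS` are invented for the example, and the scoring is a simple backoff (use the bigram ratio when the pair was seen, otherwise a discounted unigram probability, with the discount `BACKOFF = 0.4` chosen arbitrarily). It is not taken from Norvig's notebook. Under these toy unigram counts alone, `us` + `a` outscores `usa`, which mirrors the `['i', 'love', 'us', 'a']` output above; conditioning on the previous word `love` reverses that.

```python
import math
from collections import Counter
from functools import lru_cache

# Hypothetical toy counts; real counts would come from a corpus.
UNIGRAMS = Counter({"i": 50, "love": 20, "usa": 2, "us": 15, "a": 60})
BIGRAMS = Counter({("i", "love"): 10, ("love", "usa"): 2, ("love", "us"): 1})
TOTAL = sum(UNIGRAMS.values())
BACKOFF = 0.4  # assumed discount applied when a bigram was never seen

def unigram_prob(word):
    """P(word); unseen words are penalized by a factor of 10 per character."""
    if word in UNIGRAMS:
        return UNIGRAMS[word] / TOTAL
    return 1.0 / (TOTAL * 10 ** len(word))

def cond_prob(word, prev):
    """Score of word given the previous word, backing off to the unigram."""
    if prev is None:
        return unigram_prob(word)
    if BIGRAMS[(prev, word)] > 0:
        return BIGRAMS[(prev, word)] / UNIGRAMS[prev]
    return BACKOFF * unigram_prob(word)

@lru_cache(maxsize=None)
def segment(text, prev=None):
    """Return (log_score, words) for the best segmentation of text after prev."""
    if not text:
        return 0.0, ()
    best = None
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest, first)
        total = math.log(cond_prob(first, prev)) + rest_score
        if best is None or total > best[0]:
            best = (total, (first,) + rest_words)
    return best

# Unigrams alone prefer "us a" over "usa" with these counts ...
print(unigram_prob("us") * unigram_prob("a") > unigram_prob("usa"))  # True
# ... but the bigram ("love", "usa") tips the balance.
print(segment("iloveusa")[1])  # ('i', 'love', 'usa')
```

The memoization key includes the previous word, since the best split of a suffix can depend on what precedes it. A trigram model would extend the same idea by keying on the previous two words.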

For more examples, see: Word Segmentation Task


11-01 15:53