Question
I am considering a project in which a publication's content is augmented by relevant, publicly available tweets from people in the area. But how could I programmatically find the relevant Tweets? I know that generating a structure representing the meaning of natural language is pretty much the holy grail of NLP, but perhaps there's some tool I can use to at least narrow it down a bit?
Alternatively, I could just use hashtags. But that requires more work on behalf of the users. I'm not super familiar with Twitter - do most people use hashtags (even for smaller scale issues), or would relying on them cut off a large segment of data?
I'd also be interested in grabbing Facebook statuses (with permission from the poster, of course), and hashtag use is pretty rare on Facebook.
I could use simple keyword search to crudely narrow the field, but that's more likely to require human intervention to determine which tweets should actually be posted alongside the content.
Ideas? Has this been done before?
Answer
There are two straightforward ways to go about finding tweets relevant to your content. The first would be to treat this as a supervised document classification task, whereby you would train a classifier to annotate tweets with a certain predetermined set of topic labels. You could then use the labels to select tweets that are appropriate for whatever content you'll be augmenting. If you don't like using a predetermined set of topics, another approach would be to simply score tweets according to their semantic overlap with your content. You could then display the top n tweets with the most semantic overlap.
Supervised Document Classification
Using supervised document classification would require that you have a training set of tweets labeled with the set of topics you'll be using, e.g.:
tweet: NBA finals rocked label: sports
tweet: Googlers now allowed to use Ruby! label: programming
tweet: eating lunch label: other
If you want to collect training data without having to manually label the tweets with topics, you could use hashtags to assign topic labels to the tweets. The hashtags could be identical with the topic labels, or you could write rules to map tweets with certain hashtags to the desired label. For example, tweets tagged either #NFL or #NBA could all be assigned a label of sports.
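As a rough sketch of such a rule, the mapping table and the extract_label helper below are hypothetical names made up purely for illustration:

```python
import re

# Hypothetical mapping from hashtags to the topic labels you want to train on.
HASHTAG_TO_LABEL = {
    "#nfl": "sports",
    "#nba": "sports",
    "#ruby": "programming",
}

def extract_label(tweet_text):
    """Return a topic label for a tweet based on its hashtags, or None."""
    for hashtag in re.findall(r"#\w+", tweet_text.lower()):
        if hashtag in HASHTAG_TO_LABEL:
            return HASHTAG_TO_LABEL[hashtag]
    return None

print(extract_label("Googlers now allowed to use #Ruby!"))  # -> programming
```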
Once you have the tweets labeled by topic, you can use any number of existing software packages to train a classifier that assigns labels to new tweets. A few available packages include:
- NLTK (Python) - see Chapter 6 in the NLTK book on Learning to Classify Text
- Classifier4J (Java)
- nBayes (C#)
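If you go the NLTK route, a minimal sketch of training and applying such a classifier might look like the following; the tiny hand-labeled training set and the bag-of-words features helper are just illustrative stand-ins for the hashtag-derived data described above:

```python
import nltk

# Toy hand-labeled training set; in practice you would collect many tweets
# and derive the labels from hashtags as described above.
train = [
    ("NBA finals rocked", "sports"),
    ("Googlers now allowed to use Ruby!", "programming"),
    ("eating lunch", "other"),
]

def features(text):
    # Simple bag-of-words features: presence of each lowercased token.
    return {word: True for word in text.lower().split()}

train_set = [(features(text), label) for text, label in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Assign a topic label to a new, unseen tweet.
print(classifier.classify(features("Who won the NBA game last night?")))
```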
Semantic Overlap
Finding tweets using their semantic overlap with your content avoids the need for a labeled training set. The simplest way to estimate the semantic overlap between your content and the tweets that you're scoring is to use a vector space model. To do this, represent your document and each tweet as a vector with each dimension in the vector corresponding to a word. The value assigned to each vector position then represents how important that word is to the meaning of the document. One way to estimate this would be to simply use the number of times the word occurs in the document. However, you'll likely get better results by using something like TF/IDF, which up-weights rare terms and down-weights more common ones.
Once you've represented your content and the tweets as vectors, you can score the tweets by their semantic similarity to your content by taking the cosine similarity of the vector for your content and the vector for each tweet.
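If you'd like to see roughly what that scoring looks like in code, here is a minimal sketch assuming scikit-learn (a library the answer itself doesn't mention); the sample content and tweets are made up for illustration, and as noted below you could also lean on an existing package instead:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

content = "The city council voted to fund a new downtown basketball arena."
tweets = [
    "NBA finals rocked",
    "Googlers now allowed to use Ruby!",
    "So excited about the new arena downtown",
]

# Build TF/IDF vectors for the content and the candidate tweets together
# so they share one vocabulary, then score each tweet by cosine similarity.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([content] + tweets)
scores = cosine_similarity(vectors[0], vectors[1:]).flatten()

# Show tweets ranked by semantic overlap with the content.
for score, tweet in sorted(zip(scores, tweets), reverse=True):
    print(f"{score:.3f}  {tweet}")
```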
There's no need to code any of this yourself. You can just use a package like Classifier4J, which includes a VectorClassifier class that scores document similarity using a vector space model.
Better Semantic Overlap
One problem you might run into with vector space models that use one term per dimension is that they don't do a good job of handling different words that mean roughly the same thing. For example, such a model would say that there is no similarity between "The small automobile" and "A little car".
There are more sophisticated modeling frameworks such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) that can be used to construct more abstract representations of the documents being compared to each other. Such models can be thought of as scoring documents not based on simple word overlap, but rather in terms of overlap in the underlying meaning of the words.
In terms of software, the package Semantic Vectors provides a scalable LSA-like framework for document similarity. For LDA, you could use David Blei's implementation or the Stanford Topic Modeling Toolbox.
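None of those packages are Python, so purely as an illustrative assumption (the answer doesn't mention it), here is a minimal LSA-style sketch using the gensim library; the toy documents reuse the earlier example:

```python
from gensim import corpora, models, similarities

# Toy corpus standing in for your publication's content and candidate tweets.
documents = [
    "the small automobile",
    "a little car",
    "eating lunch downtown",
]

texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a small LSA (LSI) model and index the corpus for similarity queries.
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

# Score a new piece of text against every document in the index.
query = dictionary.doc2bow("tiny car".lower().split())
print(list(index[lsi[query]]))  # similarity of the query to each document
```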