Emoticons in Twitter sentiment analysis in R

Question

How do I handle or get rid of emoticons so that I can sort tweets for sentiment analysis?

I am getting: Error in sort.list(y) : invalid input

Thanks

This is how the emoticons look once they come out of Twitter and into R:

\xed��\xed�\u0083\xed��\xed��
\xed��\xed�\u008d\xed��\xed�\u0089

Recommended answer

This should get rid of the emoticons, using iconv as suggested by ndoogan.

Some reproducible data:

require(twitteR)
# note that I had to register my twitter credentials first
# here's the method: http://stackoverflow.com/q/9916283/1036500
s <- searchTwitter('#emoticons', cainfo="cacert.pem")

# convert to data frame
df <- do.call("rbind", lapply(s, as.data.frame))

# inspect, yes there are some odd characters in row five
head(df)

                                                                                                                                                text
1                                                                      ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania  ;-)
2 "@teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons &amp; \nall the other stuff i cant see on android!" \n#Emoticons
3                      E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4                                                #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5  I use emoticons too much. #addicted #admittingit #emoticons <ed><U+00A0><U+00BD><ed><U+00B8><U+00AC><ed><U+00A0><U+00BD><ed><U+00B8><U+0081> haha
6                                                                                         What you text What I see #Emoticons http://t.co/BKowBSLJ0s

This is the key line that removes the emoticons:

# Clean text to remove odd characters
df$text <- sapply(df$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
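The same conversion can be tried without Twitter credentials. The following is a minimal sketch, assuming a small hand-made character vector (the `tweets` values are hypothetical and only imitate the byte sequences shown in row 5). Since `iconv` accepts a whole character vector, it is called directly here rather than through `sapply`. Treating the input as "latin1" makes every byte a valid character, and `sub = ""` drops each one that has no ASCII equivalent.

```r
# Hypothetical tweets: the second one carries raw non-ASCII bytes
# similar to the emoticon residue in the real data.
tweets <- c("plain ascii tweet ;-)",
            "I use emoticons too much \xed\xa0\xbd\xed\xb8\xac haha")

# Read the bytes as latin1 and convert to ASCII, replacing every
# non-convertible byte with the empty string.
cleaned <- iconv(tweets, from = "latin1", to = "ASCII", sub = "")

print(cleaned)
```

Note that this removes all non-ASCII characters, not only emoticons, so accented letters in non-English tweets are dropped as well. The exact behaviour of `iconv` depends on the platform's iconv implementation.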

Now inspect again to see whether the odd characters are gone (see row 5):

head(df)
                                                                                                                               text
1                                                                     ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania  ;-)
2 @teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons &amp; \nall the other stuff i cant see on android!" \n#Emoticons
3                     E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4                                               #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5                                                                                 I use emoticons too much. #addicted #admittingit #emoticons  haha
6                                                                                        What you text What I see #Emoticons http://t.co/BKowBSLJ0s
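With more than a handful of tweets, checking by eye becomes impractical. One possible way to check programmatically is sketched below. It relies on `iconv` returning `NA` for any element it cannot convert when `sub` is left at its default. The helper `has_non_ascii` and the `text` vector are hypothetical names introduced for illustration, standing in for `df$text`.

```r
# Hypothetical stand-in for df$text; element 2 carries non-ASCII bytes.
text <- c("What you text What I see #Emoticons",
          "too much #emoticons \xed\xa0\xbd\xed\xb8\xac haha")

# TRUE where the element cannot be converted to ASCII as-is
# (an element that is already NA is also reported as TRUE).
has_non_ascii <- function(x) {
  is.na(iconv(x, from = "latin1", to = "ASCII"))
}

before <- has_non_ascii(text)
after  <- has_non_ascii(iconv(text, "latin1", "ASCII", sub = ""))

print(which(before))  # indices of tweets that needed cleaning
print(any(after))     # whether anything non-ASCII is left
```

Running the check before cleaning also shows which tweets will lose characters, which may matter if the emoticons themselves carry sentiment.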

