Problem description
I have a CSV file that looks something like this:
text
RT @CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://…
#CRISPR Inversion of CTCF Sites Alters Genome Topology & Enhancer/Promoter Function in @CellCellPress htp://.co/HrjDwbm7NN
RT @gvwilson: Where's the theory for software engineering? Behind a paywall, that's where. htp://.co/1t3TymiF3M #semat #fail
RT @sciencemagazine: What’s killing off the sea stars? htp://.co/J19FnigwM9 #ecology
RT @MHendr1cks: Eve Marder describes a horror that is familiar to worm connectome gazers. htp://.co/AEqc7NOWoR via @nucAmbiguous htp://…
I want to extract all the mentions (starting with '@') from the tweet text. So far I have done this:
import pandas as pd
import re
mydata = pd.read_csv("C:/Users/file.csv")
X = mydata.ix[:,:]
X=X.iloc[:,:1] #I have multiple columns so I'm selecting the first column only that is 'text'
for i in range(X.shape[0]):
    result = re.findall("(^|[^@\w])@(\w{1,25})", str(X.iloc[:i,:]))
print(result);
There are two problems here. First, str(X.iloc[:1,:]) gives me ['CritCareMed'], which is not right, as it should give me ['CellCellPress']. Then str(X.iloc[:2,:]) again gives me ['CritCareMed'], which is also wrong. The final result I'm getting is:
[(' ', 'CritCareMed'), (' ', 'gvwilson'), (' ', 'sciencemagazine')]
It doesn't include the mention in the 2nd row or either of the two mentions in the last row. What I want should look something like this:
How can I achieve these results? This is just sample data; my original data has lots of tweets, so is this approach OK?
Recommended answer
You can use the str.findall method to avoid the for loop, and use a negative lookbehind in place of (^|[^@\w]), which forms another capture group you don't need in your regex:
df['mention'] = df.text.str.findall(r'(?<![@\w])@(\w{1,25})').apply(','.join)
df
# text mention
#0 RT @CritCareMed: New Article: Male-Predominant... CritCareMed
#1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
#2 RT @gvwilson: Where's the theory for software ... gvwilson
#3 RT @sciencemagazine: What’s killing off the se... sciencemagazine
#4 RT @MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
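To see why the extra capture group matters, the short sketch below compares the two patterns on a shortened version of the last sample tweet. The variable names (tweet, grouped, lookbehind) are only illustrative and are not part of the original code.

```python
import re

# Shortened version of the last sample tweet (URLs left out).
tweet = "RT @MHendr1cks: Eve Marder describes a horror via @nucAmbiguous"

# Original pattern: two capture groups, so findall returns one tuple per match.
grouped = re.findall(r"(^|[^@\w])@(\w{1,25})", tweet)

# Lookbehind pattern: a single capture group, so findall returns plain strings.
lookbehind = re.findall(r"(?<![@\w])@(\w{1,25})", tweet)

print(grouped)
print(lookbehind)
```

With the first pattern each match carries the preceding character along as a tuple element; with the lookbehind only the user names remain, which is what ','.join needs.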
Also, X.iloc[:i,:] gives back a data frame, so str(X.iloc[:i,:]) gives you the string representation of a data frame, which is very different from the element in the cell. To extract the actual string from the text column, you can use X.text.iloc[0]. A better way to iterate through a column is iteritems (called items in current pandas, where iteritems has been removed):
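import re
# items() is the current name for iteritems(), which was removed in pandas 2.0
for index, s in df.text.items():
    # s is the actual tweet string from the cell, not a printed data frame
    result = re.findall(r"(?<![@\w])@(\w{1,25})", s)
    print(','.join(result))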
#CritCareMed
#CellCellPress
#gvwilson
#sciencemagazine
#MHendr1cks,nucAmbiguous
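The missing mentions in the question can be reproduced with a small sketch. It assumes the pandas option display.max_colwidth is at its usual default of 50 (set explicitly here so the behaviour is predictable), and the one-row frame is made up from the second sample tweet with the URL left out.

```python
import re
import pandas as pd

pattern = r"(?<![@\w])@(\w{1,25})"

# Hypothetical one-row frame built from the second sample tweet.
frame = pd.DataFrame({"text": [
    "#CRISPR Inversion of CTCF Sites Alters Genome Topology & "
    "Enhancer/Promoter Function in @CellCellPress"
]})

# str() of a data frame is its printed form; long cells are cut off with "...".
with pd.option_context("display.max_colwidth", 50):
    rendered = str(frame)

from_repr = re.findall(pattern, rendered)            # searches the truncated printout
from_cell = re.findall(pattern, frame.text.iloc[0])  # searches the real cell value

print(from_repr)
print(from_cell)
```

The search over the printed frame finds nothing because the mention lies beyond the cut-off point, while the search over the cell value returns CellCellPress.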