Problem description
I have a text file which was converted to a dataframe using the command below:
df = pd.read_csv("C:\\Users\\Sriram\\Desktop\\New folder (4)\\aclImdb\\test\\result.txt", sep = '\t', names=['reviews','polarity'])
Here the reviews column contains all the movie reviews, and the polarity column indicates whether each review is positive or negative.
I have the feature function below, to which my reviews column (nearly 1,000 reviews) from the dataframe needs to be passed.
def find_features(document):
    words = word_tokenize(document)   # document must be a single string
    features = {}
    for w in word_features:           # word_features: global list defined earlier in the notebook
        features[w] = (w in words)
    return features
I am creating a training dataset using the call below.
trainsets = [find_features(df.reviews), df.polarity]
Hence, by doing this, all the words in my reviews column will be split as a result of the tokenize call in find_features, and each will be assigned a polarity (positive or negative).
For example:
reviews                              polarity
This is a poor excuse for a movie    negative
For the above case, after calling the find_features function, if the condition inside the function is satisfied, I will get output such as:
poor - negative
excuse - negative
and so on....
While trying to call this function, I get the error below:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-79-76f9090c0532> in <module>()
     30     return features
     31
---> 32 featuresets = [find_features(df.reviews), df.polarity]
     33 #featuresets = [(find_features(rev), category) for ((rev, category)) in reviews]
     34 '''

<ipython-input-79-76f9090c0532> in find_features(document)
     24
     25 def find_features(document):
---> 26     words = word_tokenize(document)
     27     features = {}
     28     for w in word_features:

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language)
    102     :param language: the model name in the Punkt corpus
    103     """
--> 104     return [token for sent in sent_tokenize(text, language)
    105             for token in _treebank_word_tokenize(sent)]
    106

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
     87     """
     88     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 89     return tokenizer.tokenize(text)
     90
     91 # Standard word tokenizer.

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
   1224         Given a text, returns a list of the sentences in that text.
   1225         """
-> 1226         return list(self.sentences_from_text(text, realign_boundaries))
   1227
   1228     def debug_decisions(self, text):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
   1272         follows the period.
   1273         """
-> 1274         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1275
   1276     def _slices_from_text(self, text):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
   1263         if realign_boundaries:
   1264             slices = self._realign_boundaries(text, slices)
-> 1265         return [(sl.start, sl.stop) for sl in slices]
   1266
   1267     def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
   1263         if realign_boundaries:
   1264             slices = self._realign_boundaries(text, slices)
-> 1265         return [(sl.start, sl.stop) for sl in slices]
   1266
   1267     def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
   1302         """
   1303         realign = 0
-> 1304         for sl1, sl2 in _pair_iter(slices):
   1305             sl1 = slice(sl1.start + realign, sl1.stop)
   1306             if not sl2:

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
    308     """
    309     it = iter(it)
--> 310     prev = next(it)
    311     for el in it:
    312         yield (prev, el)

C:\Users\Sriram\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
   1276     def _slices_from_text(self, text):
   1277         last_break = 0
-> 1278         for match in self._lang_vars.period_context_re().finditer(text):
   1279             context = match.group() + match.group('after_tok')
   1280             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object
How do I call a function directly on a dataframe column that has multiple rows of values (in my case, reviews)?
Solution

Going by the expected output you mentioned:
poor - negative
excuse - negative
I would suggest:

trainsets = df.apply(lambda row: ([(kw, row.polarity) for kw in find_features(row.reviews)]), axis=1)
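This works because df.reviews refers to the entire column as a pandas Series, while word_tokenize expects a single string; that mismatch is what raises the TypeError. With apply(..., axis=1), each row.reviews is one plain string. A minimal sketch of the distinction, using a hypothetical toy dataframe:

import pandas as pd

# hypothetical toy dataframe mirroring the question's layout
df = pd.DataFrame({'reviews': ['This is a poor excuse for a movie'],
                   'polarity': ['negative']})

print(type(df.reviews))     # <class 'pandas.core.series.Series'> -> word_tokenize(df.reviews) raises TypeError
print(type(df.reviews[0]))  # <class 'str'> -> word_tokenize(df.reviews[0]) tokenizes fine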
Adding a sample snippet for reference:
import pandas as pd
from StringIO import StringIO

print 'pandas-version: ', pd.__version__

data_str = """
col1,col2
'leoperd lion tiger','non-veg'
'buffalo antelope elephant','veg'
'dog cat crow','all'
"""
data_str = StringIO(data_str)

# a dataframe with 2 columns
df = pd.read_csv(data_str)

# a dummy function that takes a col1 value from each row,
# splits it into multiple values & returns a list
def my_fn(row_val):
    return row_val.split(' ')

# calling a row-wise apply vector operation on the dataframe
train_set = df.apply(lambda row: ([(kw, row.col2) for kw in my_fn(row.col1)]), axis=1)
print train_set
Output:

pandas-version:  0.15.2
0    [('leoperd, 'non-veg'), (lion, 'non-veg'), (ti...
1    [('buffalo, 'veg'), (antelope, 'veg'), (elepha...
2    [('dog, 'all'), (cat, 'all'), (crow', 'all')]
dtype: object
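Note that the snippet above is Python 2 code (print statements, the StringIO module), while the traceback in the question shows an Anaconda3 install. A Python 3 sketch of the same idea; only the import and the print calls change:

import pandas as pd
from io import StringIO  # in Python 3, StringIO lives in the io module

print('pandas-version:', pd.__version__)

data_str = """col1,col2
'leoperd lion tiger','non-veg'
'buffalo antelope elephant','veg'
'dog cat crow','all'
"""

# a dataframe with 2 columns
df = pd.read_csv(StringIO(data_str))

# dummy tokenizer: split a col1 value into individual words
def my_fn(row_val):
    return row_val.split(' ')

# row-wise apply: row.col1 is a single string here, so it can be split
train_set = df.apply(lambda row: [(kw, row.col2) for kw in my_fn(row.col1)], axis=1)
print(train_set)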
@SriramChandramouli, hope I understood your requirement correctly.
This concludes this article on "Python: getting TypeError: expected string or bytes-like object when calling a function". We hope the recommended answer is helpful, and thank you for your support!