This is my code for reading text from a CSV file and converting all the words in one column from plural to singular:
import pandas as pd
from textblob import TextBlob as tb
data = pd.read_csv(r'path\to\data.csv')
for i in range(len(data)):
    blob = tb(data['word'][i])
    singular = blob.words.singularize()  # This makes singular a list
    data['word'][i] = ''.join(singular)  # Converting the list back to a string
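But this code has now been running for several minutes (and might keep running for hours if I don't stop it?). Why is that? When I check a few words one at a time, the conversion happens instantly and takes no time at all. The file has only 1060 rows (words to convert).
Edit: It finished running in about 10-12 minutes.
Here is some sample data:
Input: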
word
development
investment
funds
slow
company
commit
pay
claim
finances
customers
claimed
insurance
comment
rapid
bureaucratic
affairs
reports
policyholders
detailed
Output:
word
development
investment
fund
slow
company
commit
pay
claim
finance
customer
claimed
insurance
comment
rapid
bureaucratic
affair
report
policyholder
detailed
Best answer
How about this?
In [1]: import pandas as pd
In [2]: from textblob import Word
In [3]: s = pd.read_csv('text', squeeze=True, memory_map=True)
In [4]: type(s)
Out[4]: pandas.core.series.Series
In [5]: s = s.apply(lambda w: Word(w).singularize())
In [6]: s
Out[6]:
0 development
1 investment
2 fund
3 slow
4 company
5 commit
6 pay
7 claim
8 finance
9 customer
10 claimed
11 insurance
12 comment
13 rapid
14 bureaucratic
15 affair
16 report
17 policyholder
18 detailed
Name: word, dtype: object
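I use squeeze here so that read_csv returns a Series instead of a DataFrame, since the word file has only one column. In addition, if the word file is large, you can use memory_map. Could you test the performance with your data?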