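I want to analyze text data in an Excel file.
I know how to read an Excel file with Python, but each row comes back as a single list of values. What I want is to analyze the text in each cell.
Here is an example of my Excel file: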
NAME  INDUSTRY     INFO
A     FINANCIAL    THIS COMPANY IS BLA BLA BLA
B     MANUFACTURE  IT IS LALALALALALALALALA
C     FINANCIAL    THAT IS SOSOSOSOSOSOSOSO
D     AGRICULTURE  WHYWHYWHYWHYWHY
I would like to analyze, say, the financial industry's company info using NLTK, such as the frequency of "IT".
This is what I have so far (yes, it doesn't work!):
import xlrd
aa='c:/book3.xls'
wb = xlrd.open_workbook(aa)
wb.sheet_names()
sh = wb.sheet_by_index(0)
for rownum in range(sh.nrows):
    print nltk.word_tokenize(sh.row_values(rownum))
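Best answer
You are passing all the values in a row to word_tokenize, but you are only interested in the contents of the third column. You are also processing the heading row. Try this: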
import nltk
import xlrd

book = xlrd.open_workbook("your_input_file.xls")
sheet = book.sheet_by_index(0)
for row_index in range(1, sheet.nrows):  # skip heading row
    # Take only the first three columns: NAME, INDUSTRY, INFO
    name, industry, info = sheet.row_values(row_index, end_colx=3)
    print("Row %d: name=%r industry=%r info=%r" %
          (row_index + 1, name, industry, info))
    print(nltk.word_tokenize(info))  # or whatever else you want to do
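To get from the loop above to the count the question asks for (how often a word such as "IT" occurs in the FINANCIAL rows), one possible sketch follows. It is written for Python 3, and ROWS, simple_tokenize and word_counts_for_industry are hypothetical names introduced here for illustration. ROWS stands in for the tuples that sheet.row_values(row_index, end_colx=3) would yield, and a regex tokenizer with collections.Counter is swapped in for nltk.word_tokenize and nltk.FreqDist so that the example runs without the NLTK data files.

```python
import re
from collections import Counter

# Hypothetical stand-in for the rows read from the sheet, copied from the
# sample table in the question (heading row already skipped).
ROWS = [
    ("A", "FINANCIAL", "THIS COMPANY IS BLA BLA BLA"),
    ("B", "MANUFACTURE", "IT IS LALALALALALALALALA"),
    ("C", "FINANCIAL", "THAT IS SOSOSOSOSOSOSOSO"),
    ("D", "AGRICULTURE", "WHYWHYWHYWHYWHY"),
]


def simple_tokenize(text):
    """Split text into runs of word characters (a stand-in for nltk.word_tokenize)."""
    return re.findall(r"\w+", text)


def word_counts_for_industry(rows, industry, tokenize=simple_tokenize):
    """Count tokens in the INFO column of every row whose INDUSTRY matches."""
    counts = Counter()
    for name, row_industry, info in rows:
        # Compare case-insensitively and ignore stray whitespace in the cell.
        if row_industry.strip().upper() == industry.upper():
            counts.update(token.upper() for token in tokenize(info))
    return counts


financial = word_counts_for_industry(ROWS, "FINANCIAL")
print(financial["IS"])  # occurrences of "IS" in the FINANCIAL rows
print(financial["IT"])  # a Counter reports 0 for a word it never saw
```

Passing tokenize=nltk.word_tokenize should give NLTK's own tokenization instead, assuming NLTK and its tokenizer data are installed.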
This question about using NLTK with Python on an Excel file comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/7943145/