Question
I'm trying to write a Python function that, given the path to a document file, returns the number of words in that document. This is fairly easy to do with .txt files, and there are tools that allow me to hack support for a few more complex document formats together, but I want a really comprehensive solution.
Looking at OpenOffice.org's py-uno scripting interface and list of supported formats, it would seem ideal to load the documents in a headless OOo and call its word-count function. However, I can't find any py-uno tutorials or sample code that go beyond basic document generation, and even the code snippets I have found are out of date by a half-decade and no longer work.
Whether by using OOo and Uno or not, how can I get reliable word-counts for documents of various formats?
Recommended answer
"load the documents in a headless OOo and call its word-count function"
PyODConverter is a recent (November 2009) script that uses OOo to convert between multiple file types. Looking at the script, it has basic loading code for every document type OOo supports.
This is how you start OOo as a headless service:
soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard
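Once the service is listening, a script can connect to it over the socket and ask a loaded document for its word count. The following is only a rough sketch, not PyODConverter's code. It assumes a Python 3 interpreter that can import the uno module (for example the one bundled with the office suite), the host and port from the command above, and a document that OOo opens as a text document, since the WordCount property is only expected on text documents. The helper names to_file_url and count_words are made up for this example.

```python
import os
from pathlib import Path


def to_file_url(path):
    """Turn a local path into the file:// URL that OOo expects."""
    return Path(os.path.abspath(path)).as_uri()


def count_words(path, host="127.0.0.1", port=8100):
    """Load a document in a running headless OOo and return its word count.

    Assumes the service was started with the -accept option shown above.
    """
    # Imported lazily so the module can be loaded without PyUNO installed.
    import uno
    from com.sun.star.beans import PropertyValue

    # Connect to the running office process through the UNO URL resolver.
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        "uno:socket,host=%s,port=%d;urp;StarOffice.ComponentContext"
        % (host, port))
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    # Load the document without opening a window.
    hidden = PropertyValue()
    hidden.Name = "Hidden"
    hidden.Value = True
    doc = desktop.loadComponentFromURL(to_file_url(path), "_blank", 0, (hidden,))
    if doc is None:
        raise IOError("OOo could not load %r" % path)
    try:
        # Assumption: the file was opened as a text document; spreadsheets
        # and presentations do not offer this property.
        return doc.WordCount
    finally:
        doc.close(True)
```

With the service running, count_words("report.doc") would return an integer for formats that Writer can import; other document types would need a different approach.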
Then you just have to write a small bootstrapper that starts OOo from the command line, runs your script, and then shuts OOo down.
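Such a bootstrapper could look roughly like the sketch below. It assumes soffice is on the PATH and takes the same options as the command above; the function names wait_for_port and run_with_service and the timeout values are made up for this example. On some installs soffice is a wrapper that spawns a separate process, so terminating it may not be enough to shut the office process down.

```python
import socket
import subprocess
import time

# Same options as the command above; no shell quoting is needed in a list.
SOFFICE_CMD = [
    "soffice",
    "-headless",
    "-accept=socket,host=127.0.0.1,port=8100;urp;",
    "-nofirststartwizard",
]


def wait_for_port(host, port, timeout=30.0):
    """Poll until something accepts TCP connections on host:port."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.25)
    return False


def run_with_service(task, cmd=SOFFICE_CMD, host="127.0.0.1", port=8100,
                     timeout=30.0):
    """Start cmd, wait for the port to open, run task(), then stop cmd."""
    proc = subprocess.Popen(cmd)
    try:
        if not wait_for_port(host, port, timeout):
            raise RuntimeError("service did not start listening in time")
        return task()
    finally:
        # Ask the process to exit, and force it if it does not comply.
        proc.terminate()
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
```

The task argument is any zero-argument callable, so the word-counting script can be passed in as a function or a lambda, and the office process is stopped even if the task raises.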