Question
My first post to Stack Overflow so be gentle please! I am about to start a new Ruby on Rails (3.1) project for a client. One of their requirements is that there is a search engine, which will be indexing roughly 2,000 documents which are a mixture of PDF, Word, Excel and HTML.
I had hoped to use either thinking-sphinx or Texticle (most popular at https://www.ruby-toolbox.com/categories/rails_search.html) but as I understand it:
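- Texticle requires PostgreSQL. I am using MySQL.
- thinking-sphinx doesn't index files on the file system.
- Even if I save the attachments to the database, thinking-sphinx still can't be used, because it needs plain text (according to http://groups.google.com/group/thinking-sphinx/browse_thread/thread/69cdc1c8e1c096ff)
So I have two options:
- Choose a different search tool
- Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read
Which approach would you recommend?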
If it's a different search tool, which one? My requirements are pretty basic so I'd really like one that's very easy to set up and has lots of documentation, examples and tutorials!
If it's extracting, can you recommend extractors for common file types such as PDF, Word, Excel and HTML?
Thanks everyone. Really appreciate your help.
Recommended answer
Just to update this. The approach I've decided to go with is:
Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read
Specifically, I'll be doing the following:
- Using thinking-sphinx
- Using the subexec gem to call ...
- ... Tika from the command line
It looks as if it will be as simple as calling java -jar tika-app-0.10.jar -t [file]
but I'll post my experiences if it turns out to be more complicated!
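For anyone wanting to try the same thing, here is a rough sketch of how the Tika call might be wrapped from Ruby. It is not the code from my project: it uses Open3 from the standard library instead of the subexec gem so that it runs without extra dependencies, and the helper names (tika_command, run_extractor, extract_text) and the default jar location are assumptions made up for illustration. It also assumes java is on the PATH.

```ruby
require "open3"

# Hypothetical helper: builds the command line quoted above as an argv array.
# The default jar path is an assumption; point it at wherever the jar lives.
def tika_command(file_path, jar_path = "tika-app-0.10.jar")
  ["java", "-jar", jar_path, "-t", file_path]
end

# Runs a command given as an argv array (no shell involved, so file names
# with spaces or quotes are passed through safely) and returns its stdout.
# Raises if the process exits with a non-zero status.
def run_extractor(command)
  stdout, stderr, status = Open3.capture3(*command)
  raise "extractor failed: #{stderr.strip}" unless status.success?
  stdout
end

# Hypothetical convenience wrapper: plain text of one attachment via Tika.
def extract_text(file_path, jar_path = "tika-app-0.10.jar")
  run_extractor(tika_command(file_path, jar_path))
end
```

The returned string could then be saved to a text column on the model so that thinking-sphinx has plain text to index. Unlike subexec, this sketch has no timeout, so a hung Java process would block the caller.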