Problem description
I'd love to know how Stack Overflow's tagging and search are architected, because they seem to work pretty well.
What is a good database/search model if I want to do all of the following:
1. Storing Tags on various entities (how normalized? i.e. Entity, Tag, and Entity_Tag tables?)
2. Searching for items with particular tags
3. Building a tag cloud of all tags that apply to a particular search result set
4. How to show a tag list for each item in a search result?
Perhaps it makes sense to store the tags in a normalized form, but also as a space-delimited string for the purposes of #2, #4, and perhaps #3. Thoughts?
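To make the idea concrete, here is a minimal sketch of the normalized three-table layout with an extra space-delimited copy of the tags on each entity. It uses Python's built-in `sqlite3` module purely for illustration; the table and column names (`entity`, `tag`, `entity_tag`, `tag_string`) and the helper functions are assumptions, not anything Stack Overflow is known to use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    id         INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    tag_string TEXT NOT NULL DEFAULT ''   -- denormalized, space-delimited copy
);
CREATE TABLE tag (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE entity_tag (
    entity_id INTEGER NOT NULL REFERENCES entity(id),
    tag_id    INTEGER NOT NULL REFERENCES tag(id),
    PRIMARY KEY (entity_id, tag_id)
);
""")

def add_entity(title, tags):
    """Insert an entity, writing both the normalized rows and tag_string."""
    names = sorted(set(tags))
    cur = conn.execute(
        "INSERT INTO entity (title, tag_string) VALUES (?, ?)",
        (title, " ".join(names)),
    )
    entity_id = cur.lastrowid
    for name in names:
        conn.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT id FROM tag WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute("INSERT INTO entity_tag VALUES (?, ?)", (entity_id, tag_id))
    return entity_id

def entities_tagged(name):
    """Find entities carrying a tag; tag_string gives the tag list with no extra join."""
    return conn.execute(
        """SELECT e.title, e.tag_string
           FROM entity e
           JOIN entity_tag et ON et.entity_id = e.id
           JOIN tag t ON t.id = et.tag_id
           WHERE t.name = ?
           ORDER BY e.id""",
        (name,),
    ).fetchall()

add_entity("Tagging schema question", ["sql", "tagging"])
add_entity("Lucene indexing question", ["lucene", "tagging"])
add_entity("Unrelated question", ["css"])
```

Calling `entities_tagged("tagging")` goes through the normalized tables to find the matches, while each returned row already carries its full tag list in `tag_string`, which is the trade-off the paragraph above is asking about.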
I have heard it said that Stack Overflow uses Lucene for search. Is that true? I've heard a couple of podcasts discussing SQL optimization, but nothing about Lucene. If they do use Lucene, I'm wondering how much of the search result comes from Lucene, and whether the "drill-down" tag cloud comes from Lucene.
Wow, I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. Aaargh.
So here I go again...
Regarding Stack Overflow, it turns out that they use SQL Server 2005 full-text search.
Regarding the open-source projects recommended by @Grant:
- DotNetKicks uses the DB for tagging and Lucene for full-text search. There appears to be no way to combine a full-text search with a tag search.
- Kigg uses Linq-to-SQL for both search and tag queries. Both queries join Stories->StoryTags->Tags.
- Both projects have a 3-table approach to tagging as everyone generally seems to recommend
I also found some other questions on SO that I'd missed before:
- How Do You Recommend Implementing Tags or Tagging?
- How to structure data for searchability?
- Database Design for Tagging
What I'm currently doing for each of the items I mentioned:
- In the DB, 3 tables: Entity, Tag, Entity_Tag. I use the DB to:
  - Build site-wide tag clouds
  - Browse by tag (e.g. URLs like SO's /questions/tagged/ASP.NET)
- For search I use Lucene + NHibernate.Search
  - Tags are concatenated into a TagString that is indexed by Lucene
    - So I have the full power of the Lucene query engine (AND / OR / NOT queries)
    - I can search for text and filter by tags at the same time
    - The Lucene analyzer merges word variants for better tag searches (e.g. a tag search for "test" will also find stuff tagged "testing")
  - Lucene returns a potentially enormous result set, which I paginate to 20 results
  - Then NHibernate loads the result Entities by Id, either from the DB or the Entity cache
  - So it's entirely possible that a search results in 0 hits to the DB
- Not doing this yet, but I think I will probably try to find a way to build the tag cloud from the TagString in Lucene, rather than take another DB hit
- Haven't done this yet either, but I will probably store the TagString in the DB so that I can show an Entity's Tag list without having to make 2 more joins.
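As a rough illustration of the last two ideas, the sketch below builds a tag cloud for a result set from nothing but each hit's tag string, and slices the results into pages. It is plain Python and does not use Lucene or NHibernate; the hit dictionaries and the `tag_cloud` and `paginate` helpers are hypothetical stand-ins.

```python
from collections import Counter

def paginate(results, page, page_size=20):
    """Return one page (1-based) of an already-ranked result list."""
    start = (page - 1) * page_size
    return results[start:start + page_size]

def tag_cloud(results):
    """Count each tag across the whole result set using only tag_string."""
    counts = Counter()
    for hit in results:
        counts.update(hit["tag_string"].split())
    return counts

# Hypothetical search hits; in the setup described above these would
# come back from the search index rather than being written by hand.
hits = [
    {"id": 1, "tag_string": "lucene search"},
    {"id": 2, "tag_string": "lucene nhibernate"},
    {"id": 3, "tag_string": "search sql"},
]
cloud = tag_cloud(hits)
first_page = paginate(hits, page=1, page_size=2)
```

Note that the cloud is computed over the full result set, not just the current page, so with very large result sets this counting step would itself need to be bounded or cached.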
This means that whenever an Entity's tags are modified, I have to:
- Insert any new Tags that do not already exist
- Insert/Delete from the Entity_Tag table
- Update Entity.TagString
- Update the Lucene index for the Entity
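The four steps above can be sketched as a single function. This is an illustration with Python's `sqlite3`, not the NHibernate code I actually run: the schema names, the `set_tags` helper, and the `reindex_requests` list (a stand-in for the Lucene update) are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (id INTEGER PRIMARY KEY, tag_string TEXT NOT NULL DEFAULT '');
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE entity_tag (
    entity_id INTEGER NOT NULL,
    tag_id    INTEGER NOT NULL,
    PRIMARY KEY (entity_id, tag_id)
);
INSERT INTO entity (id) VALUES (1);
""")

reindex_requests = []  # hypothetical stand-in for "update the Lucene index"

def set_tags(entity_id, tags):
    names = sorted(set(tags))
    with conn:  # commit all DB steps together, or roll back on error
        # 1. Insert any new tags that do not already exist.
        conn.executemany(
            "INSERT OR IGNORE INTO tag (name) VALUES (?)", [(n,) for n in names]
        )
        # 2. Replace the entity's rows in entity_tag with the new tag set.
        conn.execute("DELETE FROM entity_tag WHERE entity_id = ?", (entity_id,))
        if names:
            marks = ",".join("?" * len(names))
            conn.execute(
                "INSERT INTO entity_tag (entity_id, tag_id) "
                f"SELECT ?, id FROM tag WHERE name IN ({marks})",
                (entity_id, *names),
            )
        # 3. Update the denormalized tag string.
        conn.execute(
            "UPDATE entity SET tag_string = ? WHERE id = ?",
            (" ".join(names), entity_id),
        )
    # 4. Ask for the entity to be re-indexed (only recorded here).
    reindex_requests.append(entity_id)

set_tags(1, ["lucene", "search"])
set_tags(1, ["search", "sql"])
```

Deleting and re-inserting the join rows is simpler than diffing the old and new tag sets, at the cost of a few extra writes; tags that are no longer used by any entity are left in the `tag` table.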
Given that the ratio of reads to writes is very high in my application, I think I'm OK with this. The only really time-consuming part is Lucene indexing: because Lucene can only insert into and delete from its index, I have to re-index the entire entity in order to update the TagString. I'm not excited about that, but I think that if I do it in a background thread, it will be fine.
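For what the background approach might look like, here is a small sketch using Python's `queue` and `threading` modules. The `index` and `entities` dictionaries are in-memory stand-ins for the Lucene index and the database, and the worker mimics the delete-then-add pattern; none of this is real Lucene or NHibernate.Search API.

```python
import queue
import threading

# Stand-in for the DB: entity id -> current entity data.
entities = {1: {"title": "Tagging question", "tag_string": "sql tagging"}}
index = {}               # stand-in for the search index: entity id -> document
pending = queue.Queue()  # ids of entities waiting to be re-indexed

def reindex_worker():
    while True:
        entity_id = pending.get()
        if entity_id is None:      # sentinel: stop the worker
            pending.task_done()
            break
        # No in-place update: drop the old document, then add the whole entity again.
        index.pop(entity_id, None)
        index[entity_id] = dict(entities[entity_id])
        pending.task_done()

worker = threading.Thread(target=reindex_worker, daemon=True)
worker.start()

# The request path only enqueues the id and returns immediately.
pending.put(1)
entities[1]["tag_string"] = "lucene sql tagging"
pending.put(1)

pending.join()           # wait here only so the example is deterministic
pending.put(None)
worker.join()
```

The catch with this design is that search results can lag behind the database for as long as an id sits in the queue, which seems acceptable when reads far outnumber writes.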
Time will tell...