This article covers handling a chunked UrlDataSource with Solr's DataImportHandler; the question and answer below may be a useful reference for anyone facing the same problem.

Question

I'm looking into chunking my data source for optimal data import into Solr, and was wondering if it is possible to use a master URL that chunks the data into sections.

For example, file 1 might contain:

<chunks>
  <chunk url="http://localhost/chunker?start=0&stop=100" />
  <chunk url="http://localhost/chunker?start=100&stop=200" />
  <chunk url="http://localhost/chunker?start=200&stop=300" />
  <chunk url="http://localhost/chunker?start=300&stop=400" />
  <chunk url="http://localhost/chunker?start=400&stop=500" />
  <chunk url="http://localhost/chunker?start=500&stop=600" />
</chunks>

Each chunk URL would return something like:

<items>
   <item data1="info1" />
   <item data1="info2" />
   <item data1="info3" />
   <item data1="info4" />
</iems>

I'm working with 500+ million records, so I think the data will need to be chunked to avoid memory issues (I ran into those when using the SQLEntityProcessor). I would also like to avoid making 500+ million web requests, as that could get expensive.
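For what it's worth, DIH can in principle consume a master URL like this with nested entities: an outer XPathEntityProcessor entity reads each chunk URL from the index document, and an inner entity fetches it through variable resolution. A rough sketch along those lines, using the URLs from the question (the entity and column names are assumptions, and untested; the accepted answer below ended up using a paginated feed instead):

<dataConfig>
    <dataSource name="master" type="URLDataSource" baseUrl="http://localhost/" encoding="UTF-8" />
    <document>
        <entity name="chunk"
                dataSource="master"
                url="chunks.xml"
                stream="true"
                processor="XPathEntityProcessor"
                forEach="/chunks/chunk">
            <field column="chunkUrl" xpath="/chunks/chunk/@url" />
            <entity name="item"
                    dataSource="master"
                    url="${chunk.chunkUrl}"
                    stream="true"
                    processor="XPathEntityProcessor"
                    forEach="/items/item">
                <field column="data1" xpath="/items/item/@data1" />
            </entity>
        </entity>
    </document>
</dataConfig>

With this shape, the request count is proportional to the number of chunks rather than the number of records.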

Answer

Due to the lack of examples on the internet, I figured I would post what I ended up using.

<?xml version="1.0" encoding="utf-8"?>
<result>
  <dataCollection func="chunked">
    <data info="test" info2="test" />
    <data info="test" info2="test" />
    <data info="test" info2="test" />
    <data info="test" info2="test" />
    <data info="test" info2="test" />
    <data info="test" info2="test" />
    <data hasmore="true" nexturl="http://server.domain.com/handler?start=0&amp;end=1000000000&amp;page=1&amp;pagesize=10"
  </dataCollection>
</result>

It's important to note that I specify that there is more on the next page and provide a URL to that next page. This is consistent with the Solr documentation for DataImportHandlers, which specifies that a paginated feed should tell the system that it has more and where to get the next batch.

<dataConfig>
    <dataSource name="b" type="URLDataSource" baseUrl="http://server/" encoding="UTF-8" />
    <document>
        <entity name="continue"
                dataSource="b"
                url="handler?start=${dataimport.request.startrecord}&amp;end=${dataimport.request.stoprecord}&amp;pagesize=100000"
                stream="true"
                processor="XPathEntityProcessor"
                forEach="/result/dataCollection/data"
                transformer="DateFormatTransformer"
                connectionTimeout="120000"
                readTimeout="300000"
                >
            <field column="id"  xpath="/result/dataCollection/data/@info" />
            <field column="id"  xpath="/result/dataCollection/data/@info" />
            <field column="$hasMore" xpath="/result/dataCollection/data/@hasmore" />
            <field column="$nextUrl" xpath="/result/dataCollection/data/@nexturl" />
        </entity>
    </document>

Note the $hasMore and $nextUrl fields. You may want to play with the timeouts. I also recommend allowing the page size to be specified (it helps when tweaking settings to find the optimal processing speed). I'm indexing at about 12.5K records per second using a multicore (3) Solr instance on a single server with a quad-core Xeon processor and 32GB of RAM.
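If you do make the page size a request parameter, the entity url above only needs a small change (a sketch; pagesize is no longer hardcoded and must then be supplied on the dataimport request):

url="handler?start=${dataimport.request.startrecord}&amp;end=${dataimport.request.stoprecord}&amp;pagesize=${dataimport.request.pagesize}"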

The app paginating the results uses the same system as the SQL server storing the data. I'm also passing in the start and stop positions to minimize configuration changes when we eventually load balance the Solr server....
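The answer doesn't include the handler itself, but its contract is clear from the feed above: given start, end, page, and pagesize, return one page of rows and append a hasmore/nexturl row while more remain. Below is a minimal sketch of such an endpoint as a Java servlet backed by JDBC; the /handler path matches the config above, while the connection string, table, and column names are invented for illustration:

import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical paginating endpoint producing the feed format shown above.
// The table and column names are invented for illustration.
@WebServlet("/handler")
public class ChunkHandler extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        long start = Long.parseLong(req.getParameter("start"));
        long end = Long.parseLong(req.getParameter("end"));
        int page = Integer.parseInt(req.getParameter("page"));
        int pageSize = Integer.parseInt(req.getParameter("pagesize"));

        resp.setContentType("text/xml;charset=UTF-8");
        PrintWriter out = resp.getWriter();
        out.println("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
        out.println("<result>");
        out.println("  <dataCollection func=\"chunked\">");

        int rows = 0;
        // Page through the [start, end) id window; attribute values should
        // be XML-escaped in real code.
        try (Connection con = DriverManager.getConnection("jdbc:sqlserver://...");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT info, info2 FROM records WHERE id >= ? AND id < ? "
                 + "ORDER BY id OFFSET ? ROWS FETCH NEXT ? ROWS ONLY")) {
            ps.setLong(1, start);
            ps.setLong(2, end);
            ps.setLong(3, (long) page * pageSize);
            ps.setInt(4, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows++;
                    out.printf("    <data info=\"%s\" info2=\"%s\" />%n",
                               rs.getString(1), rs.getString(2));
                }
            }
        } catch (SQLException e) {
            throw new IOException(e);
        }

        // A full page means there may be another batch: tell DIH where it is.
        if (rows == pageSize) {
            out.printf("    <data hasmore=\"true\" nexturl=\""
                    + "http://server.domain.com/handler?start=%d&amp;end=%d"
                    + "&amp;page=%d&amp;pagesize=%d\" />%n",
                    start, end, page + 1, pageSize);
        }
        out.println("  </dataCollection>");
        out.println("</result>");
    }
}

Emitting the hasmore row only when a full page came back means the final, short page naturally terminates the import.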

This concludes the article on a chunked UrlDataSource for Solr's DataImportHandler; hopefully the answer above is helpful.
