Splitting a document with MarkLogic mlcp

Problem description

I need to split this document:

<?xml version="1.0"?>
<!DOCTYPE docs SYSTEM "../rom11.dtd">
<docs>
  <stwtext id="RD-10-00258" update="03.2011" seq="RQ-10-00001">
    <head>
      <ti>
        <i>j</i>
      </ti>
      <ff-list>
        <ff id="0103" />
      </ff-list>
    </head>
    <p>
      Symbol f&#x00FC;r die
      <vw idref="RD-19-04447">Stromdichte</vw>
      .
    </p>
  </stwtext>

  <stwtext id="RD-10-00209" update="12.2007" seq="RQ-10-00223">
    <head>
      <ti>JZ</ti>
      <ff-list>
        <ff id="0932" />
      </ff-list>
    </head>
    <p>
      Abk&#x00FC;rzung f&#x00FC;r Jod-Zahl, siehe
      <vw idref="RD-06-00645">Fettkennzahlen</vw>
      .
    </p>
  </stwtext>

</docs>

I do it with this command:

~> bin/mlcp.sh IMPORT -mode local -host localhost -port 15000 \
  -username admin -password admin \
  -input_file_path /media/sf_vm.shared/theme/rom-training/v10.new-ML.XML \
  -output_uri_replace "/media/sf_vm.shared/theme/rom-training/keywords,'rom-data'" \
  -output_collections rom-data \
  -input_file_type aggregates -aggregate_record_element stwtext \
  -aggregate_uri_id @id

The command runs without errors, but the documents I see in MarkLogic are not named after the id attribute of each stwtext element. Instead they are named after the id of the last element inside it. For example, for my document I expect to see:

RD-10-00258
RD-10-00209

but actually it looks like this:

0103
0932

Is this a bug, or did I do something wrong? Thanks.

Recommended answer

It's a bug. If you'd like to, you can download the source code for MLCP and change it. Take a look at processStartElement() in AggregateXMLReader.java.
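
If patching MLCP is not an option, one possible workaround is to split the aggregate file yourself before loading it. The sketch below is not MLCP code and not the fix the answer refers to. It is a minimal Python illustration that keys each record by the id attribute of the stwtext element itself. The function names split_records and last_descendant_id are hypothetical, the sample inlines the two records from the question with the DOCTYPE line omitted, and the "last id attribute wins" behaviour is only an assumption about what the reported symptom suggests.

```python
import xml.etree.ElementTree as ET

# The two records from the question, inlined so the sketch is self-contained.
SAMPLE = """<?xml version="1.0"?>
<docs>
  <stwtext id="RD-10-00258" update="03.2011" seq="RQ-10-00001">
    <head><ti><i>j</i></ti><ff-list><ff id="0103" /></ff-list></head>
    <p>Symbol f&#x00FC;r die <vw idref="RD-19-04447">Stromdichte</vw>.</p>
  </stwtext>
  <stwtext id="RD-10-00209" update="12.2007" seq="RQ-10-00223">
    <head><ti>JZ</ti><ff-list><ff id="0932" /></ff-list></head>
    <p>Abk&#x00FC;rzung f&#x00FC;r Jod-Zahl, siehe <vw idref="RD-06-00645">Fettkennzahlen</vw>.</p>
  </stwtext>
</docs>
"""


def split_records(xml_text, record_tag="stwtext", id_attr="id"):
    """Return {record id: serialized record}, keyed by the record's OWN id attribute."""
    root = ET.fromstring(xml_text)
    records = {}
    for record in root.iter(record_tag):
        record_id = record.get(id_attr)  # attribute of the record element only
        if record_id is None:
            continue  # skip records that carry no id
        record.tail = None  # drop whitespace that follows the closing tag
        records[record_id] = ET.tostring(record, encoding="unicode")
    return records


def last_descendant_id(record, id_attr="id"):
    """Mimic the reported symptom: the last id attribute seen in the record wins."""
    last = None
    for element in record.iter():  # the record itself first, then its descendants
        if id_attr in element.attrib:
            last = element.attrib[id_attr]
    return last


if __name__ == "__main__":
    print(list(split_records(SAMPLE)))
    root = ET.fromstring(SAMPLE)
    print([last_descendant_id(r) for r in root.iter("stwtext")])
```

Each serialized record could then be written to its own file and loaded as an ordinary document, so the URI no longer depends on how the aggregate reader resolves @id.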


08-23 00:50