Problem description
I need to split this document:
<?xml version="1.0"?>
<!DOCTYPE docs SYSTEM "../rom11.dtd">
<docs>
<stwtext id="RD-10-00258" update="03.2011" seq="RQ-10-00001">
<head>
<ti>
<i>j</i>
</ti>
<ff-list>
<ff id="0103" />
</ff-list>
</head>
<p>
Symbol für die
<vw idref="RD-19-04447">Stromdichte</vw>
.
</p>
</stwtext>
<stwtext id="RD-10-00209" update="12.2007" seq="RQ-10-00223">
<head>
<ti>JZ</ti>
<ff-list>
<ff id="0932" />
</ff-list>
</head>
<p>
Abkürzung für Jod-Zahl, siehe
<vw idref="RD-06-00645">Fettkennzahlen</vw>
.
</p>
</stwtext>
</docs>
I do it with this command:
~> bin/mlcp.sh IMPORT -mode local -host localhost -port 15000 \
-username admin -password admin \
-input_file_path /media/sf_vm.shared/theme/rom-training/v10.new-ML.XML \
-output_uri_replace "/media/sf_vm.shared/theme/rom-training/keywords,'rom-data'" \
-output_collections rom-data \
-input_file_type aggregates -aggregate_record_element stwtext \
-aggregate_uri_id @id
The command runs without errors, but the documents I see in MarkLogic are not named after the id attribute of each stwtext element. Instead they take the id of the last element inside each record (here the ff element). For example, for my document I expect to see
RD-10-00258
RD-10-00209
but what I actually get is:
0103
0932
Is this a bug, or did I do something wrong? Thanks.
Recommended answer
It's a bug. If you'd like, you can download the MLCP source code and change it yourself. Take a look at processStartElement() in AggregateXMLReader.java.
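For anyone who would rather not patch MLCP, one possible workaround is to split the aggregate file before loading it. The sketch below is not part of the original answer and is only an illustration: the function name split_records and the trimmed SAMPLE string are hypothetical, and it assumes the file fits in memory. It reads the id from the stwtext element itself, so nested ids such as those on ff are ignored.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question (illustrative only).
SAMPLE = """<docs>
<stwtext id="RD-10-00258" update="03.2011" seq="RQ-10-00001">
<head><ti><i>j</i></ti><ff-list><ff id="0103" /></ff-list></head>
</stwtext>
<stwtext id="RD-10-00209" update="12.2007" seq="RQ-10-00223">
<head><ti>JZ</ti><ff-list><ff id="0932" /></ff-list></head>
</stwtext>
</docs>"""


def split_records(xml_text, record_tag="stwtext", id_attr="id"):
    """Map each record's own id attribute to its serialized XML.

    Hypothetical helper: the attribute is read from the record element
    itself, never from a descendant such as <ff id="...">.
    """
    root = ET.fromstring(xml_text)
    records = {}
    for record in root.iter(record_tag):
        record_id = record.get(id_attr)
        if record_id is None:
            # Skip records without an id rather than guessing a name.
            continue
        records[record_id] = ET.tostring(record, encoding="unicode")
    return records


if __name__ == "__main__":
    for record_id in split_records(SAMPLE):
        print(record_id)
```

Each value in the returned mapping could then be written to its own file, named after the key, and imported as an ordinary document, so the URI no longer depends on how the aggregate reader resolves the id.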