Ignored XML elements show up near eXist-db's Lucene search results

Question

I'm building an application with eXist-db that works with TEI files and transforms them into HTML.

For the search function, I configured Lucene to ignore some of the tags.

<collection xmlns="http://exist-db.org/collection-config/1.0" xmlns:teins="http://www.tei-c.org/ns/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">

       <fulltext default="none" attributes="false"/>

        <lucene>
        <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
        <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
            <text match="//teins:TEI">

                <inline qname="p"/>
                <inline qname="text"/>

                <ignore qname="teins:del"/>
                <ignore qname="teins:sic"/>
                <ignore qname="teins:index"/>
                <ignore qname="teins:term"/>
                <ignore qname="teins:note"/>

            </text>
        </lucene>


    </index>
</collection>

That mostly works: the ignored elements don't show up as search hits, but they do appear in the snippets before and after the matched text, which are returned by the kwic module. Is there a way to remove them, or to apply an XSL transformation before indexing?

Example TEI:

...daß er sie zu entwerten sucht. Wie
                   <index>
                        <term>Liebe</term>
                        <index>
                            <term>und Hass</term>
                        </index>
                    </index>
Liebe Ausströmung inneren Wertes ist,...

When I search for "Ausströmung", the query results in

 ....sucht. Wie Liebe und Hass Liebe    Ausströmung     inneren Wertes ist,...

but it should result in

 ....sucht. Wie Liebe   Ausströmung     inneren Wertes ist,...

When I search for "Hass", this text snippet does not show up in the results.

For the search functions, I'm sticking strictly to the Shakespeare example in the documentation.

Recommended answer

Let's take eXist-db's Shakespeare app as the point of departure. Say you have index entries there. You do not want hits in the index terms, which the index configuration takes care of, but you also do not want them output in the KWIC display, and that part you have to code yourself.

If you look in app.xql, you will see a function named app:filter, which is called from app:show-hits. You can use it to remove parts of the output to the KWIC display, based on the name of the parent of the text node being output.

This will give you what you want:

declare %private function app:filter($node as node(), $mode as xs:string) as xs:string? {
    let $ignored-elements := doc('/db/system/config/db/apps/shakespeare/collection.xconf')//*:ignore/@qname/string()
    let $ignored-elements :=
        for $ignored-element in $ignored-elements
        let $ignored-element := substring-after($ignored-element, ':')
        return $ignored-element
    return
        if (local-name($node/parent::*) = ('speaker', 'stage', 'head', $ignored-elements))
        then ()
        else
            if ($mode eq 'before')
            then concat($node, ' ')
            else concat(' ', $node)
};
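
For context, a filter like this is handed to the kwic module as a callback. Below is a minimal sketch of that wiring for the TEI data in the question, assuming the three-argument form of kwic:summarize and that the ft prefix is available as it is in the Shakespeare app. The collection path /db/apps/myapp/data, the function name local:filter and the hard-coded element list are illustrative assumptions, not taken from the Shakespeare app:

```xquery
xquery version "3.1";

import module namespace kwic="http://exist-db.org/xquery/kwic";
declare namespace tei="http://www.tei-c.org/ns/1.0";

(: Hypothetical callback: drop text whose parent is one of the ignored elements. :)
declare function local:filter($node as node(), $mode as xs:string) as xs:string? {
    if (local-name($node/parent::*) = ('del', 'sic', 'term', 'note'))
    then ()
    else if ($mode eq 'before')
    then concat($node, ' ')
    else concat(' ', $node)
};

(: '/db/apps/myapp/data' is a placeholder for the collection holding the TEI files. :)
for $hit in collection('/db/apps/myapp/data')//tei:TEI[ft:query(., 'Ausströmung')]
return
    kwic:summarize($hit, <config width="40"/>, local:filter#2)
```

Because the callback only inspects the direct parent of each text node, text nested more deeply inside an ignored element (for example inside a child element of a note) would still be output.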

You can of course hard-code the elements to ignore, as in ('speaker', 'stage', 'head', 'sic', 'term', 'note'). ('index' is not needed here, since the text of an index entry always sits inside 'term'.) I wanted to show that you do not have to. However, if you do not hard-code the elements to ignore, you should certainly move the assignment of $ignored-elements out of the function, for instance to a variable declared in the query prolog, so that the database (collection.xconf) is not queried for every text node you encounter. Doing that lookup per node is wasteful; I have put everything in one function only for the sake of simplicity.
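
Moving the lookup into the prolog might look like the following sketch. The variable name $app:ignored-elements is an assumption, and the fragment is meant to sit in app.xql, where the app prefix is already bound to the module namespace. It also keeps unprefixed qname values as they are, instead of reducing them to an empty string:

```xquery
(: Hypothetical prolog variable: collection.xconf is read when the variable is
   evaluated, not once for every text node passed to app:filter. :)
declare variable $app:ignored-elements as xs:string* :=
    for $qname in doc('/db/system/config/db/apps/shakespeare/collection.xconf')//*:ignore/@qname/string()
    return
        (: strip an optional prefix such as "tei:" :)
        if (contains($qname, ':'))
        then substring-after($qname, ':')
        else $qname;

declare %private function app:filter($node as node(), $mode as xs:string) as xs:string? {
    if (local-name($node/parent::*) = ('speaker', 'stage', 'head', $app:ignored-elements))
    then ()
    else if ($mode eq 'before')
    then concat($node, ' ')
    else concat(' ', $node)
};
```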

PS: A namespace prefix can be anything you choose, but the conventional prefix for the http://www.tei-c.org/ns/1.0 namespace is "tei", and changing it to "teins" can only lead to confusion.
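
As an illustration, here is the Lucene part of the question's configuration with only the prefix renamed. This is a sketch; everything else is left exactly as in the question:

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0" xmlns:tei="http://www.tei-c.org/ns/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <fulltext default="none" attributes="false"/>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
            <text match="//tei:TEI">
                <inline qname="p"/>
                <inline qname="text"/>
                <ignore qname="tei:del"/>
                <ignore qname="tei:sic"/>
                <ignore qname="tei:index"/>
                <ignore qname="tei:term"/>
                <ignore qname="tei:note"/>
            </text>
        </lucene>
    </index>
</collection>
```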
