本文介绍了如何在Elasticsearch中将某些单词组合成令牌?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

对于像这是美好的一天"这样的字符串,我想将该字符串标记为标记:这是一个美好的一天,美好的一天",在这里我可以指定一组要组合的单词.在这种情况下,只有美丽"和日子".

For a string like "This is a beautiful day", I want to tokenize the string into tokens:"This, is, a, beautiful, day, beautiful day" where I can specify a certain set of words to combine. In this case only "beautiful" and "day".

到目前为止,我已经使用Shingle过滤器生成令牌列表,如下所示:这是一个美好的一天,一个美好的一天,一天,一天,这一天"

So far, I have used Shingle filter to produce the token list like below:"This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day"

如何进一步过滤上面的令牌列表以产生所需的结果?

How can I further filter the token list above to produce my desired result?

这是我当前的代码:

shingle_filter = {
    "type": "shingle",
    "min_shingle_size": 2,
    "max_shingle_size": 3,
    "token_separator": " "
  }

body = {'tokenizer':'standard','filter':['lowercase', shingle_filter], 'text':sample_text['content'], 'explain':False}

standard_tokens = analyze_client.analyze(body= body, format='text')

推荐答案

经过一番挣扎,似乎predicate_token_filter是我需要的.

After struggling a bit, it seems predicate_token_filter was the one I need.

shingle_filter = {
"type": "shingle",
"token_separator": " "}

predicate_token_filter_temp = {
        "type" : "predicate_token_filter",
    "script" : {
        "source" : "String term = \"beautiful day\"; token.getTerm().toString().equals(term)"
    }}

body = {'tokenizer':'standard','filter':['lowercase', shingle_filter, predicate_token_filter_temp], 'text':sample_text['content'], 'explain':False}

standard_tokens = analyze_client.analyze(body= body, format='text')

我不确定这是最好的方法,但是可以完成工作.

I'm not sure this is the best way to do but it gets the job done.

这篇关于如何在Elasticsearch中将某些单词组合成令牌?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-24 14:04