How to find the percentage of a regex match on a string?

Question

I'm involved in a digital radio propagation study where a remote transmitter sends a predefined beacon at a defined time that's easily matched with a regex.

But due to solar and atmospheric conditions, it's not always 100% decoded. What I want to do is calculate the percentage of the beacon that was decoded.

The beacon format is as follows:

de va6shs va6shs va6shs Loc DO46gs Olivia-4-250 NBEMS test 2218Z
     |                        |          |                   |
 (Station)               (Location) (Digital Mode)       (UTC Time)

Can I actually figure out the percentage with Perl, or should I be looking for another solution?

Because there is limited error correction in the data mode we are using, random characters often end up in the decoded string, or characters are not decoded at all. The following are strings received from the same station at different times of the same day as solar conditions degraded.

100% decode
de ve6rfm ve6rfm ve6rfm Loc DO46gs Olivia-4-250 NBEMS test 0218Z

93.75%
P!de ve6rfm ve6rfm ve6rfm Loc DO46gs Olivia-4-250 NBEMS <TAB>est F248Z

9.375%
de ve6rfmr&

The only difference there should be between the two beacon strings is the UTC time at the end of the string, but as you can see there are a few characters that didn't decode correctly.

The correctly decoded string has 64 characters. The first incorrectly decoded string has 60 correct characters. So 60/64 * 100 = 93.75% decoded.

My regex for the station call sign (the three repeated words) is

 /[vV][aAeEyY][15678]\w{2,3}/

There are several different stations involved in the study across western Canada so I need to capture them as propagation permits, and using the above regex saves me from having to update my script every time a new station comes on the air.
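
As an illustration of how the call-sign pattern above might be applied, here is a minimal sketch using only core Perl. The variable names and the majority-vote step are assumptions added for the example, not part of the original script; the idea is that the three repeated copies let a clean copy outvote a garbled one.

```perl
use strict;
use warnings;

# Call-sign pattern from the question, precompiled.
my $callsign_re = qr/[vV][aAeEyY][15678]\w{2,3}/;

my $received =
    'de ve6rfm ve6rfm ve6rfm Loc DO46gs Olivia-4-250 NBEMS test 0218Z';

# In list context with /g, the match returns every non-overlapping hit.
my @calls = $received =~ /($callsign_re)/g;

# Tally each candidate (case-folded) so a garbled copy can be outvoted.
my %seen;
$seen{lc $_}++ for @calls;
my ($best) = sort { $seen{$b} <=> $seen{$a} || $a cmp $b } keys %seen;

print "copies found: ", scalar(@calls), "\n";
print "most frequent call sign: $best\n" if defined $best;
```

Note that the pattern is unanchored, so a damaged copy such as "ve6rfmr&" still yields a candidate; adding word boundaries would be stricter but would reject more partially decoded beacons.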

Recommended answer

The problem is one of partial or fuzzy matching. There are modules out there that may help. They mostly use Levenshtein distance, the number of edits needed to get one string from the other, but there are other methods. See a partial list in Text::Levenshtein. See this post for search phrases that will offer far more.
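
To make the edit-distance idea concrete, here is a minimal core-Perl sketch that needs no CPAN modules. The `levenshtein` and `decode_percent` helpers are hypothetical names written for this example, and turning the distance into a percentage of the reference length is one possible convention; it will not necessarily reproduce the hand-counted figures in the question.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Classic dynamic-programming Levenshtein distance
# (insertions, deletions, and substitutions each cost 1).
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);    # distances from the empty prefix of $s
    for my $i (1 .. scalar @s) {
        my @curr = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,            # deletion
                $curr[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,    # substitution or match
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Hypothetical helper: share of the reference left intact, floored at 0.
sub decode_percent {
    my ($reference, $received) = @_;
    my $len = length $reference;
    return 0 if $len == 0;
    my $pct = 100 * ($len - levenshtein($reference, $received)) / $len;
    return $pct < 0 ? 0 : $pct;
}

my $reference =
    'de ve6rfm ve6rfm ve6rfm Loc DO46gs Olivia-4-250 NBEMS test 0218Z';
printf "%.2f%%\n", decode_percent($reference, $reference);
printf "%.2f%%\n", decode_percent($reference, 'de ve6rfmr&');
```

Because the UTC time legitimately differs between transmissions, it may be worth stripping or masking the time field before comparing, so that a correct time is not counted as an error.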

Here are examples using String::Approx, String::Similarity, and Text::Fuzzy. None gives exactly what you ask but all retrieve similar measures, and have options that may allow you to get your target.

use warnings 'all';
use strict;

my $beacon =
    'de va6shs va6shs va6shs Loc DO46gs Olivia-4-250 NBEMS test 2218Z';
my $received =
    'P!de ve6rfm ve6rfm ve6rfm Loc DO46gs Olivia-4-250 NBEMS <TAB>est F248Z';

# Can use an object, or the functional interface
use Text::Fuzzy qw(fuzzy_index distance_edits);
my $tf = Text::Fuzzy->new ($beacon);

my ($offset, $edits, $distance);
# Different distance/edits
$distance = $tf->distance($received);
($offset, $edits, $distance) = fuzzy_index    ($received, $beacon);
($distance, $edits)          = distance_edits ($received, $beacon);

# Provides "similarity", in terms of edit distance
use String::Similarity;
my $similarity = similarity $beacon, $received;

# Can be tuned, but is more like regex in some sense. See docs.
use String::Approx qw(amatch);
my @matches = amatch($beacon, $received);  # within 10%
# amatch($beacon, ["20%"], $received);     # within 20%
# amatch($beacon, ["S0"], $received);      # no "substitutions"

Please see their documentation.

String::Approx counts a string as a "match" if it differs from the pattern by no more than 10% of the pattern's length. This is the default, and the module allows that parameter to be adjusted. For example,

amatch($beacon, ["20%"], $received);

would make that 20%. Other refinements that may be useful to you can be made. Newer versions of the module are written in C and perform much better.
