Problem description

I'm wondering how I can identify headings with differing numerical marking styles using one or more regular expressions, given that styles sometimes overlap between documents. The goal is to extract all the subheadings and data for a specific heading in each file, but these files aren't standardized. Are regular expressions even the right approach here?

I'm working on a program that parses a .pdf file and looks for a specific section. Once it finds the section, it finds all subsections of that section and their content and stores them in a Dictionary<string, string>. I start by reading the entire PDF into a string, and then use this function to locate the "marking" section.

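private string GetMarkingSection(string text)
{
    int startIndex = 0;
    // Default to the end of the text so that a "marking" section that is
    // the last heading in the document does not produce a negative length.
    int endIndex = text.Length;
    bool startIndexFound = false;
    Regex rx = new Regex(HEADINGREGEX);
    foreach (Match match in rx.Matches(text))
    {
        if (startIndexFound)
        {
            // The first heading after the "marking" heading closes the section.
            endIndex = match.Index;
            break;
        }
        if (match.ToString().ToLower().Contains("marking"))
        {
            startIndex = match.Index;
            startIndexFound = true;
        }
    }
    // No "marking" heading at all: return an empty section.
    if (!startIndexFound) return string.Empty;
    return text.Substring(startIndex, endIndex - startIndex);
}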

Once the marking section is found, I use this to find subsections.

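private Dictionary<string, string> GetSubsections(string text)
{
    Dictionary<string, string> subsections = new Dictionary<string, string>();
    // Regex.Split also returns the text captured by any capturing group in the
    // pattern, so (assuming SUBSECTIONREGEX has one) titles and contents
    // alternate in the resulting array.
    string[] unprocessedSubSecs = Regex.Split(text, SUBSECTIONREGEX);
    string title = "";
    string content = "";
    foreach (string s in unprocessedSubSecs)
    {
        if (s != "") // sometimes it pulls in empty strings
        {
            Match m = Regex.Match(s, SUBSECTIONREGEX);
            if (m.Success)
            {
                // This piece is a subsection heading; remember it for the next piece.
                title = s;
            }
            else
            {
                // This piece is body text belonging to the last heading seen.
                content = s;
                if (!String.IsNullOrWhiteSpace(content) && !String.IsNullOrWhiteSpace(title))
                {
                    subsections.Add(title, content);
                }
            }
        }
    }
    return subsections;
}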

Getting these methods to work the way I want isn't the issue; the problem is getting them to work with every document. I'm working on a commercial application, so any API that requires a license isn't going to work for me. These documents are anywhere from 1 to 16 years old, so the formatting varies quite a bit. Here is a link to some sample headings and subheadings from various documents. But to make it easy, here are the regex patterns I'm using:

  • Heading: (?m)^(\d+\.\d+\s[ \w,\-]+)\r?$
  • Subheading: (?m)^(\d\.[\d.]+ ?[ \w]+) ?\r?$
  • Master Key: (?m)^(\d\.?[\d.]*? ?[ \-,:\w]+) ?\r?$

Since some headings use the subheading format in other documents I am unable to use the same heading regex for each file, and the same goes for my subheading regex.

My alternative was to write a master key (listed above) that identifies all types of headings, locate the last numeric component of each heading (5.1.X), and then look for 5.1.X+1 to find the end of that section.
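A minimal sketch of that "X+1" lookup, assuming the heading number has already been isolated as a string such as "5.1.3". The class SectionNumber and its methods are hypothetical names used only for illustration; they are not part of the program described here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SectionNumber
{
    // "5.1.3" -> { 5, 1, 3 }; empty parts (as after a trailing dot, "5.") are skipped.
    public static int[] Parse(string number) =>
        number.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries)
              .Select(part => int.Parse(part))
              .ToArray();

    // Increments the last component: "5.1.3" -> "5.1.4".
    public static string NextSibling(string number)
    {
        int[] parts = Parse(number);
        parts[parts.Length - 1]++;
        return string.Join(".", parts);
    }

    // Index where the section ends: the first line that starts with the next
    // sibling number, or the end of the text when no such line exists.
    public static int EndOfSection(string text, string number)
    {
        string pattern = "^" + Regex.Escape(NextSibling(number)) + @"\s";
        Match next = Regex.Match(text, pattern, RegexOptions.Multiline);
        return next.Success ? next.Index : text.Length;
    }
}
```

When the next sibling never appears in the text, this sketch falls back to the end of the text, so the section may swallow more than intended.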

That's when I ran into another problem. Some of these files have absolutely no proper structure. Most of them go from 5.2->7.1.5 (5.2->5.3/6.0 would be expected)

I'm trying to wrap my head around a solution for something like this, but I've got nothing... I am open to ideas not involving regex as well.

Here is my updated GetMarkingSection method:

private Dictionary<string, string> GetMarkingSection(string text)
{
    const RegexOptions options = RegexOptions.Multiline | RegexOptions.Singleline;
    var headingRegex = HEADING1REGEX;
    var subheadingRegex = HEADING2REGEX;
    Dictionary<string, string> markingSection = new Dictionary<string, string>();

    // No level-1 headings in this document: move both patterns down one level.
    if (Regex.Matches(text, HEADING1REGEX, options).Count == 0)
    {
        headingRegex = HEADING2REGEX;
        subheadingRegex = HEADING3REGEX;
    }

    foreach (Match m in Regex.Matches(text, headingRegex, options))
    {
        if (Regex.IsMatch(m.ToString(), HEADINGMASTERKEY))
        {
            if (m.Groups[2].Value.ToLower().Contains("marking"))
            {
                var subheadings = Regex.Matches(m.ToString(), subheadingRegex, options);
                foreach (Match s in subheadings)
                {
                    markingSection.Add(s.Groups[1].Value + " " + s.Groups[2].Value, s.Groups[3].Value);
                }
                return markingSection;
            }
        }
    }
    return null;
}

Here are some example PDF files:

Solution

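See if this approach works. All three patterns rely on RegexOptions.Multiline | RegexOptions.Singleline: Multiline lets ^ and $ match at line boundaries, and Singleline lets .*? in the content group span several lines.

var heading1Regex = @"^(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\s|\Z)";

var heading2Regex = @"^(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\s|\Z)";

var heading3Regex = @"^(\d+)\.(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\.\d+\s|\Z)";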
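To make the group layout concrete, here is a small sketch that runs the second-level pattern over a text. The class name Heading2Demo and any sample text you feed it are assumptions for illustration only. In .NET, unnamed groups are numbered first, so Groups[1] and Groups[2] hold the two numbers, while the title and content are read by name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class Heading2Demo
{
    const string Heading2Regex =
        @"^(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\s|\Z)";

    // Returns one "number | title | content" line per X.Y section found.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        var matches = Regex.Matches(
            text, Heading2Regex, RegexOptions.Multiline | RegexOptions.Singleline);
        foreach (Match m in matches)
        {
            string number = m.Groups[1].Value + "." + m.Groups[2].Value;
            // Trim() also drops the '\r' that ends up in the groups when the
            // text uses Windows line endings.
            string title = m.Groups["title"].Value.Trim();
            string content = m.Groups["content"].Value.Trim();
            lines.Add(number + " | " + title + " | " + content);
        }
        return lines;
    }
}
```

Calling Heading2Demo.Describe("5.1 Marking\nAll parts are marked.\n5.2 Labels\nLabels are glued on.") should yield two entries, one per X.Y heading, each carrying its own body text.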

For each PDF file:

var headingRegex = heading1Regex;
var subHeadingRegex = heading2Regex;

if there are any matches for headingRegex
{
    for each match, find matches for subHeadingRegex
}
else
{
    // no top-level headings: move both patterns down one level
    headingRegex = heading2Regex;
    subHeadingRegex = heading3Regex;
    // repeat the same steps
}
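The steps above can be sketched as runnable C#. This is only one way to wire it up: SectionFinder, FindSubsections and the keyword parameter are hypothetical names, and the sketch assumes the whole PDF text is already available as a single string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SectionFinder
{
    const RegexOptions Options = RegexOptions.Multiline | RegexOptions.Singleline;
    const string Heading1 = @"^(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\s|\Z)";
    const string Heading2 = @"^(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\s|\Z)";
    const string Heading3 = @"^(\d+)\.(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\.\d+\s|\Z)";

    // Returns subsection title -> content for the first heading whose title
    // contains the keyword (case-insensitive); empty when nothing matches.
    public static Dictionary<string, string> FindSubsections(string text, string keyword)
    {
        string headingRegex = Heading1;
        string subHeadingRegex = Heading2;
        if (!Regex.IsMatch(text, headingRegex, Options))
        {
            // No "5 Title" style headings: move both patterns down one level.
            headingRegex = Heading2;
            subHeadingRegex = Heading3;
        }

        var subsections = new Dictionary<string, string>();
        foreach (Match heading in Regex.Matches(text, headingRegex, Options))
        {
            string title = heading.Groups["title"].Value;
            if (title.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) < 0)
                continue;

            foreach (Match sub in Regex.Matches(heading.Value, subHeadingRegex, Options))
            {
                // Indexer instead of Add: a repeated title overwrites the
                // earlier entry rather than throwing.
                subsections[sub.Groups["title"].Value.Trim()] =
                    sub.Groups["content"].Value.Trim();
            }
            break; // only the first matching section is of interest
        }
        return subsections;
    }
}
```

For example, SectionFinder.FindSubsections(text, "marking") works both on a document numbered "5 Marking / 5.1 Labels" and on one numbered "5.1 Marking / 5.1.1 Labels", because the level is chosen per document.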

1. Edge case 1: after 5.2 comes 7.1.3

Get the main section match using heading2Regex.

Convert group 1 of the match to an integer:

int.TryParse(match.Groups[1].Value, out var headingIndex);

Get the subsection matches using heading3Regex.

For each subsection match, convert group 1 to an integer:

int.TryParse(subMatch.Groups[1].Value, out var subHeadingIndex);

Check whether headingIndex equals subHeadingIndex. If it does not, the subsection belongs to a different section; handle it accordingly.
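A rough sketch of that check, assuming the caller simply wants to keep the subsections whose first number agrees with the enclosing heading and skip the rest. EdgeCaseSketch and MatchingSubsections are hypothetical names, and skipping is just one possible way to "handle accordingly".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EdgeCaseSketch
{
    const RegexOptions Options = RegexOptions.Multiline | RegexOptions.Singleline;
    const string Heading2 = @"^(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\s|\Z)";
    const string Heading3 = @"^(\d+)\.(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\.\d+\s|\Z)";

    // Titles of the X.Y.Z subsections whose first number equals the first
    // number of the X.Y heading that contains them.
    public static List<string> MatchingSubsections(string text)
    {
        var titles = new List<string>();
        foreach (Match heading in Regex.Matches(text, Heading2, Options))
        {
            if (!int.TryParse(heading.Groups[1].Value, out var headingIndex))
                continue;

            foreach (Match sub in Regex.Matches(heading.Value, Heading3, Options))
            {
                if (int.TryParse(sub.Groups[1].Value, out var subHeadingIndex)
                    && subHeadingIndex == headingIndex)
                {
                    titles.Add(sub.Groups["title"].Value.Trim());
                }
                // Otherwise the subsection is a stray (e.g. 7.1.3 under 5.2)
                // and is skipped in this sketch.
            }
        }
        return titles;
    }
}
```

For the text "5.2 Marking\n5.2.1 Labels\nGlue them on.\n7.1.3 Storage\nKeep dry.", only "Labels" should be returned, since 7.1.3 does not share its first number with 5.2.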
