php - How to extract weights and other metadata from a block of text?

The sample text that needs to be processed is:

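GLENSTAL EXTRA MATURE COL CHEDDAR 200 GMS, ORIGINAL WAFFLES CO.
ENGLISH 130G, LIFCO-SHREDDED MOZAREAL-500GM, CAPRICON TASTY BREAD
-BIG, LUSINE MULTI GRAIN SLICED BREAD, ORGANIC MIXED FRUITS JUICE 10X200ML, COLA 330ML(016) PHOENIX ORGANIC, FRUITS JUICE 10X 200ML,
ORGANIC FRUITS JUICE 500ML10X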

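I have to extract the weight, the unit, and the pack size where one is given, e.g. "10X" or "6X". I tried to solve this with a regular expression, but it does not work in all cases.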

The code I tried is:

$weight_unit = explode(" ", $title_string);
 $units = array("LITRE", "LTRS", "LTR", "LIT", "GMS", "LBS", "KG", "GM", "GR", "ML", "OZ", "LB", "G", "L");
 for ($m = 0; $m < sizeof($weight_unit); $m++) {
   foreach ($units as $unit) {
     if (preg_match('/^[0-9A-Z.]*([0-9][A-Z]|[A-Z][0-9])[0-9A-Z]*$/',
          $weight_unit[$m]) && strpos($weight_unit[$m], $unit) !== FALSE) {
          $product["weight"] = preg_replace("/[A-Za-z]/", '', $weight_unit[$m]);
          $product["unit"] = $unit;
          break;
      }
   }
 }

Best answer

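Trying to do all of this with just one regular expression is probably not worth your trouble. Maybe you could get it to work, but the next person who has to work on it will have a hard time, unless she is used to whistling. :-) Let's try a series of nested loops instead.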

$txt = "GLENSTAL EXTRA MATURE COL CHEDDAR 200 GMS, ORIGINAL WAFFLES CO. ENGLISH 130G, LIFCO-SHREDDED MOZAREAL-500GM, CAPRICON TASTY BREAD -BIG, LUSINE MULTI GRAIN SLICED BREAD, ORGANIC MIXED FRUITS JUICE 10X200ML, COLA 330ML(016) PHOENIX ORGANIC, FRUITS JUICE 10X 200ML, ORGANIC FRUITS JUICE 500ML10X";
$units = array("LITRE", "LTRS", "LTR", "LIT", "GMS", "LBS", "KG", "GM", "GR", "ML", "OZ", "LB", "G", "L");
/* break up your string at the commas, so you handle each item by itself */
$items = preg_split("/\s*,\s*/", $txt);

/* work through the items one by one */
foreach ($items as $item) {
    $amtnum = 1;
    $amtunit = "";
    $packnum = "1";

    /* break up the item description into tokens, where
     * each number string and letter string gets its own token.
     * deal with (123) parenthesized number strings as well.
     * letter tokens keep their surrounding spaces:
     *   e.g.   "FRUITS JUICE " "10" "X " "200" "ML"
     *   and    "COLA " "330" "ML" "(016)" " PHOENIX ORGANIC"
     */
    $toks = preg_split("/(\(\d+\)|\d+|[^\d\(\)]+)/", $item,-1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
    /* work backward through array of tokens, using array_pop */
    while (($tok = array_pop($toks)) !== null) {
        /* tokens keep their surrounding spaces (" GMS", "X "), so trim before comparing */
        $tok = strtoupper(trim($tok));
        /* is the present token in your array of units? */
        if (in_array($tok, $units, true)) {
            /* yes. grab next token as the number of units */
            $amtunit = $tok;
            $amtnum = array_pop($toks);
        }
        /* is this an X (for a 16X pack or some such thing)? */
        if ($tok === 'X') {
            /* yes, grab next token as the number of items in the pack */
            $packnum = array_pop($toks);
        }
    }
    /* do what you will with the result, once per item */
    echo $amtnum, " ", $amtunit, " ", $packnum, PHP_EOL;
}

This line is the key to the solution. Let's examine it.

    $toks = preg_split(
            "/(\(\d+\)|\d+|[^\d\(\)]+)/",
            $item,-1,
            PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);

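preg_split splits a string into an array. The PREG_SPLIT_DELIM_CAPTURE flag means that whatever the parenthesized part of the regular expression matches is included in the result array. PREG_SPLIT_NO_EMPTY means that empty strings are left out of the result array.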
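To see what each flag contributes, here is a small sketch that splits one item from the sample data three ways. The variable names are only for illustration.

```php
<?php
$item = "COLA 330ML(016) PHOENIX ORGANIC";
$re = '/(\(\d+\)|\d+|[^\d\(\)]+)/';

/* both flags: every matched piece comes back as a token */
$toks = preg_split($re, $item, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $toks is ["COLA ", "330", "ML", "(016)", " PHOENIX ORGANIC"] */

/* without PREG_SPLIT_DELIM_CAPTURE the matched pieces are discarded, and
 * since the pattern matches every character of this item, nothing is left */
$noCapture = preg_split($re, $item, -1, PREG_SPLIT_NO_EMPTY);
/* $noCapture is [] */

/* without PREG_SPLIT_NO_EMPTY the empty strings between adjacent matches stay in */
$withEmpties = preg_split($re, $item, -1, PREG_SPLIT_DELIM_CAPTURE);

var_export($toks);
```

In other words, the pattern is not describing separators here. It describes the tokens themselves, and the two flags turn preg_split into a tokenizer.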

Let's look at the regular expression itself. I'll add spaces to make it easier to read.

(  \(\d+\)  |  \d+  |  [^\d\(\)]+  )

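It starts and ends with parentheses (). These go together with PREG_SPLIT_DELIM_CAPTURE.

Then it contains three alternative match expressions, separated by |.

The first is a parenthesis, digits, and a parenthesis. It matches the string (016) in the test data set.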

The second is a string of digits. It matches things like "300".

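The third is a string of letters, spaces, and so on: anything other than digits and parentheses. It matches " GMS" and "FRUITS JUICE ", for example, including any spaces around the words.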
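As a rough illustration of the three alternatives, the sketch below labels each token of one sample item with the alternative that produced it. The helper whichAlternative is hypothetical and exists only for this demonstration.

```php
<?php
/* hypothetical helper: reports which of the three alternatives a token came from */
function whichAlternative(string $tok): string
{
    if (preg_match('/^\(\d+\)$/D', $tok)) {
        return 'parenthesized number';
    }
    if (preg_match('/^\d+$/D', $tok)) {
        return 'number';
    }
    /* anything else is a run of non-digit, non-parenthesis characters */
    return 'text';
}

$toks = preg_split('/(\(\d+\)|\d+|[^\d\(\)]+)/', "COLA 330ML(016) PHOENIX ORGANIC",
                   -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
foreach ($toks as $tok) {
    echo '"', $tok, '" => ', whichAlternative($tok), PHP_EOL;
}
```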

This is probably a fairly robust way of doing this parsing job with regular expressions.
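If you would rather have a reusable function than an inline loop, the same idea could be wrapped up roughly like this. It is only a sketch: the name parseItem, the result keys, and the extra check that a unit or X must directly follow a number are assumptions of mine, not part of the code above.

```php
<?php
/* hypothetical wrapper around the tokenizing approach */
function parseItem(string $item, array $units): array
{
    $result = ['amount' => null, 'unit' => null, 'pack' => 1];
    $toks = preg_split('/(\(\d+\)|\d+|[^\d\(\)]+)/', $item, -1,
                       PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    /* work backward through the tokens */
    while (($tok = array_pop($toks)) !== null) {
        $tok = strtoupper(trim($tok));
        $prev = end($toks);   /* token just before this one, or false if none */
        if ($prev === false || !ctype_digit($prev)) {
            continue;         /* a unit or an X only counts right after a number */
        }
        if (in_array($tok, $units, true)) {
            $result['unit'] = $tok;
            $result['amount'] = (int) array_pop($toks);
        } elseif ($tok === 'X') {
            $result['pack'] = (int) array_pop($toks);
        }
    }
    return $result;
}

$units = array("LITRE", "LTRS", "LTR", "LIT", "GMS", "LBS", "KG", "GM", "GR", "ML", "OZ", "LB", "G", "L");
$samples = array(
    "GLENSTAL EXTRA MATURE COL CHEDDAR 200 GMS",
    "ORGANIC MIXED FRUITS JUICE 10X200ML",
    "COLA 330ML(016) PHOENIX ORGANIC",
    "ORGANIC FRUITS JUICE 500ML10X",
    "CAPRICON TASTY BREAD -BIG",
);
foreach ($samples as $sample) {
    echo $sample, " => ", json_encode(parseItem($sample, $units)), PHP_EOL;
}
```

Items without a weight come back with amount and unit set to null. Note that this tokenizer splits on every digit run, so a decimal amount such as 1.5L would not be read correctly without further work.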