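The sample text to be processed is:
GLENSTAL EXTRA MATURE COL CHEDDAR 200 GMS, ORIGINAL WAFFLES CO. ENGLISH 130G, LIFCO-SHREDDED MOZAREAL-500GM, CAPRICON TASTY BREAD -BIG, LUSINE MULTI GRAIN SLICED BREAD, ORGANIC MIXED FRUITS JUICE 10X200ML, COLA 330ML(016) PHOENIX ORGANIC, FRUITS JUICE 10X 200ML, ORGANIC FRUITS JUICE 500ML10X
I have to extract the weight, the unit, and the pack size when one is present (for example "10X" or "6X") from this text. I tried to solve it with a regular expression, but it does not work in all cases.
The code I tried is: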
$weight_unit = explode(" ", $title_string);
$units = array("LITRE", "LTRS", "LTR", "LIT", "GMS", "LBS", "KG", "GM", "GR", "ML", "OZ", "LB", "G", "L");
for ($m = 0; $m < sizeof($weight_unit); $m++) {
    foreach ($units as $unit) {
        if (preg_match('/^[0-9A-Z.]*([0-9][A-Z]|[A-Z][0-9])[0-9A-Z]*$/',
                $weight_unit[$m]) && strpos($weight_unit[$m], $unit) !== FALSE) {
            $product["weight"] = preg_replace("/[A-Za-z]/", '', $weight_unit[$m]);
            $product["unit"] = $unit;
            break;
        }
    }
}
Best answer
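It is probably not worth your trouble to try to do all of this with a single regular expression. Maybe you could get it to work, but the next person who has to work on it will have a hard time unless she is a regular expression whiz. :-) Let's try a series of nested loops instead.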
$txt = "GLENSTAL EXTRA MATURE COL CHEDDAR 200 GMS, ORIGINAL WAFFLES CO. ENGLISH 130G, LIFCO-SHREDDED MOZAREAL-500GM, CAPRICON TASTY BREAD -BIG, LUSINE MULTI GRAIN SLICED BREAD, ORGANIC MIXED FRUITS JUICE 10X200ML, COLA 330ML(016) PHOENIX ORGANIC, FRUITS JUICE 10X 200ML, ORGANIC FRUITS JUICE 500ML10X";
$units = array("LITRE", "LTRS", "LTR", "LIT", "GMS", "LBS", "KG", "GM", "GR", "ML", "OZ", "LB", "G", "L");
/* break up your string at the commas, so you handle each item by itself */
$items = preg_split("/\s*,\s*/", $txt);
/* work through the items one by one */
foreach ($items as $item) {
    $amtnum = 1;
    $amtunit = "";
    $packnum = "1";
    /* break up the item description into tokens, where
     * each number string and letter string gets its own token.
     * deal with (123) parenthesized number strings as well.
     * e.g. "FRUITS JUICE " "10" "X" "200" "ML"
     * and "COLA " "330" "ML" "(016)" " PHOENIX ORGANIC"
     * (letter tokens keep their surrounding spaces)
     */
    $toks = preg_split("/(\(\d+\)|\d+|[^\d\(\)]+)/", $item, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
    /* work backward through array of tokens, using array_pop.
     * compare against null so a "0" token does not end the loop early */
    while (($tok = array_pop($toks)) !== null) {
        /* strip the spaces a token may carry, e.g. " GMS" or "X " */
        $tok = strtoupper(trim($tok));
        /* is the present token in your array of units? */
        if (in_array($tok, $units)) {
            /* yes. grab the preceding token as the number of units */
            $amtunit = $tok;
            $amtnum = array_pop($toks);
        }
        /* is this an X (for a 16X pack or some such thing)? */
        if ($tok == 'X') {
            /* yes, grab the preceding token as the number of items in the pack */
            $packnum = array_pop($toks);
        }
    }
    /* all tokens consumed: do what you will with the result */
    echo "$item: $amtnum $amtunit, pack of $packnum\n";
}
The preg_split line is the key to this solution. Let's take a closer look at it.
$toks = preg_split(
"/(\(\d+\)|\d+|[^\d\(\)]+)/",
$item,-1,
PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
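preg_split splits a string into an array. The PREG_SPLIT_DELIM_CAPTURE flag means that the text matched by the parenthesized part of the regular expression (the delimiters themselves) is included in the result array. PREG_SPLIT_NO_EMPTY means that empty strings are left out of the result array. Now let's look at the regular expression itself. I'll add whitespace to make it easier to read.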
( \(\d+\) | \d+ | [^\d\(\)]+ )
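It begins and ends with parentheses ( ). That goes along with PREG_SPLIT_DELIM_CAPTURE. Inside, it contains three alternative match expressions separated by |.
The first is an opening parenthesis, digits, and a closing parenthesis. It matches the string (016) in the test data set.
The second is a plain string of digits. It matches things like "300".
The third is a string of letters, spaces and so on: anything other than digits and parentheses. It matches, for example, "GMS" and "FRUITS JUICE".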
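To make the three alternatives concrete, here is a small sketch that runs the same pattern and flags on two items from the sample data and shows the tokens that come back. The variable names $pattern, $flags, $a and $b are only for illustration. Note that letter tokens keep any surrounding spaces, so they need a trim() before they are compared with the unit list.

```php
<?php
// Same pattern and flags as the preg_split call discussed above.
$pattern = "/(\(\d+\)|\d+|[^\d\(\)]+)/";
$flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;

// "(016)" is picked up by the first alternative, "330" by the second,
// and everything else by the third.
$a = preg_split($pattern, "COLA 330ML(016) PHOENIX ORGANIC", -1, $flags);
// $a is ["COLA ", "330", "ML", "(016)", " PHOENIX ORGANIC"]

// The space after the X stays attached to the X token.
$b = preg_split($pattern, "FRUITS JUICE 10X 200ML", -1, $flags);
// $b is ["FRUITS JUICE ", "10", "X ", "200", "ML"]

var_export($a);
var_export($b);
```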
This is probably a reasonably robust way of using regular expressions to do this parsing job.
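If you want the result as data rather than echoed text, the same loop can be wrapped in a function. The following is only a sketch: parse_item() and its array keys are hypothetical names, not part of the code above, and the extra ctype_digit() check is an assumption of mine that a unit or an X only counts when the token before it is a plain number. It does not handle decimal amounts such as 1.5L.

```php
<?php
/**
 * Hypothetical helper: parse one item description into amount, unit and pack size.
 * Returns null for amount and unit when the item carries no recognizable quantity.
 */
function parse_item(string $item, array $units): array
{
    $result = ["amount" => null, "unit" => null, "pack" => 1];
    $toks = preg_split("/(\(\d+\)|\d+|[^\d\(\)]+)/", $item, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // Walk backward; compare against null so a "0" token does not end the loop.
    while (($tok = array_pop($toks)) !== null) {
        $tok = strtoupper(trim($tok));
        // Assumption: only accept a unit or an X when the preceding token is all digits.
        $prev = end($toks);
        if ($prev === false || !ctype_digit($prev)) {
            continue;
        }
        if (in_array($tok, $units, true)) {
            $result["unit"] = $tok;
            $result["amount"] = (int) array_pop($toks);
        } elseif ($tok === "X") {
            $result["pack"] = (int) array_pop($toks);
        }
    }
    return $result;
}

$units = array("LITRE", "LTRS", "LTR", "LIT", "GMS", "LBS", "KG", "GM", "GR", "ML", "OZ", "LB", "G", "L");
print_r(parse_item("ORGANIC FRUITS JUICE 500ML10X", $units));
print_r(parse_item("CAPRICON TASTY BREAD -BIG", $units));
```

Returning null for a missing amount, instead of defaulting to 1, lets the caller tell "no weight found" apart from a real weight of 1.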