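I need to parse transcripts of live chat conversations. My first thought on seeing the file was to throw regular expressions at the problem, but I'd like to know what other approaches people have used.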

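I put "elegant" in the title because I have found before that this type of task tends to become hard to maintain when it relies on regular expressions alone.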

The transcripts are generated by www.providesupport.com and emailed to an account; the plain-text transcript attachment is then extracted from the email.

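The reason for parsing the file is to extract the conversation text for later use, and also to identify the names of the visitor and the operator so that the information can be made available through a CRM.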

Here is an example of a transcript file:

Chat Transcript

Visitor: Random Website Visitor
Operator: Milton
Company: Initech
Started: 16 Oct 2008 9:13:58
Finished: 16 Oct 2008 9:45:44

Random Website Visitor: Where do i get the cover sheet for the TPS report?
* There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button
* Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.
Milton: Y-- Excuse me. You-- I believe you have my stapler?
Random Website Visitor: I really just need the cover sheet, okay?
Milton: it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire...
Random Website Visitor: oh i found it, thanks anyway.
* Random Website Visitor is now off-line and may not reply. Currently in room: Milton.
Milton: Well, Ok. But… that's the last straw.
* Milton has left the conversation. Currently in room:  room is empty.

Visitor Details
---------------
Your Name: Random Website Visitor
Your Question: Where do i get the cover sheet for the TPS report?
IP Address: 255.255.255.255
Host Name: 255.255.255.255
Referrer: Unknown
Browser/OS: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)

Best answer

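No. In fact, for the specific type of task you describe, I doubt there is a "cleaner" way to do it than regular expressions. Your files appear to have embedded line breaks, so the usual approach is to treat the line as the unit of decomposition and apply regular expressions line by line. Alongside that, you create a small state machine and use the regex matches to trigger its transitions. That way you always know where you are in the file and what kind of character data to expect next.

Also, consider using named capture groups and loading the regular expressions from an external file. Then, if the transcript format changes, you only need to adjust the regular expressions rather than write new parsing code.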
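
To make the suggestion concrete, here is a minimal sketch in Python of a line-by-line parser driven by a small state machine with named capture groups. It is only an illustration under assumptions drawn from the sample transcript above: the state names, the `PATTERNS` table, and the `parse_transcript` function are all hypothetical, and the patterns are defined inline where a real setup might load them from an external file.

```python
import re

# Hypothetical pattern table. It is inline here for brevity; it stands in
# for patterns that could be loaded from an external file.
PATTERNS = {
    "header": r"^(?P<key>Visitor|Operator|Company|Started|Finished): (?P<value>.*)$",
    "system": r"^\* (?P<text>.*)$",
    "message": r"^(?P<speaker>[^:]+): (?P<text>.*)$",
    "details": r"^Visitor Details$",
    "detail": r"^(?P<key>[^:]+): (?P<value>.*)$",
}
RX = {name: re.compile(pattern) for name, pattern in PATTERNS.items()}


def parse_transcript(text):
    """Split a transcript into header fields, messages, events and details."""
    result = {"header": {}, "messages": [], "events": [], "details": {}}
    state = "HEADER"
    for raw in text.splitlines():
        line = raw.strip()
        if state == "HEADER":
            m = RX["header"].match(line)
            if m:
                result["header"][m.group("key")] = m.group("value")
            elif not line and result["header"]:
                # First blank line after the header block starts the chat body.
                state = "BODY"
        elif state == "BODY":
            if not line:
                continue
            if RX["details"].match(line):
                state = "DETAILS"
                continue
            m = RX["system"].match(line)
            if m:
                result["events"].append(m.group("text"))
                continue
            speakers = (result["header"].get("Visitor"),
                        result["header"].get("Operator"))
            m = RX["message"].match(line)
            if m and m.group("speaker") in speakers:
                result["messages"].append((m.group("speaker"), m.group("text")))
            elif result["messages"]:
                # Assumption: an unrecognised line continues the last message.
                speaker, body = result["messages"][-1]
                result["messages"][-1] = (speaker, body + "\n" + line)
        elif state == "DETAILS":
            m = RX["detail"].match(line)
            if m:
                result["details"][m.group("key")] = m.group("value")
    return result
```

Calling `parse_transcript` on the text of a transcript returns a dictionary whose `header` entry holds the visitor and operator names for the CRM, and whose `messages` entry holds `(speaker, text)` pairs. Matching speakers against the names found in the header is a design choice made for this sketch, so that a colon inside a message is not mistaken for a new speaker.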
