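I have extracted some invoice-related information from an email body into a Python string, and my next task is to extract the invoice numbers from that string.
The email format can vary, which makes it hard to locate the invoice number in the text. I also tried named entity recognition in spaCy, but because in most cases the invoice number appears on the line below the heading "Invoice" or "Invoice#", NER does not understand the relationship and returns incorrect details.
Below are two samples of text extracted from email bodies:
Example 1.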

Dear Customer:
The past due invoices listed below are still pending. This includes the
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.

Example 2.
Hi - please confirm the status of below two invoices.

Invoice#               Amount               Invoice Date       Due Date
7651234                $19,579.06          29-Jan-19           28-Apr-19
9872341                $47,137.20          27-Feb-19           26-Apr-19

My problem is that if I convert the entire text into a single string, it becomes something like this:
Invoice   Date     Purchase Order  Due Date  Balance 8754321   8/17/17
7200016508     9/16/18   140.72

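As you can see, the invoice number (8754321 in this case) has changed position and no longer follows the keyword "Invoice", which makes it harder to find.
The output I want looks like this: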
Output Example - 1 -

8754321
5245344

Output Example - 2 -

7651234
9872341

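I don't know how to retrieve the text under the keyword "Invoice" or "Invoice#", which is the invoice number.
Please let me know if you need any further information. Thanks!
Edit: The invoice number has no predefined length; it can be 7 digits or more.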

Best answer

Code based on my comments.

email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.'''

index = -1
# Get first line of table, print line and index of 'Invoice'
for line in email.split('\n'):
    if all(x != x.lower() for x in line.split()) and ('Invoice' in line) and len(line) > 0:
        print('--->', line, ' --- index of Invoice:', line.find('Invoice'))
        index = line.find('Invoice')

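This uses the heuristic that every word in the column header row is capitalized or all caps (for example "ID"). It fails if a header is written as, say, "Account number" rather than "Account Number".
# Get the number found at that column index on each line
for line in email.split('\n'):
    words = line[index:].split()
    if not words:
        continue
    word = words[0]
    try:
        print(int(word))
    except ValueError:
        # Not a number at this position, so this is not a table row
        continue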
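
As a rough sketch, the two loops above could be wrapped in a single function that returns the numbers instead of printing them. The name extract_invoice_numbers is made up for illustration. The sketch keeps the same header heuristic and additionally assumes that the table rows come after the header line.

```python
def extract_invoice_numbers(text):
    """Return the integers found under the 'Invoice' header column."""
    numbers = []
    index = -1  # column offset of 'Invoice'; -1 until the header is seen
    for line in text.split('\n'):
        if index == -1:
            # Same heuristic as above: every word in the header row
            # contains at least one uppercase letter.
            words = line.split()
            if 'Invoice' in line and all(w != w.lower() for w in words):
                index = line.find('Invoice')
            continue
        # After the header: take the first token at the column offset.
        cell = line[index:].split()
        if not cell:
            continue
        try:
            numbers.append(int(cell[0]))
        except ValueError:
            # Not a number at this position, so not a table row.
            continue
    return numbers
```

Called on the two sample emails from the question, this should yield the invoice numbers listed in the desired output, since in both samples the invoice column starts at offset 0.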

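Reliability here depends on the data. In my code, the "Invoice" column must be the first column in the table header; that is, "Invoice Date" must not appear before "Invoice". Obviously, this needs fixing.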
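
One possible way to relax that restriction is sketched below. It assumes that columns are separated by at least two spaces while words inside one header cell are separated by a single space, which holds for the two samples in the question but may not hold for other emails. The helper name find_invoice_column is hypothetical, and the accepted header names "Invoice" and "Invoice#" are taken from the two samples only.

```python
import re

def find_invoice_column(header):
    """Return the character offset of the invoice-number column, or -1."""
    # Assumption: two or more spaces separate columns, so "Invoice Date"
    # stays one cell and is not mistaken for the "Invoice" column.
    for cell in re.finditer(r'\S+(?: \S+)*', header):
        if cell.group() in ('Invoice', 'Invoice#'):
            return cell.start()
    return -1
```

The returned offset could be used in place of line.find('Invoice') when slicing each table row, provided the rows are aligned with the header.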

Regarding python - how to extract specific information from a multi-line string, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56041885/
