This article covers how to convert an HTML table received by email into a CSV file.

Problem description

I have an email in HTML format. I need to download it and save the table it contains to a new file as CSV, with a semicolon as the field separator.

Sample of the received email:

Content-Type: text/html; charset=UTF-8
<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color: #cce6ff">CI</th><th style=3D"padding: 8=
px;background-color: #cce6ff">DH</th><th style=3D"padding: 8px;backgro=
und-color: #cce6ff">FG</th><th style=3D"padding: 8px;background-color: #c=
ce6ff">Mon</th><th style=3D"padding: 8px;background-color: #cce6ff">DATE=
 (UTC)</th></tr><tr><th style=3D"padding: 8px;">Arael Amarel</th><th style=
=3D"padding: 8px;">30549214</th><th style=3D"padding: 8px;">099981496</th><=
th style=3D"padding: 8px;">43</th><th style=3D"padding: 8px;">-</th><th sty=
le=3D"padding: 8px;">2019-07-11T10:06:34.311Z</th></tr><tr><th style=3D"pad=
ding: 8px;background-color: #dddddd">MATIN TARDEI</th><th style=3D"padding=
: 8px;background-color: #dddddd">45159820</th><th style=3D"padding: 8px;bac=
kground-color: #dddddd">094432451</th><th style=3D"padding: 8px;background-=
color: #dddddd">32</th><th style=3D"padding: 8px;background-color: #dddddd"=
>-</th><th style=3D"padding: 8px;background-color: #dddddd">2019-07-11T10:2=
8:41.198Z</th></tr>

Desired CSV output:

Name;CI;DH;FG;Mon;DATE (UTC)
Arael Amarel;30549214;099981496;43;-;2019-07-11T10:06:34.311Z
MATIN TARDEI;45159820;094432451;32;-;2019-07-11T10:28:41.198Z

If I open this mail in a mail client, the table renders fine. But when I put the content saved by procmail into an .html file and open it, there seems to be a formatting problem: every line ends with a "=", which makes the content impossible to process. There are also several problems with the alignment of the table and other details, which make extracting the content a nightmare.

I wrote a procmailrc with a filter to convert the HTML format to plain text. The procmailrc file:

MAILDIR=/new/mail/htmlconvert
:0
* ^Content-Type: text/html.*;
{
:0c
$MAILDIR/converted/
:0fwb
| `which html2text`
:0fwh
| `which formail` -i "Content-Type: text/plain; charset=UTF-8"
}

This was attempt number 1, and it didn't work. I thought the filter would use the html2text converter. If I run html2text directly on the saved file, the result is:

html2text

===============================================================================
 1px solid #dddddd;border-collapse: collapse;text-align: left;">
px;background-color: #cce6ff">NAME
px;background-color: #cce6ff">CI
= px;background-color: #cce6ff">DH
px;backgro= und-color: #cce6ff">FG
px;background-color: #c= ce6ff">Mon
px;background-color: #cce6ff">DATE= (UTC)
px;">Arael Amarel
px;">30549214
px;">099981496
<= th style=3D"padding: 8px;">43
px;">-
px;">2019-07-11T10:06:34.311Z
px;background-color: #dddddd">MATIN TARDEI
 8px;background-color: #dddddd">45159820
px;bac= kground-color: #dddddd">094432451
px;background-= color: #dddddd">32
px;background-color: #dddddd"= >-
px;background-color: #dddddd">2019-07-11T10:2= 8:41.198Z
px;">

I already tried lynx -dump -force-html on the file, and the result is no good for producing the CSV output either.

html2text -nobs (file)

Name;CI;DH;FG;Mon;DATE (UTC)
Arael Amarel;30549214;099981496;43;-;2019-07-11T10:06:34.311Z
MATIN TARDEI;45159820;094432451;32;-;2019-07-11T10:28:41.198Z

Update: I applied tripleee's solution to the procmailrc, but the mail still has the same format as the original source; qprint did not change it. However, running the commands directly on the file works fine. The current solution:

qprint -d -n <1563019338.1197_0.localhost.localdomain |
html2text -style pretty |
awk '/^-------------------------------------------------------------------------------/{p=1}p'

The line of dashes separates the body of the mail from the content before it. This prints:

-------------------------------------------------------------------------------

NAME         CI       CD   FG  HJ DATE (UTC)
Yaiaa Fereeira        52104575 097325303 20    -     2019-07-12T10:46:24.716Z
Gabtiel Aosta Sclavi   42445135 098322361 42    -     2019-07-12T11:07:36.110Z

Now I need to turn this content into CSV. I thought the first part would be the easier one, but I want to automate it in procmail so that it runs when the mail is downloaded.

After changing the procmailrc, procmail still delivers the mail with "=" at the end of each body line, but the headers contain:

Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8

Update: the source of the resulting email with qprint in the procmailrc:

Return-Path:
Delivered-To:
Return-path:
Envelope-to:
Delivery-date: Sat, 13 Jul 2019 08:03:48 -0300
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8
Date: Sat, 13 Jul 2019 11:03:02 +0000 (UTC)
From:
Mime-Version: 1.0
To:
Message-ID:
Subject:Fri Jul 12 2019
X-Spam-Flag: NO

<b>Fri Jul 12 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">NAME</th><th styl=
e=3D"padding: 8px;background-color: #cce6ff">CI</th><th style=3D"padding: 8=
px;background-color: #cce6ff">CD</th><th style=3D"padding: 8px;backgro=
und-color: #cce6ff">FG</th><th style=3D"padding: 8px;background-color: #c=
ce6ff">HJ</th><th style=3D"padding: 8px;background-color: #cce6ff">DATE=
 (UTC)</th></tr><tr><th style=3D"padding: 8px;">Yaiaa Fereeira</th><th st=
yle=3D"padding: 8px;">52104575</th><th style=3D"padding: 8px;">097325303</t=
h><th style=3D"padding: 8px;">20</th><th style=3D"padding: 8px;">-</th><th =
style=3D"padding: 8px;">2019-07-12T10:46:24.716Z</th></tr>

The log is written to the terminal because procmail can't write to the logfile, as you can see in this log detail:

1 message for [email protected] at aaa.com (25330 octets).
reading message [email protected]@aaa.com:1 of 1 (25330 octets)........................procmail: Error while writing to "/info/in/log"
procmail: [20191] Mon Jul 15 08:55:34 2019
procmail: Assigning "FORMAIL=/usr/bin/formail"
procmail: Assigning "QPRINT=/usr/local/bin/qprint"
procmail: Match on "^Content-Type: text/html;"
procmail: Assigning "LASTFOLDER=converted/new/1563191734.20191_0.localhost.localdomain"
 Subject: Sun Jul 14 2019
  Folder: converted/new/1563191734.20191_0.localhost.localdomain          24985
procmail: Executing " qprint -d -n | html2text -nobs "
procmail: Executing " formail -I "Content-Type: text/html; charset=UTF-8"
procmail: Skipped "Mail"
procmail: Skipped "/"
From [email protected]  Mon Jul 15 08:55:34 2019
 Subject: Sun Jul 14 2019
  Folder: **Bounced**                                                     24985
fetchmail: MDA returned nonzero status 73
 not flushed

Recommended answer

The sample in your post does not look like a valid email body at all. I'm guessing it's a body part within a MIME message with Content-type: text/html (as vaguely indicated) and Content-transfer-encoding: quoted-printable. The latter is what introduces the = escapes which you regard as problematic. Decoding them is actually fairly trivial, but how exactly to do that from Procmail depends on the overall composition of the containing message, and the utilities available to you. Unfortunately, Procmail itself has no idea about MIME structures, so you'll have to rely on external tools.
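To illustrate what decoding quoted-printable amounts to, here is a minimal sketch that uses Python's standard quopri module in place of qprint. The sample bytes are a shortened, hypothetical fragment in the style of the message above, not the real mail. It shows the two things the decoder does: a trailing "=" joins the line with the next one, and "=3D" becomes a literal "=".

```python
import quopri

# Shortened, hypothetical sample: three physical lines of quoted-printable
# text, with soft line breaks ("=" at end of line) and "=3D" escapes.
raw = (
    b'<table style=3D"border=\n'
    b': 1px solid #dddddd;"><tr><th st=\n'
    b'yle=3D"padding: 8px;">Name</th></tr></table>\n'
)

# decodestring() removes the soft line breaks and turns =XX into the byte XX.
decoded = quopri.decodestring(raw).decode("utf-8")
print(decoded)
```

After decoding, the HTML is back on one logical line and can be handed to an HTML-to-text or HTML-to-CSV step.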

As an aside, the `which ...` commands in your recipe are completely redundant. For which to work, the utilities you are looking for need to be in your PATH ... which means Procmail can find them without which.

If something is not in Procmail's default PATH, simply update PATH near the top of your .procmailrc file. This should also remove the need to use variables like $FORMAIL etc. Just use formail and make sure it's available on Procmail's PATH.

For your recipe to work, the MIME structure needs to be a single-part message. If that is indeed the case, and your html2text is otherwise correct, the only fix you need is to decode the content-transfer-encoding before piping through that. Assuming you have qprint, and with the superfluous which calls removed, that leaves

:0
* ^Content-Type: text/html.*;
{
  :0c  # no need to spell out $MAILDIR/ prefix
  converted/
  :0fwb
  | qprint -d | html2text
  :0fwh
  | formail -i "Content-Type: text/plain; charset=UTF-8" \
        -i "Content-transfer-encoding: 8bit"
}

If in fact the MIME body structure is more complex, perhaps edit your question to include the actual email source instead of your current ad-lib paraphrase.

In other words, and in some more detail, if your input message looks like

From: sender <[email protected]>
To: you <[email protected]>
Subject: HTML table
MIME-Version: 1.0
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color:....

then the recipe above should basically work. But on the other hand, if your actual message is more like

From: sender <[email protected]>
To: you <[email protected]>
Subject: HTML table
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=0xdeadbeef

This is a multi-part MIME message.

--0xdeadbeef
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color:....

--0xdeadbeef--

then the first condition will not match (the top-level headers don't contain Content-type: text/html). The actions inside the block will also need to be updated in several places, because the MIME wrapping around the HTML body part needs to be unwrapped or otherwise restructured. Here is a really quick and dirty attempt at solving this.

:0
* ^Content-Type: multipart/mixed
{
  :0c  # no need to spell out $MAILDIR/ prefix
  converted/
  :0fwb
  | perl -0777 -pe 's/=([0-9A-F]{2})/ chr(oct("0x$1"))/ge; \
    s/=\n//g; \
    s%</table>.*%%s; \
    s%.*<table[^<>]*>%%s; \
    s%<tr[^<>]*><t[dh][^<>]*>%\n%g; \
    s%<t[dh][^<>]*>%;%g; \
    s%</t[rdh]>%%g; \
    s%^\n+%%;'
  :0fwh
  | formail -i "Content-Type: text/plain; charset=UTF-8" \
        -i "Content-transfer-encoding: 8bit"
}

With minor adaptations, it should work for the single-part variation, too. But you should realize that the Perl script is a really rough cut, not a proper HTML parser.
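If a real parser is preferred over regex substitutions, one option is a small script built on Python's standard html.parser module. This is only a sketch under assumptions: the class name TableRows and the function table_to_csv are made up for illustration, the input is assumed to be already decoded from quoted-printable, and the sample HTML is a shortened stand-in for the real table.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell (e.g. the <b> date line) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html: str) -> str:
    """Return the table cells found in html as semicolon-separated lines."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, delimiter=";", lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Shortened, hypothetical sample in the style of the decoded message body.
html = (
    '<b>Thu Jul 11 2019</b><hr><table style="border: 1px solid #dddddd;">'
    '<tr><th style="padding: 8px;">Name</th><th>CI</th><th>DATE (UTC)</th></tr>'
    '<tr><th>Arael Amarel</th><th>30549214</th>'
    '<th>2019-07-11T10:06:34.311Z</th></tr></table>'
)
print(table_to_csv(html), end="")
```

The csv module is used for the output so that a cell which happens to contain a semicolon gets quoted instead of silently breaking the column layout.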

The f flag causes Procmail to replace the input message with the output from the pipeline. The formail call is then necessary because the original MIME headers are no longer correct after you have replaced the original content with content of a different type and with a different encoding. If you just want to extract the CSV data into an external file instead, the latter can be skipped and the former can be simplified to just

:0
* ^Content-type: text/html
{
  :0c
  converted/
  :0b  # no w flag necessary either once we drop f
  | qprint -d | html2text >>result.csv
}

where again we assume a single-part MIME message as input. Whether to overwrite the output file instead of appending (or perhaps write to a different CSV file each time) will depend on your specific use case, and how often you expect to receive these messages.
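The single-part assumption can be dropped if the external tool understands MIME. As a hedged sketch, Python's standard email package can locate the text/html part in either layout and undo the transfer encoding in the same step. The function name html_body and the sample message are hypothetical; a real filter would read the raw message from standard input instead.

```python
from email import message_from_bytes, policy


def html_body(raw_message: bytes) -> str:
    """Return the decoded text of the first text/html part of a message."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    # walk() yields the message itself first, then any nested parts, so
    # single-part and multipart messages are handled by the same loop.
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes quoted-printable and the charset encoding.
            return part.get_content()
    raise ValueError("no text/html part found")


# Hypothetical multipart sample modelled on the structure shown earlier.
sample = (
    b"Subject: HTML table\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: multipart/mixed; boundary=0xdeadbeef\n"
    b"\n"
    b"This is a multi-part MIME message.\n"
    b"\n"
    b"--0xdeadbeef\n"
    b"Content-Type: text/html; charset=UTF-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b'<table style=3D"border=\n'
    b': 1px solid"><tr><th>Name</th></tr></table>\n'
    b"\n"
    b"--0xdeadbeef--\n"
)
print(html_body(sample))
```

A script along these lines could sit on the right-hand side of a Procmail pipe action in place of qprint, which would make the condition on the top-level Content-Type less critical.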

(Not in particular an endorsement of qprint; there are many comparable utilities, but nothing particularly ubiquitous. It's unfortunate that the GNU Coreutils maintainers steadfastly refuse to include a similar utility.)
