HTML encoding issues: "Â" character showing up instead of "&nbsp;"

Problem description

I've got a legacy app just starting to misbehave, for whatever reason I'm not sure. It generates a bunch of HTML that gets turned into PDF reports by ActivePDF.

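The process goes like this:

  1. Pull an HTML template from a database with tokens in it to be replaced (e.g. "~CompanyName~", "~CustomerName~", etc.)
  2. Replace the tokens with real data
  3. Tidy the HTML with a simple regex function that properly formats HTML tag attribute values (ensures quotation marks, etc., since ActivePDF's rendering engine hates anything but single quotes around attribute values)
  4. Send the HTML off to a web service that creates the PDF.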

Somewhere in that mess, the non-breaking spaces from the HTML template (the &nbsp;s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character when viewing the document in a browser (Firefox). ActivePDF pukes on these non-UTF8 characters.

My question: since I don't know where the problem stems from and don't have time to investigate it, is there an easy way to re-encode or find-and-replace the bad characters? I've tried sending it through this little function I threw together, but it doesn't change anything.

Private Shared Function ConvertToUTF8(ByVal html As String) As String
    Dim isoEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")
    Dim source As Byte() = isoEncoding.GetBytes(html)
    Return Encoding.UTF8.GetString(Encoding.Convert(isoEncoding, Encoding.UTF8, source))
End Function

Any ideas?

I'm getting by with this for now, though it hardly seems like a good solution:

Private Shared Function ReplaceNonASCIIChars(ByVal html As String) As String
    Return Regex.Replace(html, "[^\u0000-\u007F]", " ")
End Function

Recommended answer

That'd be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it'd be 0xC2,0xA0, which, if you (incorrectly) view it as ISO-8859-1, comes out as "Â ". That includes a trailing nbsp which you might not be noticing; if that byte isn't there, then something else has mauled your document and we need to see further up to find out what.
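The byte arithmetic above can be checked directly. The following is a minimal sketch, assuming a plain console program; the module name NbspMojibakeDemo is made up for illustration and is not part of the asker's application.

```vbnet
Imports System
Imports System.Text

' Illustrative module name; not from the original application.
Module NbspMojibakeDemo
    Sub Main()
        ' U+00A0 NO-BREAK SPACE, the character that &nbsp; stands for.
        Dim nbsp As String = ChrW(&HA0)

        ' Encoding it as UTF-8 yields two bytes: C2 A0.
        Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes(nbsp)
        Console.WriteLine(BitConverter.ToString(utf8Bytes))   ' prints C2-A0

        ' Reading those same bytes as ISO-8859-1 yields two characters:
        ' U+00C2 ("Â") followed by U+00A0 (an nbsp that is easy to overlook).
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")
        Dim misread As String = latin1.GetString(utf8Bytes)
        For Each c As Char In misread
            Console.WriteLine("U+" & Convert.ToInt32(c).ToString("X4"))
        Next
    End Sub
End Module
```

The loop prints U+00C2 and then U+00A0, which is the "Â" plus the invisible trailing nbsp described above. If a document shows "Â" without that second character, the damage happened somewhere else.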

What's the regexp, how does the templating work? There would seem to be a proper HTML parser involved somewhere if your &nbsp; strings are (correctly) being turned into U+00A0 NON-BREAKING SPACE characters. If so, you could just process your template natively in the DOM, and ask it to serialise using the ASCII encoding to keep non-ASCII characters as character references. That would also stop you having to do regex post-processing on the HTML itself, which is always a highly dodgy business.
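Where no DOM serialiser is at hand, the same effect (ASCII-only output with non-ASCII characters kept as character references) can be approximated at the string level. This is only a sketch and a stand-in for real DOM serialisation, not a parser: the helper name EscapeNonAscii is hypothetical, and it assumes the HTML is already held as a correctly decoded .NET string.

```vbnet
Imports System
Imports System.Text

' Illustrative module; EscapeNonAscii is a hypothetical helper, not a library API.
Module AsciiSerialiseSketch
    ' Rewrites every code point above U+007F as a numeric character
    ' reference (e.g. U+00A0 becomes &#160;) and leaves ASCII untouched.
    Function EscapeNonAscii(ByVal html As String) As String
        Dim sb As New StringBuilder(html.Length)
        Dim i As Integer = 0
        While i < html.Length
            Dim codePoint As Integer
            If Char.IsHighSurrogate(html(i)) AndAlso i + 1 < html.Length _
               AndAlso Char.IsLowSurrogate(html(i + 1)) Then
                ' Combine a surrogate pair into a single code point.
                codePoint = Char.ConvertToUtf32(html(i), html(i + 1))
                i += 2
            Else
                ' Note: an unpaired surrogate is escaped as-is, not repaired.
                codePoint = Convert.ToInt32(html(i))
                i += 1
            End If

            If codePoint > &H7F Then
                sb.Append("&#").Append(codePoint).Append(";")
            Else
                sb.Append(ChrW(codePoint))
            End If
        End While
        Return sb.ToString()
    End Function

    Sub Main()
        ' prints Acme&#160;Ltd
        Console.WriteLine(EscapeNonAscii("Acme" & ChrW(&HA0) & "Ltd"))
    End Sub
End Module
```

Unlike replacing every non-ASCII character with a plain space, this keeps the non-breaking spaces and any accented characters intact, and the output contains only ASCII bytes, so it reads the same whether it is decoded as UTF-8 or ISO-8859-1.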

Well anyway, for now you can add one of the following to your document's <head> and see if that makes it look right in the browser:

  • for HTML4: <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  • for HTML5: <meta charset="utf-8">

If you've done that, then any remaining problem is ActivePDF's fault.

08-30 21:59