Question
Each line is a string:
4Â&nbsp;minutes
12Â&nbsp;minutes
16Â&nbsp;minutes
I was able to remove the Â successfully using str_replace, but not the HTML entity. I found this question: How to remove html special chars?
But the preg_replace suggested there did not do the job. How can I remove the HTML entity and that Â?
Edit:
I think I should have said this earlier: I am using DOMDocument::loadHTML() and DOMXpath.
Edit:
Since this seems like an encoding issue, I should say that this is actually all separate strings.
Recommended answer
Alright - I think I've got a handle on this now - I want to expand on some of the encoding errors that people are getting at:
This seems to be an advanced case of Mojibake, but here is what I think is going on. MikeAinOz's original suspicion that this is UTF-8 data is probably true. If we take the following UTF-8 data:
4&nbsp;minutes
Now, remove the HTML entity, and replace it with the character it actually corresponds to: U+00A0. (It's a non-breaking space, so I can't exactly "show" you.) You get the string "4 minutes" (with a non-breaking space between the 4 and "minutes"). Encode this as UTF-8, and you get the following byte sequence:
characters: 4  [nbsp] m  i  n  ...
bytes     : 34 C2 A0  6D 69 6E ...
(I'm using [nbsp] above to mean a literal non-breaking space: not the HTML entity &nbsp;, but the character that entity represents. It's just whitespace, and thus difficult to show.) Note that [nbsp]/U+00A0 (non-breaking space) takes 2 bytes to encode in UTF-8.
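To make the byte sequence concrete, here is a minimal sketch in Python (chosen only for illustration; it is not the asker's PHP). The variable names are my own. It encodes the string with a literal U+00A0 as UTF-8 and prints the bytes in hex:

```python
# Illustration only: encode "4", a literal non-breaking space, and "minutes" as UTF-8.
text = "4\u00a0minutes"

utf8_bytes = text.encode("utf-8")

# Render each byte as two uppercase hex digits, separated by spaces.
hex_view = " ".join(f"{b:02X}" for b in utf8_bytes)
print(hex_view)  # 34 C2 A0 6D 69 6E 75 74 65 73

# U+00A0 alone occupies two bytes (C2 A0) in UTF-8.
nbsp_length = len("\u00a0".encode("utf-8"))
print(nbsp_length)  # 2
```

The C2 A0 pair is the single non-breaking space; every other character in this string is plain ASCII and takes one byte.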
Now, to go from the byte stream back to readable text, we should decode using UTF-8, since that's what we encoded with. Instead, let us (mistakenly) decode using ISO-8859-1 ("latin1") - if you use the wrong encoding, this is almost always the one.
bytes     : 34 C2 A0     6D 69 6E ...
characters: 4  Â  [nbsp] m  i  n  ...
Now switch the raw non-breaking space into its HTML entity representation (&nbsp;), and you get what you have.
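The mis-decoding step can be reproduced in a few lines. This is a hedged sketch in Python rather than PHP, with variable names of my own choosing; it starts from the UTF-8 bytes, decodes them as ISO-8859-1, and then writes the non-breaking space as an entity:

```python
# The UTF-8 bytes for "4", U+00A0, "minutes".
utf8_bytes = bytes.fromhex("34 C2 A0 6D 69 6E 75 74 65 73")

# Wrong decoder: ISO-8859-1 maps every byte to one character,
# so C2 becomes "Â" and A0 becomes a non-breaking space.
wrong_text = utf8_bytes.decode("iso-8859-1")

# Write the raw non-breaking space as its HTML entity.
as_entity = wrong_text.replace("\u00a0", "&nbsp;")
print(as_entity)  # 4Â&nbsp;minutes
```

ISO-8859-1 never rejects a byte, which is why this kind of mistake produces junk characters instead of an error.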
So, either your PHP code is interpreting your text in the wrong character set, and you need to tell it otherwise, or you are somehow outputting the result in the wrong character set. More code would be useful here: where are you getting the data you're passing to loadHTML, and how are you producing the output you're seeing?
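If the damage has already been done and you only have the garbled string, the mix-up can sometimes be reversed by running the steps backwards. The sketch below is in Python, not PHP, and `repair_mojibake` is a hypothetical helper name of mine. It assumes the text really was UTF-8 that got decoded as ISO-8859-1 exactly once; it is a workaround, and fixing the character set at the source is the better cure.

```python
import html


def repair_mojibake(garbled: str) -> str:
    """Hypothetical helper: undo UTF-8 text that was decoded as ISO-8859-1."""
    # Turn entities such as "&nbsp;" back into the characters they stand for.
    raw = html.unescape(garbled)
    # Recover the original bytes, then decode them with the correct encoding.
    return raw.encode("iso-8859-1").decode("utf-8")


fixed = repair_mojibake("4Â&nbsp;minutes")
print(repr(fixed))  # '4\xa0minutes'
```

On input that does not match the assumption, the call raises UnicodeEncodeError or UnicodeDecodeError rather than returning text, so it is only safe to use on strings known to be garbled this way.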
Some background: A "character encoding" is just a means of going from a series of characters to a series of bytes. What bytes represent "é"? UTF-8 says C3 A9, whereas ISO-8859-1 says E9. To get the original text back from a series of bytes, we must know what we encoded it with. If we decode C3 A9 as UTF-8 data, we get "é" back; if we (mistakenly) decode it as ISO-8859-1, we get "Ã©". Junk. In pseudo-code:
utf8-decode ( utf8-encode ( text-data ) ) // OK
iso8859_1-decode ( iso8859_1-encode ( text-data ) ) // OK
iso8859_1-decode ( utf8-encode ( text-data ) ) // Fails
utf8-decode ( iso8859_1-encode ( text-data ) ) // Fails
This isn't PHP code, and isn't your fix... it's just the crux of the problem. Somewhere in the overall pipeline, one of those mismatches is happening, and things are getting confused.
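The four pseudo-code cases above can be checked with a short script. This is a sketch in Python (again, not PHP and not the fix), using "é" as the sample text; the variable names are mine:

```python
text = "é"

# The same character has different bytes in each encoding.
utf8_bytes = text.encode("utf-8")          # b"\xc3\xa9"
latin1_bytes = text.encode("iso-8859-1")   # b"\xe9"

# Matching encoder and decoder: the text survives.
ok_utf8 = utf8_bytes.decode("utf-8")
ok_latin1 = latin1_bytes.decode("iso-8859-1")

# UTF-8 bytes read as ISO-8859-1: no error, just junk ("Ã©").
garbled = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 bytes read as UTF-8: E9 alone is not valid UTF-8,
# so Python's strict decoder raises instead of returning junk.
try:
    latin1_bytes.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

print(ok_utf8, ok_latin1, garbled, type(strict_error).__name__)
```

The two failing cases fail differently: one silently yields the wrong characters, the other is rejected outright. The silent one is what produces the stray Â.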