Why do I get:
Nokogiri::HTML('<a href="/test_$4b.html">test</a>').to_html
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><a href=\"/test_%244b.html\">test</a></body></html>\n"
I thought the $ symbol was valid in a URL?
Followup:
Why do browsers handle this differently? For example, on the page: http://www.pmlive.com/pharma_news/its_on_shire_and_abbvie_agree_32bn_takeover_586969
The link: http://www.pmlive.com/pharma_news/mylan_buys_abbotts_non-us_generics_in_$5.3bn_deal_585883 works.
But Nokogiri would parse this link as: http://www.pmlive.com/pharma_news/mylan_buys_abbotts_non-us_generics_in_%245.3bn_deal_585883, which does not work (returns 404).
Are browsers making the decision that a literal $ is actually safe and the better choice?
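RFC 3986 lists the dollar sign as a reserved sub-delimiter (page 12):
reserved    = gen-delims / sub-delims
gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="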
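It also recommends how reserved characters should be handled:
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.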
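The reserved sets from RFC 3986 section 2.2 can be transcribed into a small Ruby sketch that reports which set a character falls into. The constant and method names here (GEN_DELIMS, SUB_DELIMS, reserved_kind) are made up for illustration and are not part of Nokogiri or any other library.

```ruby
# Character sets transcribed from RFC 3986, section 2.2.
# The constant and method names are illustrative only.
GEN_DELIMS = ':/?#[]@'.chars.freeze
SUB_DELIMS = "!$&'()*+,;=".chars.freeze

# Returns :gen_delim, :sub_delim, or :not_reserved for a single character.
def reserved_kind(char)
  return :gen_delim if GEN_DELIMS.include?(char)
  return :sub_delim if SUB_DELIMS.include?(char)

  :not_reserved
end

puts reserved_kind("$")  # prints "sub_delim"
puts reserved_kind("a")  # prints "not_reserved"
```

So $ is legal in a URL, but it belongs to the set that may act as a delimiter, which is why a cautious serializer might choose to escape it.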
The authors of Nokogiri likely decided that, since their library may be used by anyone for any purpose, there is no way to automatically determine whether a reserved character would conflict or not. The "safest" way to handle it (short of testing a URI directly) would therefore be to escape it as per the recommendation.
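If the escaped form breaks a particular server, as with the pmlive.com link in the question, one possible workaround is to undo the escaping for the dollar sign only, after serializing. This is a rough sketch using only Ruby's standard library. The helper name restore_dollar is hypothetical, and the sketch assumes that %24 never appears in your URLs for any other reason.

```ruby
require "uri"

# "$" is byte 0x24, so its percent-encoded form is "%24".
encoded = format("%%%02X", "$".ord)

# Hypothetical helper: turn every "%24" back into a literal "$".
# The block form of gsub avoids any special handling of the replacement.
def restore_dollar(url)
  url.gsub(/%24/i) { "$" }
end

escaped = "/pharma_news/mylan_buys_abbotts_non-us_generics_in_%245.3bn_deal_585883"

puts encoded                                       # prints "%24"
puts restore_dollar(escaped)
puts URI.decode_www_form_component("%245.3bn")     # prints "$5.3bn"
```

Decoding shows that both spellings carry the same data, so a 404 for the %24 form suggests the server is comparing the raw path rather than the decoded one.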