Question
In WeasyPrint's public API, I accept either filenames or URLs (among other types) for the HTML input:
document = HTML(filename='/foo/bar/baz.html')
document = HTML(url='http://example.net/bar/baz.html')
There is also the option of not naming the argument and letting WeasyPrint guess its type:
document = HTML(sys.argv[1])
Some cases are easy: if the string starts with / on Unix, it's a filename; if it starts with http://, it's probably a URL. But we need a general algorithm that gives an answer for any string.
Currently I try to match this regexp:
^([a-z][a-z0-9.+-]*):
A string that matches starts with a valid URI scheme according to RFC 3986 (URI). This is not bad on Unix, but it fails completely on Windows: C:\foo\bar.html matches and is treated as a URL.
I could change the * to + in the regexp and only match URI schemes that are at least two characters long. Apparently there is no known URI scheme shorter than that.
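A minimal sketch of that two-character rule, assuming a hypothetical helper named looks_like_url (it is not part of WeasyPrint). The pattern is written with 0-9 because RFC 3986 allows any digit in a scheme, and it ignores case because schemes are case-insensitive:

```python
import re

# Hypothetical helper, not WeasyPrint's API. A scheme here is a letter followed
# by at least one more scheme character, so a single-letter Windows drive
# prefix such as "C:" never matches.
URL_SCHEME_RE = re.compile(r'^[a-z][a-z0-9.+-]+:', re.IGNORECASE)


def looks_like_url(string):
    """Return True if the string starts with a scheme of two or more characters."""
    return URL_SCHEME_RE.match(string) is not None
```

The guess is still imperfect: a relative filename that contains a colon early on, such as ab:cd.html on Unix, would be classified as a URL.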
Or is there a better criterion? Maybe I should just restrict "guessed" URLs to a handful of schemes; more exotic cases can still use HTML(url=foo):
url.startswith(('http:', 'https:', 'ftp:', 'data:'))
Recommended answer
If you really must guess between filenames and URLs, I'd say that a string that starts with two or more word characters followed by a colon is a URL and anything else is a file, just as you suggest.
Another option: try to open it as a file. If that fails, try to open it as a URL.
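The file-first fallback could look roughly like this. The function name open_source is an assumption made for illustration, and the sketch uses only the standard library rather than anything from WeasyPrint:

```python
import urllib.request


def open_source(string):
    """Try the string as a local file first, then fall back to a URL."""
    try:
        return open(string, 'rb')
    except OSError:
        # Not an openable file (missing, unreadable, invalid name, ...),
        # so assume the caller meant a URL.
        return urllib.request.urlopen(string)
```

If the string is neither an existing file nor something with a recognizable scheme, urlopen raises ValueError, so a caller would probably want to turn that into a clearer error message. A local file always wins over a URL with the same spelling, which may or may not be what the caller intended.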
Better still might be to heed the Zen of Python: "In the face of ambiguity, refuse the temptation to guess." Doesn't the caller know whether they are passing a filename or a URL? Have them specify it.
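One way to make the caller specify it is to accept keyword-only arguments and reject anything ambiguous. This is only a sketch: select_source is a hypothetical function, not how WeasyPrint's HTML class is implemented.

```python
def select_source(*, filename=None, url=None):
    """Return ('filename', value) or ('url', value); exactly one must be given."""
    given = [
        (kind, value)
        for kind, value in (('filename', filename), ('url', url))
        if value is not None
    ]
    if len(given) != 1:
        # Neither or both were passed: refuse to guess.
        raise TypeError('expected exactly one of filename= or url=')
    return given[0]
```

Because the parameters are keyword-only, a call such as select_source(sys.argv[1]) fails immediately instead of silently guessing.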