问题描述
我需要解析几个页面来获取他们所有的Youtube ID。
I need to parse several pages to get all of their Youtube IDs.
我在网上发现了很多正则表达式,但是:Java表达不完整(除了ID之外,他们要么给我垃圾,要么他们错过了一些ID。
I found many regular expressions on the web, but : the Java ones are not complete (they either give me garbage in addition to the IDs, or they miss some IDs).
我发现的那个似乎完整的是。但它是用JavaScript和PHP编写的。不幸的是我无法将它们翻译成JAVA。
The one that I found that seems to be complete is hosted here. But it is written in JavaScript and PHP. Unfortunately I couldn't translate them into JAVA.
有人可以帮我用Java重写这个PHP正则表达式或以下的JavaScript吗?
Can somebody help me rewrite this PHP regex or the following JavaScript one in Java?
'~
https?:// # Required scheme. Either http or https.
(?:[0-9A-Z-]+\.)? # Optional subdomain.
(?: # Group host alternatives.
youtu\.be/ # Either youtu.be,
| youtube\.com # or youtube.com followed by
\S* # Allow anything up to VIDEO_ID,
[^\w\-\s] # but char before ID is non-ID char.
) # End host alternatives.
([\w\-]{11}) # $1: VIDEO_ID is exactly 11 chars.
(?=[^\w\-]|$) # Assert next char is non-ID or EOS.
(?! # Assert URL is not pre-linked.
[?=&+%\w]* # Allow URL (query) remainder.
(?: # Group pre-linked alternatives.
[\'"][^<>]*> # Either inside a start tag,
| </a> # or inside <a> element text contents.
) # End recognized pre-linked alts.
) # End negative lookahead assertion.
[?=&+%\w]* # Consume any URL (query) remainder.
~ix'
/https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube\.com\S*[^\w\-\s])([\w\-]{11})(?=[^\w\-]|$)(?![?=&+%\w]*(?:['"][^<>]*>|<\/a>))[?=&+%\w]*/ig;
推荐答案
首先你需要插入额外的反斜杠 \
旧的正则表达式中的foreach反斜杠,否则java认为你逃脱了字符串中的其他特殊字符,你没有这样做。
First of all you need to insert and extra backslash \
foreach backslash in the old regex, else java thinks you escapes some other special characters in the string, which you are not doing.
https?:\\/\\/(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*
接下来编译模式时,需要添加 CASE_INSENSITIVE
flag。这是一个例子:
Next when you compile your pattern you need to add the CASE_INSENSITIVE
flag. Here's an example:
String pattern = "https?:\\/\\/(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*";
Pattern compiledPattern = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = compiledPattern.matcher(link);
while(matcher.find()) {
System.out.println(matcher.group());
}
这篇关于Youtube完整的Java Regex的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!