grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | sed -e 's/^.*"\([^"]\+\)".*$/\1/g'


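After searching the web for an answer to my homework problem, I finally found the command above. However, I don't fully understand what the two regular expressions used by sed and grep mean. Could someone help me out? Thanks in advance.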

Best answer

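The grep command finds every line that contains a match for the following pattern (and, because of -o, prints only the matching part; -i makes the match case-insensitive):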

'<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"'


Piece by piece, this means:

<a     the characters <a
[^>]   any single character that is not a closing '>'
\+     the previous item one or more times; together, [^>]\+ skips over
       whatever sits between <a and href (at least one character, such as
       a space or other attributes) without leaving the tag
href   followed by the string 'href'
[ ]*   followed by zero or more spaces (you don't really need the [], just ' *' would be enough)
=      followed by the equals sign
[ \t]* followed by zero or more spaces or tabs ("white space") is the intent;
       note that inside a grep bracket expression \t is taken literally
       (a backslash or the letter t), not as a tab
"      followed by an opening quote (but only a double quote...)
\(     open parenthesis (grouping)
ht     characters 'ht'
\|     or
f      character f
\)     close group (of the either-or)
tp     characters 'tp'
s\?    optionally followed by s
       Note - the last few lines combined mean 'http or https or ftp or ftps'
:      character :
[^"]\+ one or more characters that are not a double quote
       this is "everything until the next quote"
"      the closing double quote
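
To see the pattern in action, here is a small sketch. The sample HTML and the variable names are made up for illustration, and the \+, \? and \| operators assume GNU grep.

```shell
# Hypothetical sample input: two absolute links and one relative link.
html='<p>See <a class="x" href="https://example.com/page">this</a> and
<a href = "ftp://example.org/file.txt">that</a> or <a href="/relative">skip</a>.</p>'

# -i: ignore case, -o: print only the part of each line that matched.
matches=$(printf '%s\n' "$html" |
  grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"')

printf '%s\n' "$matches"
```

This should print the two absolute links, each cut off right after the closing quote of the URL; the relative link is skipped because it does not start with http, https, ftp or ftps.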


Does that get you started? You can take the sed expression apart in the same way...
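
As a starting point for the sed step, here is a rough sketch of the same kind of breakdown. The input lines are hypothetical grep output, the variable names are made up, and GNU sed is assumed.

```shell
# Hypothetical output of the grep stage: one matched anchor per line.
grep_out='<a class="x" href="https://example.com/page"
<a href = "ftp://example.org/file.txt"'

# ^.*"        greedy: everything from the start of the line up to a double quote
# \([^"]\+\)  capture one or more characters that are not a double quote
# ".*$        the closing quote and anything after it, to the end of the line
# \1          replace the whole line with the captured group
urls=$(printf '%s\n' "$grep_out" | sed -e 's/^.*"\([^"]\+\)".*$/\1/g')

printf '%s\n' "$urls"
```

Because the leading .* is greedy, the capture ends up being the last quoted string on the line, which here is the URL. The output should be https://example.com/page followed by ftp://example.org/file.txt.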

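One thing that may trip you up: the backslash is used to change the meaning of certain special characters, such as ( ) and +. And just to keep everybody on their toes, whether these characters are special with or without the backslash is not defined by regular expression syntax itself but by the command that uses it (and its options). For example, sed changes the meaning of these characters depending on whether you use the -E flag.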
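
A quick way to convince yourself of this is to run the same substitution with and without -E. This is only a sketch: the input strings and variable names are made up, and GNU sed is assumed.

```shell
# Basic syntax (GNU sed default): backslashes switch on grouping, alternation, repetition.
bre=$(printf 'abba-c\n' | sed 's/\(a\|b\)\+/X/')

# Extended syntax (-E): the same operators are written without backslashes.
ere=$(printf 'abba-c\n' | sed -E 's/(a|b)+/X/')

# Basic syntax again: unescaped ( | ) + are ordinary literal characters.
lit=$(printf '(a|b)+\n' | sed 's/(a|b)+/X/')

printf '%s\n%s\n%s\n' "$bre" "$ere" "$lit"
```

The first two commands should both print X-c, while the third matches the literal text (a|b)+ and prints X.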

Source: a similar question on Stack Overflow, "regex - meaning of the grep and sed regular expressions - extracting URLs from a web page": https://stackoverflow.com/questions/22848049/
