Problem description
I have an HTML file with about 150 anchor tags. I need only the links from these tags, i.e. from something like <a href="http://www.google.com">, I want to get only the http://www.google.com part.
When I run a grep,
cat website.htm | grep -E '<a href=".*">' > links.txt
this returns the entire line that matched, not just the link I want, so I tried adding a cut command:
cat drawspace.txt | grep -E '<a href=".*">' | cut -d’"’ --output-delimiter=$'\n' > links.txt
Except that it is wrong: it doesn't work and gives me an error about wrong parameters... So I assume that the file was supposed to be passed along too. Maybe like cut -d’"’ --output-delimiter=$'\n' grepedText.txt > links.txt.
But I wanted to do this in one command if possible... So I tried doing an AWK command.
cat drawspace.txt | grep '<a href=".*">' | awk '{print $2}’
But this wouldn't run either. It was asking me for more input, because I wasn't finished....
I tried writing a batch file, and it told me FINDSTR is not an internal or external command... So I assume my environment variables were messed up and rather than fix that I tried installing grep on Windows, but that gave me the same error....
The question is, what is the right way to strip out the HTTP links from HTML? With that I will make it work for my situation.
P.S. I've read so many links/Stack Overflow posts that showing my references would take too long.... If example HTML is needed to show the complexity of the process then I will add it.
I have both a Mac and a PC and switch back and forth between them to use their shell/batch/grep/terminal commands, so a solution for either one will help me.
I also want to point out that I'm in the correct directory.
HTML:
<tr valign="top">
<td class="beginner">
B03
</td>
<td>
<a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B04
</td>
<td>
<a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B05
</td>
<td>
<a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B06
</td>
<td>
<a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>
</tr>
Expected output:
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
etc.
$ sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
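One caveat with the sed one-liner above: its leading `.*` is greedy, so it prints at most one link per line, and on a line with several anchors only the last one. As a minimal sketch of an alternative, assuming a grep that supports `-o` (GNU grep and the BSD grep shipped with macOS both do), the following prints every `href` value and stores the result in links.txt as the question asks. The file name sample.htm is made up for illustration and the sample is a shortened copy of the HTML from the question.

```shell
# Build a small sample file (hypothetical name) that mirrors the HTML in the question.
cat > sample.htm <<'EOF'
<tr valign="top">
<td>
<a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
</tr>
<tr valign="top">
<td>
<a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>
EOF

# grep -o prints only the matched text, one match per output line,
# so several anchors on the same input line are all reported.
# sed then strips the leading href=" and the trailing quote.
grep -o 'href="[^"]*"' sample.htm | sed 's/^href="//; s/"$//' > links.txt

cat links.txt
```

If each line holds a single anchor and `href` is the first quoted attribute on that line, as in the sample HTML, `grep 'href=' sample.htm | cut -d'"' -f2` should also work: cut needs a field list (`-f`), which the attempt in the question left out.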