Problem description
I am trying to fetch the contents of a table from a webpage. I just need the contents, not the tags such as <tr></tr>. I don't even need "tr" or "td", just the content. For example:
<td> I want only this </td>
<tr> and also this </tr>
<TABLE> only texts/numbers in between tags and not the tags. </TABLE>
I would also like to put the output, first column first, into a new CSV file like this:
column1,info1,info2,info3
column2,info1,info2,info3
I tried using sed to delete the patterns <tr> and <td>, but when I fetch the table there are also other tags such as <color>, <span>, etc. So what I want is to delete all the tags; in short, everything between < and >.
sed 's/<[^>]\+>//g'
will strip all tags out, but you might want to replace them with a space so that the contents of adjacent tags don't run together: <td>one</td><td>two</td> would otherwise become onetwo. So you could do
sed 's/<[^>]\+>/ /g'
which outputs one two (well, actually " one  two ", with a space wherever a tag was).
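The leftover spaces from the substitution above can be tidied in the same sed call. The following is only a sketch: the function name strip_tags is made up for illustration, and it uses <[^>]*> as a portable stand-in for the \+ form, since \+ is a GNU sed extension. It also assumes no tag is split across lines, because sed works one line at a time.

```shell
# strip_tags: hypothetical helper, reads HTML on stdin, writes plain text.
strip_tags() {
  sed -e 's/<[^>]*>/ /g' \
      -e 's/  */ /g' \
      -e 's/^ //' \
      -e 's/ $//'
  # 1st expression: replace every tag with a single space
  # 2nd expression: collapse runs of spaces into one space
  # 3rd and 4th:    drop the one leading/trailing space that may remain
}

printf '<td>one</td><td>two</td>\n' | strip_tags   # prints: one two
```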
That said, unless you need just the raw text (and it sounds like you are trying to do some transformations on the data after stripping the tags), a scripting language like Perl might be a more fitting tool for this.
As mu is too short mentioned, scraping HTML can be a bit dicey; using something that actually parses the HTML for you would be the best way to do this. PHP's DOM API is pretty good for these kinds of things.
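For the simple CSV layout asked about in the question, a sed-only sketch is still possible, with exactly the fragility warned about above. It assumes each <tr> row sits on a single line, cells are <td> or <th>, and no cell contains a comma; the function name rows_to_csv is hypothetical. Real-world HTML that breaks those assumptions needs a proper parser.

```shell
# rows_to_csv: hypothetical helper, one CSV line per table row on stdin.
rows_to_csv() {
  sed -e 's/<\/t[dh]>[[:space:]]*<t[dh][^>]*>/,/g' \
      -e 's/<[^>]*>//g' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//' \
      -e '/^$/d'
  # 1st expression: a cell boundary (</td><td>) becomes a comma
  # 2nd expression: delete every remaining tag
  # 3rd and 4th:    trim leading and trailing whitespace
  # 5th expression: drop lines that are now empty (e.g. <table>)
}

printf '<tr><td>column1</td><td>info1</td><td>info2</td></tr>\n' | rows_to_csv
# prints: column1,info1,info2
```

Redirecting the output, for example rows_to_csv < page.html > table.csv (both file names are placeholders), writes the result to a new CSV file.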