Problem description

I have a vector of strings, myStrings, in R that looks something like:

[1] download file from `http://example.com`
[2] this is the link to my website `another url`
[3] go to `another url` from more info.

Here `another url` stands for a valid HTTP URL; Stack Overflow will not let me insert more than one URL, which is why I'm writing `another url` instead. I want to remove all the URLs from myStrings so that it looks like:

[1] download file from
[2] this is the link to my website
[3] go to from more info.

I've tried many functions in the stringr package, but nothing works.

Recommended answer

You can use gsub with a regular expression to match URLs.

Set up the vector:

x <- c(
    "download file from http://example.com",
    "this is the link to my website http://example.com",
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net"
)

Remove all the URLs from each string:

gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
# [1] "download file from"             "this is the link to my website"
# [3] "go to from more info."          "Another url"
# [5] "And"
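
The pattern above relies on a greedy `.*`, so it works for the sample vector but can swallow ordinary text when a string contains more than one URL. As a minimal sketch of a more conservative alternative, assuming a URL ends at the first whitespace character, the match can be limited with `[^[:space:]]+`. The names `strip_urls` and `urls_mixed` are made up for this illustration and are not part of the original answer.

```r
# Hypothetical helper: remove http, https, and ftp URLs, assuming a URL
# ends at the first whitespace character.
strip_urls <- function(s) {
  gsub(" ?(f|ht)tps?://[^[:space:]]+", "", s)
}

# Illustrative input; the second string contains two URLs.
urls_mixed <- c(
  "go to http://example.com from more info.",
  "see http://example.com and https://www.example.net for details"
)

strip_urls(urls_mixed)
# [1] "go to from more info." "see and for details"

# For comparison, the greedy pattern from the answer also removes the
# word "and" that sits between the two URLs:
gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", urls_mixed[2])
# [1] "see for details"
```

One trade-off of this sketch: punctuation that directly follows a URL, such as a sentence-ending period, is treated as part of the URL and removed with it.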

Update: It would be best if you could post a few different URLs so we know what we're working with. But I think this regular expression will work for the URLs you mentioned in the comments:

" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

Explanation of the expression above:

  • " ?" optional space
  • (f|ht) match "f" or "ht"
  • (tp) match "tp"
  • (s?) optionally match "s" if it's there
  • (://) match "://"
  • (.*) match as many characters as possible (greedy), up to the last
  • [.|/] period, pipe, or forward slash (inside brackets, "|" is a literal character, not "or")
  • (.*) then everything after that, to the end of the string
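
To see what the updated pattern does in practice, here is a small sketch that applies it to the sample vector `x` from earlier (redefined so the block runs on its own). The variable names `pattern2` and `cleaned` are assumptions for illustration. Because the final `(.*)` consumes the rest of the string, any text after the URL is removed along with it, which differs from the first pattern for the third string.

```r
# Sample vector from the answer, repeated so this block is self-contained.
x <- c(
    "download file from http://example.com",
    "this is the link to my website http://example.com",
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net"
)

# The updated pattern from the answer.
pattern2 <- " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

cleaned <- gsub(pattern2, "", x)
cleaned
# [1] "download file from"             "this is the link to my website"
# [3] "go to"                          "Another url"
# [5] "And"

# Inside a bracket expression, "|" is a literal character, so the class
# also matches a pipe:
grepl("[.|/]", "a|b")
# [1] TRUE
```

If text after the URL has to be kept, as in the asker's third expected output, the first pattern or a whitespace-bounded pattern is likely the better fit.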

I'm not an expert with regular expressions, but I think I explained that correctly.

Note: URL shorteners are no longer allowed in SO answers, so I was forced to remove a section while making my most recent edit. See the edit history for that part.
