How to extract the "domain" from an email address
Problem description
I have the following pattern in my column:
[email protected]
[email protected]
Now I want to extract the text after the @ and before the ., i.e. gmail and hotmail. I am able to extract the text after the @ with the following code:
sub(".*@", "", email)
How can I modify the above to fit my use case?
Recommended answer
You:
- really need to read Section 3 of RFC 3696 (TL;DR: the @ can appear in multiple places)
- seem not to have considered that an email can be "[email protected]", "[email protected]" (i.e. naively assuming there is only a domain could come back to bite you at some point in this analysis)
- should be aware that if you're really looking for the email "domain name", then you also have to consider what really constitutes a domain name and a proper suffix.
So, unless you know for sure that you have and always will have simple email addresses, might I suggest:
library(stringi)
library(urltools)
library(dplyr)
library(purrr)
emails <- c("[email protected]", "[email protected]",
"[email protected]",
"[email protected]",
"[email protected]")
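# Find the position of the last "@" in each address, then parse what follows it
stri_locate_last_fixed(emails, "@")[,"end"] %>%
  map2_df(emails, function(x, y) {
    # x: index of the last "@"; y: the full address
    substr(y, x+1, nchar(y)) %>%
      suffix_extract()
  })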
##                         host   subdomain     domain suffix
## 1                  gmail.com        <NA>      gmail    com
## 2                hotmail.com        <NA>    hotmail    com
## 3      deparment.example.com  department    example    com
## 4 yet.another.department.com yet.another department    com
## 5             froodyco.co.uk        <NA>  froodyorg  co.uk
Note the proper splitting of subdomain, domain and suffix, especially for the last one.
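To see why that last row matters, here is a minimal base R sketch of what a naive "split on dots" rule would do. The host names are made up for illustration, and the rule itself (last label is the suffix, the one before it is the domain) is an assumption for comparison only, not what suffix_extract() does internally.

```r
# Hypothetical hosts, invented for illustration only
hosts <- c("mail.example.com", "example.co.uk")

# Naive rule (an assumption for comparison): the suffix is the last label
# and the domain is the label just before it
labels <- strsplit(hosts, ".", fixed = TRUE)
naive_suffix <- vapply(labels, function(p) p[length(p)], character(1))
naive_domain <- vapply(labels, function(p) p[length(p) - 1], character(1))

print(data.frame(host = hosts, naive_domain, naive_suffix))
```

For the second host the naive rule reports "co" as the domain and "uk" as the suffix, which is why a suffix-aware parser is the safer choice for multi-label suffixes such as co.uk.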
Knowing this, we can then change the code to:
# Same approach, but return one string per address: "subdomain.domain" when a
# subdomain is present, otherwise just the domain
stri_locate_last_fixed(emails, "@")[,"end"] %>%
  map2_chr(emails, function(x, y) {
    substr(y, x+1, nchar(y)) %>%
      suffix_extract() %>%
      mutate(full_domain=ifelse(is.na(subdomain), domain, sprintf("%s.%s", subdomain, domain))) %>%
      select(full_domain) %>%
      flatten_chr()
  })
## [1] "gmail" "hotmail"
## [3] "department.example" "yet.another.department"
## [5] "froodyorg"
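If the addresses really are guaranteed to be simple (one host label followed by a one-label suffix), the sub() call from the question can probably be adapted directly. The following is only a sketch under that assumption; the addresses and the variable name simple_domain are made up for illustration.

```r
# Hypothetical, deliberately simple addresses, invented for illustration only
email <- c("alice@gmail.com", "bob@hotmail.com")

# The greedy ".*@" consumes everything up to the LAST "@"; the capture group
# then keeps the text up to the first "." that follows it
simple_domain <- sub("^.*@([^.@]+)\\..*$", "\\1", email)
print(simple_domain)
```

Because the pattern keeps only the first label after the @, an address on a host with subdomains or a multi-label suffix would lose information here, which is the case the suffix_extract() approach handles.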