How to extract the "domain" from an email address

Problem description

I have a column with values of the following pattern:

[email protected]
[email protected]

Now, I want to extract the text after @ and before ., i.e. gmail and hotmail. I am able to extract the text after @ with the following code.

sub(".*@", "", email)

How can I modify the above to fit my use case?

Recommended answer

You:

  1. really need to read Section 3 of RFC 3696 (TL;DR: the @ can appear in multiple places)
  2. seem not to have considered that an email can be "[email protected]", "[email protected]" (i.e. naively assuming there is only a domain could come back to bite you at some point in this analysis)
  3. should be aware that if you're really looking for the email "domain name", then you also have to consider what really constitutes a domain name and a proper suffix.
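To make the first point concrete, here is a minimal base R sketch of how the choice between the first and the last @ changes the result. The addresses are hypothetical and used only for illustration; the second one has a quoted @ in its local part.

```r
# Hypothetical addresses for illustration only (not from the question).
# The second address has a quoted "@" inside the local part.
emails <- c("alice@gmail.com", "\"odd@local\"@example.com")

# Greedy ".*@" consumes everything up to the LAST "@", leaving the host.
host_last <- sub(".*@", "", emails)

# "^[^@]*@" stops at the FIRST "@", which mis-splits the quoted address.
host_first <- sub("^[^@]*@", "", emails)

print(host_last)
print(host_first)
```

Splitting on the last @ keeps the host intact for both addresses, whereas splitting on the first @ leaves part of the local part attached to the second one.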

So, unless you know for sure that you have and always will have simple email addresses, might I suggest:

library(stringi)
library(urltools)
library(dplyr)
library(purrr)

emails <- c("[email protected]", "[email protected]",
            "[email protected]",
            "[email protected]",
            "[email protected]")

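# Locate the last "@" in each address, then analyse the host that follows it
stri_locate_last_fixed(emails, "@")[,"end"] %>%
  map2_df(emails, function(x, y) {
    # x = position of the last "@", y = the full address
    substr(y, x+1, nchar(y)) %>%
      suffix_extract()  # split the host into subdomain, domain and suffix
  })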
##                         host    subdomain      domain suffix
## 1                  gmail.com         <NA>       gmail    com
## 2                hotmail.com         <NA>     hotmail    com
## 3     department.example.com   department     example    com
## 4 yet.another.department.com  yet.another  department    com
## 5            froodyorg.co.uk         <NA>   froodyorg  co.uk

Note the proper splitting of subdomain, domain, and suffix, especially for the last one.
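For contrast, here is a small base R sketch of what a naive rule does on the same kinds of hosts. It assumes the suffix is always just the final dot-separated label, and the host names are hypothetical stand-ins for the cases above.

```r
# Hypothetical host names, one per case discussed above.
hosts <- c("gmail.com", "department.example.com", "froodyorg.co.uk")

# Naive rule: treat only the final label as the suffix and drop it.
naive <- sub("\\.[^.]+$", "", hosts)

print(naive)
```

Under that assumption the last host keeps a stray "co" ("froodyorg.co"), which is why a lookup against a list of known suffixes, as `suffix_extract()` appears to do, is the safer choice for multi-part suffixes such as co.uk.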

Knowing this, we can then change the code to:

stri_locate_last_fixed(emails, "@")[,"end"] %>%
  map2_chr(emails, function(x, y) {
    substr(y, x+1, nchar(y)) %>%
      suffix_extract() %>%
      # keep the subdomain (if any) in front of the domain, and drop the suffix
      mutate(full_domain=ifelse(is.na(subdomain), domain, sprintf("%s.%s", subdomain, domain))) %>%
      select(full_domain) %>%
      flatten_chr()
  })
## [1] "gmail"                   "hotmail"
## [3] "department.example"      "yet.another.department"
## [5] "froodyorg"
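If the addresses really are guaranteed to be simple (one @, a single-label suffix, no subdomains), a base R sketch closer to the original `sub()` call might be enough. The addresses below are hypothetical, and this is a shortcut under those assumptions, not a replacement for the suffix-aware approach.

```r
# Hypothetical simple addresses, for illustration only.
emails <- c("alice@gmail.com", "bob@hotmail.com")

# Step 1: drop everything up to and including the last "@".
hosts <- sub(".*@", "", emails)

# Step 2: drop everything from the first "." onward.
first_label <- sub("\\..*$", "", hosts)

print(first_label)
```

Note that for a host such as department.example.com this shortcut would return only "department", which differs from the "department.example" produced above.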

