This article covers extracting the TLD from a list of URLs and, for each TLD's file, sorting the entries by domain and subdomain.
Problem Description
I have a list of millions of URLs. I need to extract the TLD of each URL and create a separate file for each TLD: for example, collect all URLs whose TLD is .com and dump them into one file, all .edu URLs into another file, and so on. Within each file, the URLs must be sorted alphabetically by domain, then by subdomain, and so on.
Can anyone give me a head start on implementing this in Perl?
Recommended Answer
- Use URI to parse the URL,
- use its host method to get the host,
- use Domain::PublicSuffix's get_root_domain to parse the host name, and
- use the tld or suffix method to get the real TLD or the pseudo-TLD.
use strict;
use warnings;
use feature qw( say );

use Domain::PublicSuffix qw( );
use URI qw( );

my $dps = Domain::PublicSuffix->new();

for (qw(
    http://www.google.com/
    http://www.google.co.uk/
)) {
    my $url = $_;

    # Treat relative URLs as absolute URLs with a missing http://.
    $url = "http://$url" if $url !~ /^\w+:/;

    my $host = URI->new($url)->host();
    $host =~ s/\.\z//;  # D::PS doesn't handle "domain.com.".

    $dps->get_root_domain($host)
        or die $dps->error();

    say $dps->tld();     # com  uk
    say $dps->suffix();  # com  co.uk
}
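The snippet above only extracts the TLD for a couple of hard-coded URLs. As a rough sketch of the rest of the task (grouping the URLs into one file per TLD and sorting each file by domain, then by subdomain), one approach is to bucket by the pseudo-TLD from suffix() and sort on the host's labels in reverse order, so that the registered domain compares before its subdomains. This sketch assumes URLs arrive one per line on STDIN and that output files named urls.<suffix> are acceptable; both are illustrative choices, not part of the original answer:

use strict;
use warnings;
use feature qw( say );

use Domain::PublicSuffix qw( );
use URI qw( );

my $dps = Domain::PublicSuffix->new();
my %urls_by_tld;

while ( my $url = <STDIN> ) {
    chomp $url;
    $url = "http://$url" if $url !~ /^\w+:/;

    my $host = URI->new($url)->host();
    $host =~ s/\.\z//;

    $dps->get_root_domain($host)
        or next;  # Skip hosts the public-suffix list can't parse.

    push @{ $urls_by_tld{ $dps->suffix() } }, [ $host, $url ];
}

for my $tld ( keys %urls_by_tld ) {
    open my $fh, '>', "urls.$tld"
        or die "Can't write urls.$tld: $!";

    # Sort on the reversed host labels ("www.google.com" sorts as
    # "com.google.www"), so domains group together and subdomains
    # follow their parent domain.
    for my $rec (
        sort { $a->[0] cmp $b->[0] }
        map  { [ join( '.', reverse split /\./, $_->[0] ), $_->[1] ] }
        @{ $urls_by_tld{$tld} }
    ) {
        say $fh $rec->[1];
    }

    close $fh;
}

With millions of URLs this keeps everything in memory; if that is too much, the same per-TLD bucketing works with an external sort on the reversed-label key instead.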