A dataset of about 230,000 URLs for telling phishing websites apart!

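1. What is a phishing website?

Network security problems keep growing in the digital age, and phishing websites are one of the most common threats. A phishing website usually disguises itself as a legitimate site and tricks users into entering sensitive information such as usernames, passwords, and bank account details, which is then used to steal their personal data and assets. To stay safe online, we need to learn how to recognize phishing websites.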

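Phishing websites typically share the following traits:

The domain name resembles that of the genuine site but may contain misspellings or special characters.
The page layout resembles the genuine site but may differ in small details.
The site may ask you for sensitive information such as usernames, passwords, or bank account details.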
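The first trait, a domain that is almost but not quite the real one, can be checked roughly in code. The sketch below is only an illustration and not how the dataset's features were computed: it compares a URL's host against a hypothetical allow-list (`TRUSTED_DOMAINS`) using the standard library's `difflib.SequenceMatcher`, and the `0.8` threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical allow-list of trusted domains; not part of the dataset.
TRUSTED_DOMAINS = ["paypal.com", "google.com", "microsoft.com"]


def extract_host(url: str) -> str:
    """Return the lowercase host of a URL, without a leading 'www.'.

    The URL must include a scheme (e.g. 'https://'), otherwise urlparse
    cannot find the host and an empty string is returned.
    """
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def similarity(a: str, b: str) -> float:
    """Similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def closest_trusted_domain(url: str, threshold: float = 0.8):
    """Return (domain, score) if the host resembles a trusted domain
    without matching it exactly; otherwise return None."""
    host = extract_host(url)
    if not host or host in TRUSTED_DOMAINS:
        return None
    best = max(TRUSTED_DOMAINS, key=lambda d: similarity(host, d))
    score = similarity(host, best)
    return (best, score) if score >= threshold else None


# 'paypa1.com' swaps the letter 'l' for the digit '1'.
print(closest_trusted_domain("https://www.paypa1.com/login"))
print(closest_trusted_domain("https://www.paypal.com/login"))
```

A string-similarity check like this only catches near-misspellings; it says nothing about page layout or the information a site asks for, which is why a labeled dataset is useful.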

Today's post shares the PhiUSIIL Phishing URL Dataset from the UCI Machine Learning Repository.

2. About the dataset

The PhiUSIIL Phishing URL Dataset is a CSV file of roughly 100 MB, which we can read with pandas.

Citing the dataset

If you use the dataset in a paper, cite it like this:

Prasad, Arvind and Chandra, Shalini. (2024). PhiUSIIL Phishing URL. UCI Machine Learning Repository. https://doi.org/10.1016/j.cose.2023.103545.

In BibTeX:

@misc{misc_phiusiil_phishing_url_967,
  author       = {Prasad,Arvind and Chandra,Shalini},
  title        = {{PhiUSIIL Phishing URL}},
  year         = {2024},
  howpublished = {UCI Machine Learning Repository},
  note         = {{DOI}: https://doi.org/10.1016/j.cose.2023.103545}
}

A look at the data

The dataset has 235,795 rows and 56 columns.

[Image: 5 randomly sampled rows from the dataset]

Field notes

label=1 marks a legitimate URL and label=0 marks a phishing URL, as described on the UCI dataset page.
The "FILENAME" column can be ignored.

[Image: detailed description of each field]

3. Data analysis

Reading the data

Jupyter Notebook is recommended for the steps below.

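import pandas as pd
df = pd.read_csv("./PhiUSIIL_Phishing_URL_Dataset.csv")
# Look at 5 random rows (in a notebook, run each expression in its own cell to see its output)
df.sample(5)
# Check the dimensions as (rows, columns)
df.shape
# Show column dtypes and non-null counts
df.info()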
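Once the data is loaded, a natural next step is to drop the "FILENAME" column and look at the class balance. The sketch below is one way to do that, not the only one. The helper names `prepare` and `label_counts` are made up for this post, and the tiny `demo` frame is invented data standing in for the real `df`. The UCI dataset page describes label 1 as legitimate and label 0 as phishing, which is the mapping assumed here; it is worth checking against the class counts in your own copy.

```python
import pandas as pd

# Assumed mapping, following the UCI dataset page: 1 = legitimate, 0 = phishing.
LABEL_NAMES = {1: "legitimate", 0: "phishing"}


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the frame without the FILENAME column (if present)."""
    return df.drop(columns=["FILENAME"], errors="ignore")


def label_counts(df: pd.DataFrame) -> dict:
    """Count rows per class, keyed by a readable class name."""
    counts = df["label"].value_counts()
    return {LABEL_NAMES[int(k)]: int(v) for k, v in counts.items()}


# Tiny made-up frame standing in for the real data (values are illustrative only).
demo = pd.DataFrame(
    {
        "FILENAME": ["a.txt", "b.txt", "c.txt"],
        "URL": ["https://example.com", "http://examp1e.test", "https://example.org"],
        "label": [1, 0, 1],
    }
)

clean = prepare(demo)
print(list(clean.columns))
print(label_counts(clean))
```

With the real data, `prepare(df)` and `label_counts(df)` can be called on the frame returned by `pd.read_csv` in the same way.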

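Reading the data with ucimlrepo

This route is fairly slow, so downloading the dataset from the official site is recommended instead.

Install the ucimlrepo library

pip install ucimlrepo

Read the data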

from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
phiusiil_phishing_url = fetch_ucirepo(id=967) 
  
# data (as pandas dataframes) 
X = phiusiil_phishing_url.data.features 
y = phiusiil_phishing_url.data.targets 
  
# metadata 
print(phiusiil_phishing_url.metadata) 
  
# variable information 
print(phiusiil_phishing_url.variables)

4. Download link

http://archive.ics.uci.edu/static/public/967/phiusiil+phishing+url+dataset.zip
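If you would rather fetch the archive from a script than from the browser, something like the following should work. It is a minimal sketch using only the standard library. The function names are made up for this post, and because the name of the CSV file inside the zip is not assumed, the code simply extracts the first `.csv` member it finds.

```python
import io
import urllib.request
import zipfile

DATASET_URL = "http://archive.ics.uci.edu/static/public/967/phiusiil+phishing+url+dataset.zip"


def extract_csv(zip_bytes: bytes, dest_dir: str = ".") -> str:
    """Extract the first .csv member of a zip archive and return its path."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in the archive")
        return archive.extract(csv_names[0], path=dest_dir)


def download_dataset(url: str = DATASET_URL, dest_dir: str = ".") -> str:
    """Download the zip archive into memory and extract its CSV file."""
    with urllib.request.urlopen(url) as response:
        return extract_csv(response.read(), dest_dir)
```

Calling `download_dataset()` returns the path of the extracted CSV, which can then be passed to `pd.read_csv`. The whole archive is held in memory during the download, which should be acceptable for a file of this size.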

帅帅的Python