Deep Learning and Large Models, Lesson 2: Machine Learning in Practice

Course code and datasets: https://pan.baidu.com/s/1Mp6deFtxIJVgUwmhuKLh1w?pwd=fepk

This post walks through how to use Python and common machine learning libraries to complete several basic machine learning tasks, covering classic dataset preprocessing, model training, and evaluation. The code is kept detailed so that beginners can follow the whole workflow.

1. Classifying the Iris Dataset

1.1 Data Preprocessing

First, we load the iris dataset from a CSV file with pandas and do some initial processing. The code is as follows:

import pandas as pd

# Load the dataset
iris = pd.read_csv('./iris.csv')

# Feature matrix (the last two feature columns are dropped by the slice) and labels
x = iris.drop('species', axis=1).iloc[:, :-2].values
y = iris['species']

# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=99)

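Before moving on to training, it is worth a quick sanity check that the data loaded as expected. The snippet below is an optional addition (not part of the original course code) that prints the first rows, the class distribution, and the array shapes:

# Optional sanity check on the loaded data
print(iris.head())                     # first five rows of the raw DataFrame
print(iris['species'].value_counts())  # how many samples per class
print(x.shape, y.shape)                # feature matrix and label vector sizes
print(X_train.shape, X_test.shape)     # sizes of the train/test split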

1.2 Model Training and Evaluation

Train a Logistic Regression model and compute its accuracy on the test set:

from sklearn.linear_model import LogisticRegression

# Initialize the model
log_reg = LogisticRegression(solver="liblinear", random_state=66)

# Train the model
log_reg.fit(X_train, y_train)

# Evaluate on the test set
accuracy = log_reg.score(X_test, y_test)
print(f'Model accuracy: {accuracy:.4f}')
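
Accuracy alone can hide per-class behaviour. As an optional follow-up (not in the original code), the trained model can also be inspected with a confusion matrix and a per-class classification report from scikit-learn:

from sklearn.metrics import classification_report, confusion_matrix

# Predict on the held-out test set and report per-class precision/recall
y_pred = log_reg.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))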

2. California Housing Price Prediction

2.1 Data Preprocessing

We again use pandas to read the housing data and fill in missing values, this time with SimpleImputer's median strategy:

import pandas as pd
from sklearn.impute import SimpleImputer

# Load the dataset
housing = pd.read_csv('D:/tmp/housing.csv')

# Handle missing values
imputer = SimpleImputer(strategy="median")
housing['total_bedrooms'] = imputer.fit_transform(housing[['total_bedrooms']])

# Feature selection: drop unused columns and the target itself
X = housing.drop(['households', 'total_bedrooms', 'population', 'longitude',
                  'median_house_value'], axis=1)
y = housing['median_house_value']

# Separate the categorical feature from the numerical ones
X_cat = X[['ocean_proximity']]
X_num = X.drop('ocean_proximity', axis=1)

# One-hot encoding and standardization
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import numpy as np

# Note: for scikit-learn >= 1.2, use OneHotEncoder(sparse_output=False) instead
X_1hot = OneHotEncoder(sparse=False).fit_transform(X_cat)
X_std = StandardScaler().fit_transform(X_num)
X_prepared = np.c_[X_std, X_1hot]
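
The same preprocessing can also be expressed with a ColumnTransformer, which bundles the numerical and categorical steps into one object that can later be reused on new data. This is a minimal sketch under the same column assumptions as above, not part of the original course code:

from sklearn.compose import ColumnTransformer

# One transformer that scales numeric columns and one-hot encodes the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), X_num.columns.tolist()),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])
X_prepared_alt = preprocess.fit_transform(X)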

2.2 Model Training and Tuning

We use a random forest regressor and tune its hyperparameters with RandomizedSearchCV. Note that the prepared features must first be split into training and test sets:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV

# Split the prepared features into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_prepared, y, test_size=0.2, random_state=42)

# Initialize the model
forest_reg = RandomForestRegressor()

# Parameter grid
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# Randomized search over the grid, scored by negative MSE
random_search = RandomizedSearchCV(forest_reg, param_grid, n_iter=10, cv=5, scoring='neg_mean_squared_error')
random_search.fit(X_train, y_train)

# Best model: evaluate its R^2 score on the test set
random_final_model = random_search.best_estimator_
score = random_final_model.score(X_test, y_test)
print(f'Model R^2 score: {score:.4f}')
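
Because the search is scored with negative MSE, it is natural to also report the RMSE of the best model on the test set. This is a short optional addition, not in the original code:

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE of the tuned model on the held-out test set
final_predictions = random_final_model.predict(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(f'Test RMSE: {final_rmse:.2f}')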


3. Regression Analysis of Housing Price Data

3.1 Data Preprocessing and Modeling

This part reads a housing price dataset, preprocesses it (missing-value handling, feature encoding, and standardization), trains a decision tree regression model, evaluates its performance, and prints predicted prices alongside the actual ones:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Load the data
file_path = 'D:/tmp/kc_house.csv'
data = pd.read_csv(file_path)

# Clean the data
# Missing-value handling (a simple example; adapt to the actual data)
data_cleaned = data.dropna()

# Drop the irrelevant index column
data_cleaned = data_cleaned.drop(columns=['Unnamed: 0'])

# Feature engineering: one-hot encode the categorical variable house_age
data_cleaned = pd.get_dummies(data_cleaned, columns=['house_age'], drop_first=True)

# Standardize the features
scaler = StandardScaler()
X = data_cleaned.drop(columns=['price'])
X_scaled = scaler.fit_transform(X)
y = data_cleaned['price']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Build the model
tree_reg = DecisionTreeRegressor(max_depth=4)

# Train the model
tree_reg.fit(X_train, y_train)

# Evaluate the model (R^2 score on the test set)
final_score = tree_reg.score(X_test, y_test)
print("Model R^2 score:", final_score)

# Predict with the model and print predictions next to the actual prices
y_pred = tree_reg.predict(X_test)
for i in range(len(X_test)):
    print(f"Predicted price: {y_pred[i]:.2f}, actual price: {y_test.values[i]:.2f}")


4. Fashion Item Recognition: MNIST Dataset Classification

4.1 Loading and Visualizing the Data

First, read the mnist dataset and display one sample image:

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

# Load the data
data = pd.read_csv('D:/tmp/mnist.csv')

# Split into labels and pixel features (drop the label column from the features)
y = data['label']
X = data.drop('label', axis=1).values

# Visualize one sample
some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()
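
A quick check of the array shapes and of the label for the displayed sample helps confirm that the pixels and labels were separated correctly (a small optional addition):

# Sanity check: each image should have 784 pixel columns
print(X.shape, y.shape)
print("Label of sample 36000:", y[36000])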

4.2 Visualizing Multiple Samples

Define a helper function to plot many images at once and display more sample images from the dataset:

def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    # Reshape each flat 784-vector back into a 28x28 image
    images = [instance.reshape(size, size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    # Append one blank block so the last row can be padded to full width
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    # Concatenate images into rows, then stack the rows into one big image
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap=matplotlib.cm.binary, **options)
    plt.axis("off")

plt.figure(figsize=(9,9))
example_images = np.r_[X[:12000:600], X[13000:30600:600], X[30600:60000:590]] 
plot_digits(example_images, images_per_row=10)
plt.show()

4.3 Model Building and Optimization

Use a random forest classifier on the MNIST dataset and optimize it with RandomizedSearchCV:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the random forest classifier
forest_clf = RandomForestClassifier(n_estimators=100, max_depth=3, max_samples=500)
forest_clf.fit(X_train, y_train)

# Evaluate the initial model
accuracy = forest_clf.score(X_test, y_test)
print(f'Initial model accuracy: {accuracy:.4f}')

# Define the parameter grid for tuning
param_grid = [
    {'n_estimators': [50, 100, 200], 'max_depth': [2, 3, 4, 5]},
    {'bootstrap': [False], 'n_estimators': [50, 100], 'max_depth': [2, 3, 4]},
]

# Randomized search; use a fresh estimator (max_samples is incompatible with
# bootstrap=False) and score with accuracy, which suits classification
random_search = RandomizedSearchCV(RandomForestClassifier(), param_grid, n_iter=10, cv=5, scoring='accuracy')
random_search.fit(X_train, y_train)

# Evaluate the best model found by the search
best_forest_clf = random_search.best_estimator_
final_accuracy = best_forest_clf.score(X_test, y_test)
print(f'Accuracy after tuning: {final_accuracy:.4f}')
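
To see which hyperparameter combination the search settled on and the cross-validated score that selected it, the fitted search object exposes best_params_ and best_score_; a small optional addition:

# Inspect the winning hyperparameters and their cross-validated accuracy
print("Best parameters:", random_search.best_params_)
print(f"Best cross-validated accuracy: {random_search.best_score_:.4f}")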

Summary

In this article we briefly showed how to train and evaluate models in Python with common machine learning libraries such as pandas and Scikit-Learn. We hope these hands-on examples help readers better understand the basic principles and workflow of machine learning.
