A Typical Workflow for Getting Started with an AI Project (Kaggle Titanic as an Example)


Data Processing

Common Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Reading the Data

train_data = pd.read_csv('./train.csv')
test_data = pd.read_csv('./test.csv')

Data Analysis

  • Inspecting the data
train_data.head()   # print the first five rows of the training data
test_data.head()    # print the first five rows of the test data
print("Total number of rows in training data ", train_data.shape[0])
print("Total number of columns in training data ", train_data.shape[1])
print("Total number of rows in test data ", test_data.shape[0])
print("Total number of columns in test data ", test_data.shape[1])
  • Plotting charts
plt.figure(figsize = (13,5))
plt.bar(train_data.columns, train_data.isna().sum())
plt.xlabel("Columns name")
plt.ylabel("Number of missing values in training data") # shows how much data each column is missing
plt.show()
  • The various matplotlib plot types

They will not be covered in detail here, but they are quite useful: they give an intuitive view of how the features are distributed, which helps with later steps such as simplifying the model.
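As a minimal sketch (with toy data standing in for the real train.csv columns), a seaborn countplot shows the survival split per category at a glance:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for train_data; the real columns come from train.csv
train_data = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "female"],
    "Survived": [0, 0, 1, 1, 0],
})

# Cross-tabulate survival per sex, then draw one bar per (Sex, Survived) pair
counts = pd.crosstab(train_data["Sex"], train_data["Survived"])
sns.countplot(data=train_data, x="Sex", hue="Survived")
plt.show()
```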

  • Computing summary tables
# Grouped computation; this example outputs the share of survivors/victims within each sex
(train_data.groupby(['Sex','Survived']).Survived.count() * 100) / train_data.groupby('Sex').Survived.count()
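The same within-group percentages can also be read from `pd.crosstab` with `normalize='index'`; a sketch on toy data:

```python
import pandas as pd

# Toy stand-in for train_data
train_data = pd.DataFrame({
    "Sex": ["male", "male", "female", "female"],
    "Survived": [0, 1, 1, 1],
})

# Each row sums to 100: the share of victims/survivors within each sex
pct = pd.crosstab(train_data["Sex"], train_data["Survived"], normalize="index") * 100
print(pct)
```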

Data Cleaning

  • Dropping uninformative columns
train_data.drop('Cabin', axis = 1, inplace = True)
test_data.drop('Cabin', axis = 1, inplace = True)

#:--------------------------------------------------:#

columns_to_drop = ['PassengerId','Ticket']
train_data.drop(columns_to_drop, axis = 1, inplace = True)

# PassengerId is needed later to identify survivors in the submission, so only
# drop 'Ticket' (columns_to_drop[1]) from the test set
test_data.drop(columns_to_drop[1], axis = 1, inplace = True)
  • Inspecting and filling missing values
combined_data = [train_data, test_data]

# Print the missing-value counts for the training and test sets
for data in combined_data:
    print(data.isnull().sum())
    print('*' * 20)

# Replace missing values with the column mean (Age, Fare)
for data in combined_data:
    data['Age'] = data['Age'].fillna(data['Age'].mean())
    data['Fare'] = data['Fare'].fillna(data['Fare'].mean())

# In the Titanic data the most common port of embarkation is Southampton,
# so fill the missing ports with 'S'. This also shows that the fill value
# need not be the mean; it should follow from the data itself
train_data['Embarked'] = train_data['Embarked'].fillna('S')
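Rather than hard-coding 'S', the most frequent value can be looked up with `mode()`; a sketch:

```python
import pandas as pd

# Toy column with a missing port
train_data = pd.DataFrame({"Embarked": ["S", "S", "C", None, "Q"]})

# mode() returns a Series of the most frequent values; take the first
most_common = train_data["Embarked"].mode()[0]
train_data["Embarked"] = train_data["Embarked"].fillna(most_common)
```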
  • Encoding categorical features
# Sex
train_data.Sex = train_data.Sex.map({'female': 0, 'male': 1})
test_data.Sex = test_data.Sex.map({'female': 0, 'male': 1})

# Port of embarkation
train_data.Embarked = train_data.Embarked.map({'S': 0, 'C': 1, 'Q': 2})
test_data.Embarked = test_data.Embarked.map({'S': 0, 'C': 1, 'Q': 2})
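The integer codes above impose an artificial ordering on the ports; one-hot encoding with `pd.get_dummies` is a common alternative. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port instead of a single ordered code
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked")
df = pd.concat([df.drop("Embarked", axis=1), dummies], axis=1)
```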

Feature Engineering

  • For example, whether a spouse/parents/siblings/children travelled along can be condensed into a single "travelling alone" feature
# Mark passengers travelling alone: 1 if no relatives aboard, 0 otherwise
# (summing SibSp and Parch and remapping only the zeros would collide with
#  passengers who have exactly one relative, so build the flag directly)
train_data['Alone'] = (train_data.SibSp + train_data.Parch == 0).astype(int)
test_data['Alone'] = (test_data.SibSp + test_data.Parch == 0).astype(int)

train_data.drop(['SibSp','Parch'], axis = 1, inplace = True)
test_data.drop(['SibSp','Parch'], axis = 1, inplace = True )
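After deriving a feature, a quick groupby against the target shows whether it carries any signal; a sketch on toy data:

```python
import pandas as pd

# Toy data: Alone == 1 means no relatives aboard
train_data = pd.DataFrame({
    "Alone": [1, 1, 0, 0],
    "Survived": [0, 0, 1, 1],
})

# The mean of a 0/1 target per group is the survival rate for that group
rate = train_data.groupby("Alone")["Survived"].mean()
```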
  • For example, passenger titles
# Count the distinct titles (a regular expression extracts the title from the name)
train_data.Name.str.extract(r' ([A-Za-z]+)\.', expand=False).unique().size

# Extract the title, then drop the name
for data in combined_data:
    data['Title'] = data.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    data.drop('Name', axis=1, inplace=True)

# Print the titles and their counts
train_data.Title.value_counts()

# Replace rarely occurring titles with 'Rare' (simplifies the model)
least_occurring = ['Don', 'Rev', 'Dr', 'Mme', 'Ms',
                   'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'Countess', 'Dona',
                   'Jonkheer']
for data in combined_data:
    data.Title = data.Title.replace(least_occurring, 'Rare')

# Map the titles to numbers
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for data in combined_data:
    data['Title'] = data['Title'].map(title_mapping)
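A self-contained sketch of the title extraction on two example names (the regex captures the word before the first period):

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
])

# A space, then letters, then a literal period: the title
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
```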
  • Binning the age feature
# Group the ages into bins
for dataset in combined_data:
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4
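The chained `.loc` assignments can be written more compactly with `pd.cut` (bin edges assumed to match the thresholds above):

```python
import pandas as pd

ages = pd.DataFrame({"Age": [10, 20, 40, 60, 70]})

# Right-inclusive bins: (-1,16], (16,32], (32,48], (48,64], (64,200]
bins = [-1, 16, 32, 48, 64, 200]
ages["Age"] = pd.cut(ages["Age"], bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)
```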

About selecting with loc

# loc selects rows and columns

# Select the single element of df at row label 'a', column label 'b'
df.loc['a', 'b']

# An expression can also be used: select the names of passengers older than 18
df.loc[df['Age'] > 18, 'Name']
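A runnable sketch of both forms of `.loc` on a tiny frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Ann", "Bob"], "Age": [25, 12]},
    index=["a", "b"],
)

# Single element by row label and column label
cell = df.loc["a", "Age"]

# Boolean mask plus column label: names of passengers older than 18
adults = df.loc[df["Age"] > 18, "Name"]
```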

Data Preparation

import torch

X_train = train_data.drop("Survived", axis=1)
Y_train = train_data["Survived"]

X_test = test_data.drop("PassengerId", axis=1)

print("Shape of X_train", X_train.shape)
print("Shape of Y_train", Y_train.shape)
print("Shape of X_test", X_test.shape)

# Convert the data to tensors
X_train = torch.tensor(X_train.values).float()
Y_train = torch.tensor(Y_train.values).float().view(-1, 1)
X_test = torch.tensor(X_test.values).float()

Neural Network

Common Libraries

import torch
import torch.nn as nn
import torch.optim as optim

Building & Training the Network

# Define the model class (despite the name, this is a small MLP for binary classification)
class LinearRegression(nn.Module):
    def __init__(self, input_size, output_size):
        super(LinearRegression, self).__init__()

        # Linear layers
        self.linear0 = nn.Linear(input_size, 42)
        self.linear1 = nn.Linear(42, 168)
        self.linear2 = nn.Linear(168, 84)
        self.linear3 = nn.Linear(84, output_size)

        self.relu = nn.ReLU()
        # A sigmoid activation turns the linear output into a probability
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Forward pass: compute the model's output
        out = self.linear0(x)
        out = self.relu(out)

        out = self.linear1(out)
        out = self.relu(out)

        out = self.linear2(out)
        out = self.relu(out)

        out = self.linear3(out)
        out = self.sigmoid(out)

        return out

# Hyperparameters
input_size = 7        # dimension of the input features
output_size = 1       # dimension of the output probability
learning_rate = 0.01  # learning rate
epochs = 200          # number of training epochs

# Create the model instance
model = LinearRegression(input_size, output_size)

# Loss function: binary cross-entropy
criterion = nn.BCELoss()

# Optimizer: Adam
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
for epoch in range(epochs):
    # Forward pass: compute the output and the loss
    outputs = model(X_train)
    loss = criterion(outputs, Y_train)
    # Backward pass: compute gradients and update the parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Print the loss each epoch
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

Model Evaluation

from sklearn import metrics
Y_pred_rand = (model(X_train) > 0.5).to(torch.int)
print('Precision : ', np.round(metrics.precision_score(Y_train, Y_pred_rand)*100,2))
print('Accuracy : ', np.round(metrics.accuracy_score(Y_train, Y_pred_rand)*100,2))
print('Recall : ', np.round(metrics.recall_score(Y_train, Y_pred_rand)*100,2))
print('F1 score : ', np.round(metrics.f1_score(Y_train, Y_pred_rand)*100,2))
print('AUC : ', np.round(metrics.roc_auc_score(Y_train, Y_pred_rand)*100,2))

About the evaluation metrics

| Metric | Meaning |
| --- | --- |
| Precision | Share of samples predicted positive that are actually positive. |
| Accuracy | Share of all samples that the model predicts correctly. |
| Recall | Share of actual positives that the model predicts as positive. |
| F1 score | Harmonic mean of precision and recall; measures overall performance. |
| AUC | Ability to separate the positive and negative classes; higher is better. |
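A small worked example tying the first four metrics to one set of toy predictions (TP=2, FN=1, FP=1, TN=2):

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])

precision = metrics.precision_score(y_true, y_pred)  # TP/(TP+FP) = 2/3
recall = metrics.recall_score(y_true, y_pred)        # TP/(TP+FN) = 2/3
f1 = metrics.f1_score(y_true, y_pred)                # harmonic mean = 2/3
accuracy = metrics.accuracy_score(y_true, y_pred)    # (TP+TN)/all = 4/6
```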
# Draw a heatmap of the confusion matrix
matrix = metrics.confusion_matrix(Y_train, Y_pred_rand)
sns.heatmap(matrix, annot = True,fmt = 'g')
plt.show()

Prediction & Submission

Model Prediction

predict = model(X_test)
predict = (predict > 0.5).to(torch.int).ravel().numpy()
print(predict)

Submission

submit = pd.DataFrame({"PassengerId":test_data.PassengerId, 'Survived':predict})
submit.to_csv("final_submission.csv", index = False) # index=False keeps pandas from writing an extra index column

Recommended Sites

Kaggle