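Hands-On Data Analysis_Task05

Chapter 3: Model Building and Evaluation

Feature Engineering

Task 1: Filling missing values

For categorical variables: fill with a placeholder string (NA), or with the most frequent category
For continuous variables: fill with the mean, median, or mode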

Fill the categorical variables
train['Cabin'] = train['Cabin'].fillna('NA')
train['Embarked'] = train['Embarked'].fillna('S')

Fill the continuous variable
train['Age'] = train['Age'].fillna(train['Age'].mean())

Count the remaining missing values in each column
train.isnull().sum().sort_values(ascending=False)
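The code above uses a constant and the mean; the median and mode options from the list can be sketched the same way. This is a minimal illustration on a made-up toy DataFrame (the values and the name `toy` are assumptions, not the Titanic data). It also shows `isnull().mean()`, which gives the missing ratio rather than the count.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not the Titanic set)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Fraction of missing values per column (count would be isnull().sum())
missing_ratio = toy.isnull().mean()

# Continuous column: the median is less sensitive to outliers than the mean
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

# Categorical column: mode() returns a Series, so take its first entry
embarked_mode = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)

missing_after = int(toy.isnull().sum().sum())
print(missing_ratio)
print(toy)
```

Whether the mean, median, or mode is the better choice depends on the distribution of the column, so it is worth checking a histogram before deciding.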

Task 2: Encoding categorical variables

Select all input features
data = train[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]

Convert the categorical columns to dummy (one-hot) variables
import pandas as pd
data = pd.get_dummies(data)
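To see what `pd.get_dummies` does to a frame like this, here is a small sketch on a hypothetical three-row DataFrame (the values are made up). With default arguments, numeric columns such as `Pclass` pass through unchanged, and each text column is replaced by one indicator column per category.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few of the Titanic columns
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Only the text columns are expanded; Pclass stays numeric
encoded = pd.get_dummies(toy)
encoded_columns = list(encoded.columns)
print(encoded_columns)
print(encoded.shape)
```

Note that `Pclass` is treated as an ordinary number here. If it should be treated as a category instead, it can be listed explicitly through the `columns` parameter of `pd.get_dummies`.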

Model Building

Task 1: Splitting the training and test sets
from sklearn.model_selection import train_test_split

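It is common to extract X and y first and split afterwards; some steps need the unsplit data, and X and y remain available for those cases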
X = data
y = train[‘Survived’]

Split the dataset (stratified on y, with a fixed random seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

Check the shapes of the resulting sets
X_train.shape, X_test.shape
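The call above does not pass `test_size`, so scikit-learn falls back to its default of 25% test data, and `stratify=y` keeps the class ratio the same in both parts. A minimal sketch on synthetic arrays (the names `X_demo` and `y_demo` and the 60/40 class balance are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 60 of class 0 and 40 of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

# Same call shape as in the text: default test_size, stratified, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)          # 75 training rows, 25 test rows
print(y_tr.mean(), y_te.mean())        # share of class 1 in each part
```

Without `stratify`, a small or imbalanced dataset can end up with noticeably different class ratios in the two parts, which makes the test score harder to interpret.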

Task 2: Creating models

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

Logistic regression model with default parameters
lr = LogisticRegression()
lr.fit(X_train, y_train)

Check the score (accuracy) on the training and test sets
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))

Logistic regression model with adjusted parameters
lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)

print("Training set score: {:.2f}".format(lr2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))
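In scikit-learn, `C` is the inverse of the regularization strength, so `C=100` regularizes far less than the default `C=1.0`. One way to see the effect is to compare the size of the fitted coefficients. The sketch below uses synthetic data from `make_classification` (an assumption, not the Titanic features) and sets `max_iter=1000` only to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C means stronger L2 regularization, which shrinks the coefficients
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: ||coef|| = {coef_norms[C]:.3f}")
```

A larger `C` lets the model fit the training data more closely, which may raise the training score without helping the test score.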

Random forest classifier with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))

Random forest classifier with adjusted parameters
rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)

print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))
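Since scikit-learn 0.22 the default `n_estimators` is already 100, so the parameter that actually changes in the adjusted forest is `max_depth=5`. The following sketch, on synthetic data and with a smaller forest to keep it quick (both are assumptions for illustration), checks how deep the individual trees grow with and without the limit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Unlimited depth: trees grow until the leaves are pure
deep = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)
# Depth capped at 5, as in the adjusted model
shallow = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0).fit(X_demo, y_demo)

# Each fitted tree in estimators_ exposes get_depth()
deep_max = max(tree.get_depth() for tree in deep.estimators_)
shallow_max = max(tree.get_depth() for tree in shallow.estimators_)
print("max depth without limit:", deep_max)
print("max depth with max_depth=5:", shallow_max)
```

Capping the depth usually lowers the training score, and comparing it with the test score shows whether the unlimited forest was overfitting.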

Task 3: Outputting model predictions

Predict labels
pred = lr.predict(X_train)

The result is an array of 0s and 1s
pred[:10]

Predict the probability of each label
pred_proba = lr.predict_proba(X_train)
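The relationship between `predict` and `predict_proba` can be checked directly. In this sketch on synthetic data (the names and data are assumptions), `predict_proba` returns one column per class in the order of `classes_`, each row sums to 1, and the predicted label is the class with the larger probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)   # shape (n_samples, n_classes)
labels = clf.predict(X_demo)        # hard 0/1 labels

# Recover the labels from the probabilities: pick the most probable class
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(proba[:3])
print(labels[:3], labels_from_proba[:3])
```

Keeping the probabilities is useful when the default 0.5 cut-off is not the right one, for example when one kind of error is costlier than the other.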

Model Evaluation

Task 1: Cross-validation

from sklearn.model_selection import cross_val_score

lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)

k-fold cross-validation scores (one per fold)
scores

Average cross-validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))
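As a self-contained illustration of what `cross_val_score` returns, here is a sketch on synthetic data (the data and the `max_iter=1000` setting are assumptions). With `cv=10` the result is an array of ten accuracy values, one per held-out fold, and reporting the standard deviation next to the mean gives a sense of how stable the estimate is.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# The model is refit ten times, each time scored on a different held-out fold
cv_scores = cross_val_score(
    LogisticRegression(C=100, max_iter=1000), X_demo, y_demo, cv=10
)

print(cv_scores)
print("mean: {:.2f}, std: {:.2f}".format(cv_scores.mean(), cv_scores.std()))
```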

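Task 2: Confusion matrix

from sklearn.metrics import confusion_matrix

Train the model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)

Model predictions (on the training set)
pred = lr.predict(X_train)

Confusion matrix (rows are true labels, columns are predicted labels)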
confusion_matrix(y_train, pred)

from sklearn.metrics import classification_report

Precision, recall, and f1-score
print(classification_report(y_train, pred))
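To make the link between the confusion matrix and the report concrete, the following sketch uses eight hand-written labels (hypothetical values) and derives precision and recall for class 1 from the four cells of the matrix.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

precision_manual = tp / (tp + fp)   # of the predicted positives, how many are right
recall_manual = tp / (tp + fn)      # of the actual positives, how many are found

print(cm)
print(precision_manual, precision_score(y_true_demo, y_pred_demo))
print(recall_manual, recall_score(y_true_demo, y_pred_demo))
```

Here the matrix is `[[3, 1], [1, 3]]`, so precision and recall for class 1 are both 3/4.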

Task 3: ROC curve

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
Find the threshold closest to 0 (the default decision boundary of decision_function)
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
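The curve is often summarized by the area under it (AUC). As a minimal sketch on four hand-written scores (hypothetical values, no plotting), the area can be computed either from the `fpr`/`tpr` arrays with `auc` or directly with `roc_auc_score`.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and classifier scores
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# roc_curve sweeps the threshold from high to low and records (FPR, TPR) pairs
fpr_demo, tpr_demo, thr_demo = roc_curve(y_true_demo, y_score_demo)

auc_from_curve = auc(fpr_demo, tpr_demo)                 # trapezoidal area under the curve
auc_direct = roc_auc_score(y_true_demo, y_score_demo)    # same quantity in one call

print(fpr_demo, tpr_demo)
print(auc_from_curve, auc_direct)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking; this toy example gives 0.75.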