From 'Blind Men and the Elephant' to 'Democratic Voting': A Small Classification Project with Python and RandomForest - 平芜编程栈

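Imagine a panel of experts, each of whom can see only one facet of a problem, like the blind men feeling an elephant. Any single judgment is likely to be incomplete, but what happens if they vote? That is the core idea behind Random Forest: many decision trees, each trained on a different random view of the data, combine their predictions into one decision. In this post we use Python to walk through this "democratic" machine learning algorithm and complete a classification project from start to finish.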

1. Environment Setup and Data Loading

Good work starts with sharp tools, so we begin by setting up the Python environment:

pip install pandas scikit-learn matplotlib seaborn

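The classic Iris dataset is well suited to a first project. It contains 150 samples, each with 4 features (sepal length, sepal width, petal length, petal width), and the task is to predict which of 3 iris species each sample belongs to.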
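As a quick sanity check on those numbers, the following sketch loads the dataset and confirms its shape and class balance. It relies only on `load_iris` and NumPy; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_check = load_iris()

n_samples, n_features = iris_check.data.shape
class_counts = np.bincount(iris_check.target)  # number of samples per class label

print(n_samples, n_features)          # 150 4
print(list(iris_check.target_names))  # species names in label order
print(class_counts)                   # [50 50 50]
```

Each species contributes exactly 50 samples, so the dataset is perfectly balanced.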

from sklearn.datasets import load_iris
import pandas as pd

# Load the data
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
df['species'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Show the first 5 rows
print(df.head())

Sample output (first row shown):

sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target	species
5.1	3.5	1.4	0.2	0	setosa

2. Data Exploration and Preprocessing

Before modeling, we need a basic picture of the data:

Key summary statistics:

print(df.describe())

Class distribution check:

import matplotlib.pyplot as plt

df['species'].value_counts().plot(kind='bar')
plt.title('Class Distribution')
plt.show()

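Tip: random forests need no feature scaling and are fairly insensitive to how individual features are distributed, but it is still worth checking for severe class imbalance.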
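One way to make the imbalance check concrete is to compute class shares and the ratio between the largest and smallest class. The sketch below is illustrative: the threshold of 3 is an assumed rule of thumb and does not come from the article, while `class_weight='balanced'` is an existing `RandomForestClassifier` option.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
labels = pd.Series(data.target_names[data.target], name="species")

# Share of each class in the dataset
class_share = labels.value_counts(normalize=True)
print(class_share)

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = class_share.max() / class_share.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")

# Assumed rule of thumb (not from the article): flag ratios above 3
if imbalance_ratio > 3:
    print("Consider class_weight='balanced' in RandomForestClassifier")
```

For Iris the ratio is 1.0, so no reweighting is needed here.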

A feature correlation heatmap shows the relationships between features at a glance:

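import seaborn as sns

# Drop the string column first: corr() fails on non-numeric data in recent pandas
sns.heatmap(df.drop(columns='species').corr(), annot=True)
plt.show()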

3. Building the Random Forest Model

Now for the core step, assembling our "democratic decision committee":

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split the dataset
X = df[iris.feature_names]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the random forest
rf = RandomForestClassifier(
    n_estimators=100,   # number of trees
    max_depth=3,        # limits the complexity of each tree
    random_state=42,
    oob_score=True      # enable out-of-bag (OOB) evaluation
)

# Train the model
rf.fit(X_train, y_train)
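To see the "committee" at work, the sketch below looks inside a fitted forest through its `estimators_` attribute and compares individual trees with the combined result. One caveat on the voting metaphor: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities (soft voting) rather than counting hard votes. The variable names are illustrative, and the data is reloaded so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
forest.fit(X_tr, y_tr)

# Class probabilities from every individual tree:
# shape is (n_trees, n_test_samples, n_classes)
tree_probas = np.stack([tree.predict_proba(X_te) for tree in forest.estimators_])
print(tree_probas.shape)

# Accuracy of each tree judged on its own
tree_acc = [(forest.classes_[p.argmax(axis=1)] == y_te).mean() for p in tree_probas]
print(f"Single trees: min={min(tree_acc):.3f}, mean={np.mean(tree_acc):.3f}")

# Combine the trees by averaging their probabilities, then pick the top class
avg_proba = tree_probas.mean(axis=0)
ensemble_pred = forest.classes_[avg_proba.argmax(axis=1)]
print(f"Forest: {(ensemble_pred == y_te).mean():.3f}")

# The manual average reproduces the forest's own probabilities
matches_forest = np.allclose(avg_proba, forest.predict_proba(X_te))
print(matches_forest)
```

Comparing the weakest single tree with the forest shows how much the aggregation helps on a given split.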

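Key parameters:

Parameter	Description	Typical values
n_estimators	Number of decision trees	50-500
max_features	Maximum number of features considered at each split	'sqrt' or 0.5-0.8
max_depth	Maximum depth of each tree	3-10
min_samples_split	Minimum number of samples required to split a node	2-10
oob_score	Whether to evaluate on out-of-bag (OOB) samples	True/False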
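To get a feel for how one of these parameters behaves, the following sketch varies `n_estimators` and reads the out-of-bag accuracy for each setting. The candidate values are arbitrary choices for illustration, and the whole dataset is used because OOB evaluation needs no separate hold-out set.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

oob_by_size = {}
for n_trees in [25, 50, 100, 200]:
    model = RandomForestClassifier(
        n_estimators=n_trees,
        max_features="sqrt",  # features considered at each split
        oob_score=True,       # score each sample using trees that never saw it
        random_state=42,
    )
    model.fit(X, y)
    oob_by_size[n_trees] = model.oob_score_

for n_trees, score in oob_by_size.items():
    print(f"n_estimators={n_trees:>3}  OOB accuracy={score:.3f}")
```

On a dataset this small the scores usually level off quickly, which suggests that adding more trees mostly costs training time.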

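4. Model Evaluation and Interpretation

With the model trained, we need to evaluate how well it performs:

from sklearn.metrics import classification_report

# Predict on the test set
y_pred = rf.predict(X_test)

# Print the evaluation report
print(classification_report(y_test, y_pred))
print(f"OOB Score: {rf.oob_score_:.3f}")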
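The classification report summarizes precision and recall per class, but it does not show which species get confused with which. A confusion matrix fills that gap. The sketch below retrains an equivalent model so it runs on its own; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data, iris_data.target, test_size=0.3, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
clf.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1, 2])

print("true \\ predicted:", list(iris_data.target_names))
for name, row in zip(iris_data.target_names, cm):
    print(f"{name:<12} {row}")
```

Off-diagonal entries count the misclassified samples, so a perfect classifier yields a purely diagonal matrix.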

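Feature importance analysis: one powerful capability of random forests is that they quantify how important each feature is:

importances = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print(importances)

# Visualize
plt.barh(importances['feature'], importances['importance'])
plt.title('Feature Importance')
plt.show()

Typical output is likely to show petal length and petal width as the most discriminative features.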
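The `feature_importances_` values are impurity-based and computed from the training data. As a complementary check, not part of the original walkthrough, scikit-learn's `permutation_importance` measures how much the test score drops when a feature's values are shuffled. A minimal sketch, with the model retrained so the block is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data, iris_data.target, test_size=0.3, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
clf.fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the drop in test accuracy
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=42)

# Print features from most to least important
for i in result.importances_mean.argsort()[::-1]:
    print(
        f"{iris_data.feature_names[i]:<20} "
        f"{result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}"
    )
```

If both methods rank the petal measurements highest, that agreement adds some confidence to the interpretation.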

5. Model Optimization and Hyperparameter Tuning

To improve performance, we can tune the hyperparameters:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, None],
    'max_features': ['sqrt', 0.8]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5
)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

Note: when tuning, start with a small search range and widen it gradually to avoid excessive computation.
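One way to keep the computation bounded, offered here as an alternative to the exhaustive grid rather than as the article's method, is `RandomizedSearchCV`, which samples a fixed number of combinations. The sketch below tries 6 of the 18 combinations in the same grid and then scores the best model on the held-out test set; `n_iter=6` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
    "max_features": ["sqrt", 0.8],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=6,         # sample 6 of the 18 possible combinations
    cv=5,
    random_state=42,  # makes the sampled combinations reproducible
)
search.fit(X_tr, y_tr)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.3f}")

# The search refits the best model on all training data by default,
# so it can be scored directly on the untouched test set
test_score = search.score(X_te, y_te)
print(f"Test accuracy: {test_score:.3f}")
```

The cross-validation score guides the parameter choice, while the test score gives an estimate on data the search never saw.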

6. Practical Use and Deployment

A trained model can be saved and reused to predict on new data:

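import joblib

# Save the model
joblib.dump(rf, 'iris_rf_model.pkl')

# Load the model
loaded_model = joblib.load('iris_rf_model.pkl')

# Predict a new sample; a DataFrame with the training column names
# avoids scikit-learn's feature-name mismatch warning
new_sample = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=iris.feature_names)
prediction = loaded_model.predict(new_sample)
print(f"Predicted class: {iris.target_names[prediction][0]}")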
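In practice the load-and-predict steps are often wrapped in a small function that also reports class probabilities. The sketch below shows one possible shape for such a wrapper. `predict_species` is a hypothetical helper introduced here for illustration, and the model is written to a temporary directory so the example leaves no files behind.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_data = load_iris()
features = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(features, iris_data.target)

# Temporary location used only for this demonstration
model_path = os.path.join(tempfile.mkdtemp(), "iris_rf_model.pkl")
joblib.dump(model, model_path)


def predict_species(measurements, path=model_path):
    """Hypothetical helper: load the saved model, return (label, probabilities)."""
    loaded = joblib.load(path)
    # Reuse the training column names so the input matches what the model saw
    sample = pd.DataFrame([measurements], columns=iris_data.feature_names)
    proba = loaded.predict_proba(sample)[0]
    label = iris_data.target_names[proba.argmax()]
    return label, dict(zip(iris_data.target_names, proba.round(3)))


label, probabilities = predict_species([5.1, 3.5, 1.4, 0.2])
print(label, probabilities)
```

Returning probabilities alongside the label lets the caller treat low-confidence predictions differently. A real service would load the model once at startup instead of on every call.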

Deployment recommendations: