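Fundamental Machine Learning Algorithms
1. Technical Analysis
1.1 Overview of Machine Learning
Machine learning is at the core of data science. Its main learning types and tasks are listed below.

Learning types:
- Supervised learning: labeled data
- Unsupervised learning: unlabeled data
- Semi-supervised learning: partially labeled data
- Reinforcement learning: learning through interaction

Learning tasks:
- Classification: discrete output
- Regression: continuous output
- Clustering: grouping

1.2 Supervised Learning Algorithms
Supervised learning algorithms:
- Linear models: linear regression, logistic regression
- Tree models: decision trees, random forests
- Ensemble learning: XGBoost, LightGBM
- Support vector machines: SVM
- Neural networks: deep learning

Algorithm selection depends on:
- Data size
- Feature types
- Task type

1.3 Algorithm Comparison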
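| Algorithm | Suitable tasks | Complexity | Interpretability |
|---|---|---|---|
| Linear regression | Regression | Low | High |
| Logistic regression | Classification | Low | High |
| Decision tree | Classification/regression | Medium | High |
| Random forest | Classification/regression | Medium | Medium |
| XGBoost | Classification/regression | High | Medium |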
2. Core Implementations
2.1 Linear Regression
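
The implementation in this section fits the model by batch gradient descent. As a sketch of where its update rule comes from, assume the loss is the halved mean squared error over $n$ samples. The symbols $J$, $\alpha$ and $\hat{y}$ are notation introduced here for illustration; $\alpha$ corresponds to `learning_rate`.

```latex
\begin{aligned}
\hat{y} &= Xw + b \\
J(w, b) &= \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \\
\frac{\partial J}{\partial w} &= \frac{1}{n} X^{\top} (\hat{y} - y) \\
\frac{\partial J}{\partial b} &= \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \\
w &\leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
\end{aligned}
```

The two partial derivatives are the quantities named `dw` and `db` in the code.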

    import numpy as np


    class LinearRegression:
        """Linear regression fitted by batch gradient descent on the mean squared error."""

        def __init__(self, learning_rate=0.01, iterations=1000):
            self.learning_rate = learning_rate
            self.iterations = iterations
            self.weights = None
            self.bias = None

        def fit(self, X, y):
            n_samples, n_features = X.shape
            self.weights = np.zeros(n_features)
            self.bias = 0.0

            for _ in range(self.iterations):
                y_pred = np.dot(X, self.weights) + self.bias
                # Gradients of the loss with respect to the weights and the bias.
                dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
                db = (1 / n_samples) * np.sum(y_pred - y)
                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db

        def predict(self, X):
            return np.dot(X, self.weights) + self.bias

        def score(self, X, y):
            """Return the coefficient of determination (R²)."""
            y_pred = self.predict(X)
            ss_res = np.sum((y - y_pred) ** 2)
            ss_tot = np.sum((y - np.mean(y)) ** 2)
            return 1 - (ss_res / ss_tot)

2.2 Logistic Regression
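
The class in this section uses an update of the same form as linear regression. A short derivation sketch, assuming the loss is the average binary cross-entropy and the labels satisfy $y_i \in \{0, 1\}$; the symbols $\sigma$, $z$ and $J$ are notation introduced here for illustration.

```latex
\begin{aligned}
\sigma(z) &= \frac{1}{1 + e^{-z}}, \qquad \hat{y} = \sigma(Xw + b) \\
J(w, b) &= -\frac{1}{n} \sum_{i=1}^{n}
  \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \\
\frac{\partial J}{\partial w} &= \frac{1}{n} X^{\top} (\hat{y} - y) \\
\frac{\partial J}{\partial b} &= \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)
\end{aligned}
```

The sigmoid derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ cancels against the denominators that come from differentiating the logarithms, which is why the gradient keeps the simple $(\hat{y} - y)$ form.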

    class LogisticRegression:
        """Binary logistic regression fitted by batch gradient descent."""

        def __init__(self, learning_rate=0.01, iterations=1000):
            self.learning_rate = learning_rate
            self.iterations = iterations
            self.weights = None
            self.bias = None

        def _sigmoid(self, z):
            return 1 / (1 + np.exp(-z))

        def fit(self, X, y):
            n_samples, n_features = X.shape
            self.weights = np.zeros(n_features)
            self.bias = 0.0

            for _ in range(self.iterations):
                z = np.dot(X, self.weights) + self.bias
                y_pred = self._sigmoid(z)
                # Gradients of the loss with respect to the weights and the bias.
                dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
                db = (1 / n_samples) * np.sum(y_pred - y)
                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db

        def predict_proba(self, X):
            """Return the predicted probability of the positive class (1-D array)."""
            z = np.dot(X, self.weights) + self.bias
            return self._sigmoid(z)

        def predict(self, X, threshold=0.5):
            return (self.predict_proba(X) >= threshold).astype(int)

        def accuracy(self, X, y):
            predictions = self.predict(X)
            return np.mean(predictions == y)

2.3 Decision Tree
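
The tree in this section picks each split by information gain, the reduction in entropy from parent to children. As a worked example, here is a minimal standalone sketch on toy data; the function names and the data are hypothetical and chosen only for illustration.

```python
import numpy as np


def entropy(y):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(y, left_mask):
    """Entropy reduction from splitting y into y[left_mask] and y[~left_mask]."""
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # an empty side means the split separates nothing
    w_left = len(left) / len(y)
    return entropy(y) - (w_left * entropy(left) + (1 - w_left) * entropy(right))


# Hypothetical toy data: one feature, four samples, two balanced classes.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])

parent = entropy(y)                         # 1 bit, because the classes are balanced
gain_good = information_gain(y, x <= 2.0)   # separates the classes perfectly
gain_poor = information_gain(y, x <= 1.0)   # leaves one impure child
print(parent, gain_good, gain_poor)
```

The threshold at 2.0 recovers the full bit of parent entropy, while the threshold at 1.0 recovers only part of it, so a greedy split search prefers the former.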

    class DecisionTree:
        """Classification tree that splits on information gain (entropy).

        Labels must be non-negative integers because leaves use np.bincount.
        """

        def __init__(self, max_depth=None, min_samples_split=2):
            self.max_depth = max_depth
            self.min_samples_split = min_samples_split
            self.tree = None

        def _entropy(self, y):
            _, counts = np.unique(y, return_counts=True)
            probabilities = counts / len(y)
            return -np.sum(probabilities * np.log2(probabilities))

        def _information_gain(self, X, y, feature_idx, threshold):
            left_mask = X[:, feature_idx] <= threshold
            right_mask = ~left_mask
            y_left, y_right = y[left_mask], y[right_mask]

            # A split that leaves one side empty carries no information.
            if len(y_left) == 0 or len(y_right) == 0:
                return 0

            parent_entropy = self._entropy(y)
            weight_left = len(y_left) / len(y)
            weight_right = len(y_right) / len(y)
            children_entropy = (weight_left * self._entropy(y_left)
                                + weight_right * self._entropy(y_right))
            return parent_entropy - children_entropy

        def _best_split(self, X, y):
            best_gain = -1
            best_feature = None
            best_threshold = None

            # Exhaustive search over every feature and every observed value.
            for feature_idx in range(X.shape[1]):
                thresholds = np.unique(X[:, feature_idx])
                for threshold in thresholds:
                    gain = self._information_gain(X, y, feature_idx, threshold)
                    if gain > best_gain:
                        best_gain = gain
                        best_feature = feature_idx
                        best_threshold = threshold

            return best_feature, best_threshold, best_gain

        def _build_tree(self, X, y, depth=0):
            # Stop on a pure node or when too few samples remain to split.
            if len(np.unique(y)) == 1 or len(y) < self.min_samples_split:
                return np.bincount(y).argmax()

            if self.max_depth is not None and depth >= self.max_depth:
                return np.bincount(y).argmax()

            feature, threshold, gain = self._best_split(X, y)

            # No useful split (<= also guards against floating-point noise).
            if gain <= 0:
                return np.bincount(y).argmax()

            left_mask = X[:, feature] <= threshold
            right_mask = ~left_mask

            left_tree = self._build_tree(X[left_mask], y[left_mask], depth + 1)
            right_tree = self._build_tree(X[right_mask], y[right_mask], depth + 1)

            return {
                'feature': feature,
                'threshold': threshold,
                'left': left_tree,
                'right': right_tree
            }

        def fit(self, X, y):
            self.tree = self._build_tree(X, y)

        def _predict_sample(self, x, tree):
            # Leaves are stored as plain class labels, internal nodes as dicts.
            if not isinstance(tree, dict):
                return tree

            if x[tree['feature']] <= tree['threshold']:
                return self._predict_sample(x, tree['left'])
            else:
                return self._predict_sample(x, tree['right'])

        def predict(self, X):
            return np.array([self._predict_sample(x, self.tree) for x in X])

2.4 Random Forest
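
The class in this section relies on two ideas: training each tree on a bootstrap sample, and combining the trees by majority vote. Both can be sketched in isolation as follows. The helper names, the seed and the toy vote matrix are hypothetical, and the sketch uses NumPy's `default_rng` generator rather than the global `np.random.choice` call.

```python
import numpy as np


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return rng.integers(0, n_samples, size=n_samples)


def majority_vote(predictions):
    """Combine predictions of shape (n_trees, n_samples) into one label per sample.

    Labels are assumed to be non-negative integers, as np.bincount requires.
    """
    return np.array([np.bincount(col).argmax() for col in predictions.T])


rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
idx = bootstrap_indices(1000, rng)
# Fraction of distinct rows in the sample; about 0.63 in expectation.
unique_fraction = len(np.unique(idx)) / 1000

# Hypothetical predictions from three trees (rows) on three samples (columns).
votes = np.array([
    [0, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
])
combined = majority_vote(votes)  # array([1, 1, 0])
print(unique_fraction, combined)
```

Because roughly a third of the rows are left out of each bootstrap sample, the trees see different data and make partly independent errors, which is what the vote averages out.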
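
    class RandomForest:
        """Bagged ensemble of DecisionTree classifiers combined by majority vote.

        Each tree is trained on a bootstrap sample of the rows. Unlike a full
        random forest, this version does not subsample features at each split.
        """

        def __init__(self, n_trees=100, max_depth=None, min_samples_split=2):
            self.n_trees = n_trees
            self.max_depth = max_depth
            self.min_samples_split = min_samples_split
            self.trees = []

        def fit(self, X, y):
            self.trees = []  # reset so that refitting does not accumulate trees
            n_samples = X.shape[0]

            for _ in range(self.n_trees):
                # Bootstrap: draw n_samples rows with replacement.
                indices = np.random.choice(n_samples, n_samples, replace=True)
                X_sample = X[indices]
                y_sample = y[indices]

                tree = DecisionTree(max_depth=self.max_depth,
                                    min_samples_split=self.min_samples_split)
                tree.fit(X_sample, y_sample)
                self.trees.append(tree)

        def predict(self, X):
            # Shape (n_trees, n_samples); vote across trees for each sample.
            predictions = np.array([tree.predict(X) for tree in self.trees])
            return np.array([np.bincount(preds).argmax() for preds in predictions.T])

3. Performance Comparison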
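3.1 Supervised Learning Algorithm Comparison
| Algorithm | Training speed | Prediction speed | Accuracy |
|---|---|---|---|
| Linear regression | Fast | Fast | Medium |
| Logistic regression | Fast | Fast | Medium |
| Decision tree | Medium | Fast | Medium |
| Random forest | Slow | Medium | High |
| XGBoost | Slow | Medium | Very high |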
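3.2 Algorithm Selection Guide
| Data size | Recommended algorithm | Reason |
|---|---|---|
| Small (<10,000 samples) | Decision tree/random forest | Stable |
| Medium (10,000 to 1,000,000 samples) | XGBoost | Balanced |
| Large (>1,000,000 samples) | LightGBM | Efficient |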
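3.3 Model Evaluation Metrics
| Task | Metric | Description |
|---|---|---|
| Classification | Accuracy | Proportion of correct predictions |
| Classification | F1 score | Balances precision and recall |
| Regression | RMSE | Root mean squared error |
| Regression | R² | Proportion of variance explained |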
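
The four metrics in the table can be computed directly with NumPy. The following is a minimal sketch for binary classification and single-output regression; the function names and the toy arrays are hypothetical and used only for illustration.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    return float(np.mean(y_true == y_pred))


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)


# Hypothetical classification labels and predictions.
y_true_cls = np.array([1, 0, 1, 1, 0, 0])
y_pred_cls = np.array([1, 0, 0, 1, 1, 0])
print(accuracy(y_true_cls, y_pred_cls), f1_score(y_true_cls, y_pred_cls))

# Hypothetical regression targets and predictions.
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_reg = np.array([2.0, 5.0, 8.0, 9.0])
print(rmse(y_true_reg, y_pred_reg), r2_score(y_true_reg, y_pred_reg))
```

In the classification example four of six predictions are correct, and precision and recall are both 2/3, so accuracy and F1 coincide. On imbalanced data the two usually diverge, which is the reason to report both.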
4. Best Practices
4.1 Algorithm Selection Workflow
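
    def choose_algorithm(X, y, task_type='classification'):
        """Suggest an algorithm name from the number of samples and the task type.

        The cut-offs are rules of thumb, not hard limits.
        """
        n_samples = X.shape[0]

        if n_samples < 1000:
            # Very small data: start with a simple, interpretable linear model.
            if task_type == 'classification':
                return 'LogisticRegression'
            return 'LinearRegression'
        if n_samples < 100000:
            return 'XGBoost'
        return 'LightGBM'

4.2 Model Training Workflow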
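
The training function in this section expects data that is already split into training and test sets. A minimal sketch of such a split, assuming NumPy arrays; the helper name `train_test_split_np` and its parameters are hypothetical and not part of any library.

```python
import numpy as np


def train_test_split_np(X, y, test_size=0.2, seed=0):
    """Shuffle the rows and split them into training and test subsets."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    order = rng.permutation(n_samples)           # random row order
    n_test = int(round(n_samples * test_size))   # number of held-out rows
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Hypothetical toy data: 10 samples, 2 features, alternating labels.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split_np(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Note that the helper returns the arrays in the order `X_train, X_test, y_train, y_test`, whereas `train_model` takes `X_train, y_train, X_test, y_test`, so the arguments need to be reordered when calling it.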
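
    def train_model(X_train, y_train, X_test, y_test, model):
        """Fit a classifier and print its training and test accuracy."""
        model.fit(X_train, y_train)

        y_pred_train = model.predict(X_train)
        y_pred_test = model.predict(X_test)

        if hasattr(model, 'predict_proba'):
            # Positive-class probabilities, for use in probability-based metrics.
            # predict_proba may return a 1-D array (as the LogisticRegression
            # class in section 2.2 does) or an (n_samples, n_classes) array,
            # so handle both shapes.
            train_proba = np.asarray(model.predict_proba(X_train))
            test_proba = np.asarray(model.predict_proba(X_test))
            if train_proba.ndim == 2:
                train_proba = train_proba[:, 1]
                test_proba = test_proba[:, 1]

        print(f"Training accuracy: {np.mean(y_pred_train == y_train):.4f}")
        print(f"Test accuracy: {np.mean(y_pred_test == y_test):.4f}")

        return model

5. Summary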
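Machine learning algorithms are at the core of data science:
- Linear models: simple and interpretable
- Tree models: capture nonlinear relationships
- Ensemble learning: improves accuracy
- Algorithm selection: driven by data size and task type
Key takeaways from the comparisons above:
- XGBoost is often the first choice for classification tasks
- Linear models are the most interpretable
- Random forests are stable
- Start with a simple model and increase complexity step by step
Understanding how the algorithms work makes it easier to apply and tune models.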