Fundamental Machine Learning Algorithms - 平芜编程栈

1. Technical Analysis

1.1 Overview of Machine Learning

Machine learning sits at the core of data science:

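Types of machine learning
  Supervised learning: labeled data
  Unsupervised learning: unlabeled data
  Semi-supervised learning: partially labeled data
  Reinforcement learning: learning through interaction

Learning tasks
  Classification: discrete output
  Regression: continuous output
  Clustering: grouping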

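1.2 Supervised Learning Algorithms

Supervised learning algorithms
  Linear models: linear regression, logistic regression
  Tree models: decision trees, random forests
  Ensemble learning (gradient boosting): XGBoost, LightGBM
  Support vector machines: SVM
  Neural networks: deep learning

Factors in algorithm selection
  Data size
  Feature types
  Task type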

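1.3 Algorithm Comparison

Algorithm	Suitable tasks	Complexity	Interpretability
Linear regression	Regression	Low	High
Logistic regression	Classification	Low	High
Decision tree	Classification/Regression	Medium	High
Random forest	Classification/Regression	Medium	Medium
XGBoost	Classification/Regression	High	Medium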

2. Core Implementations

2.1 Linear Regression

import numpy as np


class LinearRegression:
    """Linear regression trained with batch gradient descent on squared error."""

    def __init__(self, learning_rate=0.01, iterations=1000):
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for _ in range(self.iterations):
            y_pred = np.dot(X, self.weights) + self.bias

            # Gradients of the loss 1/(2n) * sum((y_pred - y)^2)
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)

            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

    def score(self, X, y):
        # Coefficient of determination R^2 (undefined when y is constant)
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)
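As a sanity check on the gradient-descent fit above, the following sketch trains the same kind of model two ways and compares the results: a compact gradient-descent loop and NumPy's closed-form least-squares solver. The helper names `gd_linear_fit` and `lstsq_linear_fit`, the synthetic data, and the hyperparameters are assumptions chosen for illustration; they are not part of the original implementation.

```python
import numpy as np


def gd_linear_fit(X, y, learning_rate=0.1, iterations=2000):
    """Hypothetical helper: batch gradient descent for y ~ X @ w + b."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(iterations):
        residual = X @ w + b - y
        w -= learning_rate * (X.T @ residual) / n_samples
        b -= learning_rate * residual.mean()
    return w, b


def lstsq_linear_fit(X, y):
    """Hypothetical helper: closed-form fit via least squares."""
    # Append a column of ones so the last coefficient acts as the bias.
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]


# Synthetic data (assumed for the demo): y = 3*x1 - 1.5*x2 + 2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.5]) + 2.0 + rng.normal(scale=0.1, size=200)

w_gd, b_gd = gd_linear_fit(X, y)
w_ls, b_ls = lstsq_linear_fit(X, y)

print("gradient descent:", w_gd, b_gd)
print("least squares:   ", w_ls, b_ls)
```

If the two solutions disagree noticeably on your own data, the usual suspects are a learning rate that is too small, too few iterations, or unscaled features.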

2.2 Logistic Regression

import numpy as np


class LogisticRegression:
    """Binary logistic regression trained with batch gradient descent."""

    def __init__(self, learning_rate=0.01, iterations=1000):
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.weights = None
        self.bias = None

    def _sigmoid(self, z):
        # Clip z so np.exp does not overflow for large negative inputs
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        # y is expected to contain 0/1 labels
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for _ in range(self.iterations):
            z = np.dot(X, self.weights) + self.bias
            y_pred = self._sigmoid(z)

            # Gradients of the mean cross-entropy loss
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)

            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict_proba(self, X):
        # Returns a 1-D array with the probability of class 1
        z = np.dot(X, self.weights) + self.bias
        return self._sigmoid(z)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

    def accuracy(self, X, y):
        predictions = self.predict(X)
        return np.mean(predictions == y)
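The update rule above uses the same gradient expression as linear regression, with the sigmoid applied to the prediction. One way to convince yourself that this is the gradient of the mean cross-entropy loss is a numerical check. The sketch below compares the analytic gradient against central finite differences. All function names and the random data are assumptions made for the illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(w, b, X, y):
    """Mean binary cross-entropy for 0/1 labels."""
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


def analytic_grad(w, b, X, y):
    """Gradient in the form used by the training loop: X^T (p - y) / n."""
    err = sigmoid(X @ w + b) - y
    return X.T @ err / len(y), err.mean()


def numeric_grad(w, b, X, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    gw = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        gw[j] = (log_loss(w + step, b, X, y) - log_loss(w - step, b, X, y)) / (2 * eps)
    gb = (log_loss(w, b + eps, X, y) - log_loss(w, b - eps, X, y)) / (2 * eps)
    return gw, gb


# Small random problem (assumed for the demo)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3) * 0.1
b = 0.05

gw_a, gb_a = analytic_grad(w, b, X, y)
gw_n, gb_n = numeric_grad(w, b, X, y)
print("max abs difference:", max(np.max(np.abs(gw_a - gw_n)), abs(gb_a - gb_n)))
```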

2.3 Decision Tree

import numpy as np


class DecisionTree:
    """Classification tree with binary splits chosen by information gain.

    Labels must be non-negative integers, because leaves use np.bincount.
    """

    def __init__(self, max_depth=None, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree = None

    def _entropy(self, y):
        # Shannon entropy in bits; only classes present in y are counted,
        # so log2 never receives zero
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        return -np.sum(probabilities * np.log2(probabilities))

    def _information_gain(self, X, y, feature_idx, threshold):
        left_mask = X[:, feature_idx] <= threshold
        right_mask = ~left_mask

        # A split that leaves one side empty carries no information
        if len(y[left_mask]) == 0 or len(y[right_mask]) == 0:
            return 0

        parent_entropy = self._entropy(y)
        left_entropy = self._entropy(y[left_mask])
        right_entropy = self._entropy(y[right_mask])

        weight_left = len(y[left_mask]) / len(y)
        weight_right = len(y[right_mask]) / len(y)

        return parent_entropy - (weight_left * left_entropy + weight_right * right_entropy)

    def _best_split(self, X, y):
        # Exhaustive search over every feature and every observed value
        best_gain = -1
        best_feature = None
        best_threshold = None

        for feature_idx in range(X.shape[1]):
            thresholds = np.unique(X[:, feature_idx])
            for threshold in thresholds:
                gain = self._information_gain(X, y, feature_idx, threshold)
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold

        return best_feature, best_threshold, best_gain

    def _build_tree(self, X, y, depth=0):
        # Leaf: node is pure or too small to split
        if len(np.unique(y)) == 1 or len(y) < self.min_samples_split:
            return np.bincount(y).argmax()

        # Leaf: depth limit reached
        if self.max_depth is not None and depth >= self.max_depth:
            return np.bincount(y).argmax()

        feature, threshold, gain = self._best_split(X, y)

        # Leaf: no split improves on the parent
        if feature is None or gain <= 0:
            return np.bincount(y).argmax()

        left_mask = X[:, feature] <= threshold
        right_mask = ~left_mask

        left_tree = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_tree = self._build_tree(X[right_mask], y[right_mask], depth + 1)

        return {
            'feature': feature,
            'threshold': threshold,
            'left': left_tree,
            'right': right_tree
        }

    def fit(self, X, y):
        self.tree = self._build_tree(X, y)

    def _predict_sample(self, x, tree):
        # Internal nodes are dicts; leaves are class labels
        if not isinstance(tree, dict):
            return tree

        if x[tree['feature']] <= tree['threshold']:
            return self._predict_sample(x, tree['left'])
        else:
            return self._predict_sample(x, tree['right'])

    def predict(self, X):
        return np.array([self._predict_sample(x, self.tree) for x in X])
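To make the split criterion concrete, here is a small worked example of entropy and information gain on a single feature. The standalone functions `entropy` and `information_gain` mirror the private methods above but are written separately for illustration, and the six-sample dataset is an assumption.

```python
import numpy as np


def entropy(y):
    """Shannon entropy (in bits) of an integer label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(x, y, threshold):
    """Entropy reduction from splitting on x <= threshold."""
    left = y[x <= threshold]
    right = y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - child


# Toy data (assumed): the classes separate cleanly between x = 3 and x = 4
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Evaluate every observed value as a candidate threshold
gains = {t: information_gain(x, y, t) for t in x.tolist()}
best_threshold = max(gains, key=gains.get)

for t, g in gains.items():
    print(f"threshold {t}: gain {g:.3f}")
print("best threshold:", best_threshold)
```

The parent node holds two equally frequent classes, so its entropy is exactly 1 bit. Splitting at 3.0 yields two pure children and recovers the whole bit, while the largest value (6.0) leaves the right side empty and scores zero.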

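2.4 Random Forest

import numpy as np


class RandomForest:
    """Ensemble of DecisionTree models (section 2.3) trained on bootstrap samples.

    Only the rows are resampled; every tree still considers all features at each split.
    """

    def __init__(self, n_trees=100, max_depth=None, min_samples_split=2):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.trees = []

    def fit(self, X, y):
        # Reset so that calling fit twice does not accumulate trees
        self.trees = []
        n_samples = X.shape[0]

        for _ in range(self.n_trees):
            # Bootstrap sample: draw n_samples rows with replacement
            indices = np.random.choice(n_samples, n_samples, replace=True)
            X_sample = X[indices]
            y_sample = y[indices]

            tree = DecisionTree(max_depth=self.max_depth,
                                min_samples_split=self.min_samples_split)
            tree.fit(X_sample, y_sample)
            self.trees.append(tree)

    def predict(self, X):
        # predictions has shape (n_trees, n_samples); take a majority vote per sample
        predictions = np.array([tree.predict(X) for tree in self.trees])
        return np.array([np.bincount(preds).argmax() for preds in predictions.T])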
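Two mechanics in the forest above are worth seeing in isolation: bootstrap sampling and majority voting. The sketch below draws one bootstrap sample and measures how many distinct rows it contains (for large n this approaches 1 - 1/e, about 63%), then combines a hand-written vote matrix. The sample size, seed, vote matrix, and the name `majority_vote` are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(42)

# Bootstrap: draw n row indices with replacement
n_samples = 10_000
indices = rng.choice(n_samples, size=n_samples, replace=True)

# Fraction of distinct rows that made it into the sample
unique_fraction = np.unique(indices).size / n_samples
print(f"distinct rows in bootstrap sample: {unique_fraction:.3f}")
print(f"left out (out-of-bag) rows:        {1 - unique_fraction:.3f}")


def majority_vote(predictions):
    """Combine integer predictions of shape (n_trees, n_samples) per sample."""
    return np.array([np.bincount(col).argmax() for col in predictions.T])


# Three hypothetical trees voting on three samples
votes = np.array([
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
])
combined = majority_vote(votes)
print("combined prediction:", combined)
```

Because each tree sees a different subset of rows, their errors are only partly correlated, which is what lets the vote outperform a single tree.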

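3. Performance Comparison

3.1 Comparison of Supervised Learning Algorithms

Algorithm	Training speed	Prediction speed	Accuracy
Linear regression	Fast	Fast	Medium
Logistic regression	Fast	Fast	Medium
Decision tree	Medium	Fast	Medium
Random forest	Slow	Medium	High
XGBoost	Slow	Medium	Very high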

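3.2 Algorithm Selection Guide

Data size	Recommended algorithm	Reason
Small (< 10,000 samples)	Decision tree / random forest	Stable
Medium (10,000 to 1,000,000 samples)	XGBoost	Balanced
Large (> 1,000,000 samples)	LightGBM	Efficient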

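3.3 Model Evaluation Metrics

Task	Metric	Description
Classification	Accuracy	Proportion of correct predictions
Classification	F1 score	Balances precision and recall
Regression	RMSE	Root mean squared error
Regression	R²	Proportion of variance explained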
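The four metrics in the table can be computed directly with NumPy. The sketch below is a minimal version under stated assumptions: the function names are hypothetical, the F1 score handles only the binary case, and the tiny arrays are made up so the results can be checked by hand.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Proportion of predictions that match the labels."""
    return float(np.mean(y_true == y_pred))


def f1_score_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination (undefined when y_true is constant)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)


# Classification example: 3 true positives, 1 false positive, 1 false negative
labels = np.array([1, 0, 1, 1, 0, 1])
preds = np.array([1, 0, 0, 1, 1, 1])
print("accuracy:", accuracy(labels, preds))
print("F1:      ", f1_score_binary(labels, preds))

# Regression example: errors of 1, 0 and -2
targets = np.array([3.0, 5.0, 7.0])
estimates = np.array([2.0, 5.0, 9.0])
print("RMSE:    ", rmse(targets, estimates))
print("R^2:     ", r2(targets, estimates))
```

In the classification example, precision and recall are both 3/4, so the F1 score is 0.75, while accuracy is 4/6. In the regression example, the squared errors sum to 5 against a total sum of squares of 8, giving R² = 0.375.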

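4. Best Practices

4.1 Algorithm Selection Workflow

def choose_algorithm(X, y, task_type='classification'):
    """Suggest an algorithm name from the sample count (a rough rule of thumb)."""
    n_samples = X.shape[0]

    if n_samples < 1000:
        # Very small datasets: start with a simple linear baseline
        if task_type == 'classification':
            return 'LogisticRegression'
        else:
            return 'LinearRegression'
    elif n_samples < 100000:
        return 'XGBoost'
    else:
        return 'LightGBM'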

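4.2 Model Training Workflow

import numpy as np


def train_model(X_train, y_train, X_test, y_test, model):
    """Fit a classifier and print its accuracy on the training and test sets."""
    model.fit(X_train, y_train)

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    if hasattr(model, 'predict_proba'):
        # Positive-class probabilities. Some models return shape (n, 2);
        # the LogisticRegression in section 2.2 returns shape (n,).
        # They are not used in the accuracy report below.
        train_proba = np.asarray(model.predict_proba(X_train))
        test_proba = np.asarray(model.predict_proba(X_test))
        if train_proba.ndim == 2:
            train_proba = train_proba[:, 1]
            test_proba = test_proba[:, 1]

    print(f"Training accuracy: {np.mean(y_pred_train == y_train):.4f}")
    print(f"Test accuracy: {np.mean(y_pred_test == y_test):.4f}")

    return model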
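The training function above expects the data to be split already. A minimal way to produce that split with NumPy alone is sketched below. The helper name `train_test_split_np`, its parameters, and the toy data are assumptions for illustration. Note that it returns values in the order `X_train, X_test, y_train, y_test`, so reorder them when passing to `train_model`.

```python
import numpy as np


def train_test_split_np(X, y, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle the rows once, then cut off a test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Toy data (assumed): 100 rows, label is 1 when the feature is at least 50
X = np.arange(100, dtype=float).reshape(100, 1)
y = (X[:, 0] >= 50).astype(int)

X_train, X_test, y_train, y_test = train_test_split_np(X, y, test_size=0.2, seed=0)
print("train:", X_train.shape, "test:", X_test.shape)
```

Shuffling before the cut matters whenever the rows are ordered, as they are here. Without it, the test set would contain only one class.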