序列标注:NER 与 POS Tagging 实战
1. 技术分析
1.1 序列标注任务概述
序列标注是为序列中每个元素分配标签的任务:
序列标注任务类型 POS Tagging: 词性标注 NER: 命名实体识别 Chunking: 短语切分 BIOES: 实体边界标注
1.2 标注格式对比
| 格式 | 描述 | 示例 |
|---|
| BIO | 开始、内部、外部 | B-PER, I-PER, O |
| BIOES | 开始、内部、外部、结束、单字 | B-PER, I-PER, E-PER, S-PER, O |
| IOBES | 同 BIOES | 同上 |
1.3 序列标注模型对比
| 模型 | 特点 | 效果 | 适用场景 |
|---|
| HMM | 概率模型 | 中 | 小规模 |
| CRF | 条件随机场 | 高 | 中等规模 |
| BiLSTM-CRF | 深度学习 | 很高 | 大规模 |
| BERT-CRF | 预训练+CRF | 最高 | 大规模 |
2. 核心功能实现
2.1 CRF 实现
import torch import torch.nn as nn import torch.nn.functional as F class CRF(nn.Module): def __init__(self, num_tags): super().__init__() self.num_tags = num_tags self.transitions = nn.Parameter(torch.randn(num_tags, num_tags)) self.start_transitions = nn.Parameter(torch.randn(num_tags)) self.end_transitions = nn.Parameter(torch.randn(num_tags)) def forward(self, emissions, tags, mask=None): log_likelihood = self._compute_log_likelihood(emissions, tags, mask) return log_likelihood def _compute_log_likelihood(self, emissions, tags, mask): batch_size, seq_len, num_tags = emissions.size() mask = mask if mask is not None else torch.ones(batch_size, seq_len, dtype=torch.bool) score = torch.zeros(batch_size, device=emissions.device) for i in range(seq_len): if i == 0: score += self.start_transitions[tags[:, i]] + emissions[:, i, tags[:, i]] else: score += self.transitions[tags[:, i-1], tags[:, i]] + emissions[:, i, tags[:, i]] score *= mask[:, i] score += self.end_transitions[tags[:, -1]] log_partition = self._compute_log_partition(emissions, mask) return torch.sum(score - log_partition) def _compute_log_partition(self, emissions, mask): batch_size, seq_len, num_tags = emissions.size() alpha = self.start_transitions + emissions[:, 0] for i in range(1, seq_len): alpha = (alpha.unsqueeze(1) + self.transitions + emissions[:, i].unsqueeze(0)).logsumexp(dim=0) return (alpha + self.end_transitions).logsumexp(dim=1).sum() def decode(self, emissions, mask=None): batch_size, seq_len, num_tags = emissions.size() if mask is None: mask = torch.ones(batch_size, seq_len, dtype=torch.bool) scores = torch.zeros(batch_size, num_tags, device=emissions.device) backpointers = torch.zeros(batch_size, seq_len, num_tags, dtype=torch.long, device=emissions.device) scores = self.start_transitions + emissions[:, 0] for i in range(1, seq_len): scores, backpointers[:, i] = (scores.unsqueeze(1) + self.transitions).max(dim=0) scores += emissions[:, i] scores += self.end_transitions best_tags = [] for i in range(batch_size): best_tag = scores[i].argmax().item() best_tags.append([best_tag]) for j in range(seq_len - 1, 0, -1): best_tag = backpointers[i, j, best_tag].item() best_tags[i].append(best_tag) best_tags[i] = best_tags[i][::-1] return best_tags
2.2 BiLSTM-CRF 实现
class BiLSTMCRF(nn.Module): def __init__(self, vocab_size, embedding_dim, hidden_dim, num_tags): super().__init__() self.embedding = nn.Embedding(vocab_size, embedding_dim) self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2, bidirectional=True, batch_first=True) self.fc = nn.Linear(hidden_dim, num_tags) self.crf = CRF(num_tags) def forward(self, x, tags=None, mask=None): x = self.embedding(x) x, _ = self.lstm(x) emissions = self.fc(x) if tags is not None: loss = -self.crf(emissions, tags, mask) return loss else: predictions = self.crf.decode(emissions, mask) return predictions class BertCRF(nn.Module): def __init__(self, model_name, num_tags): super().__init__() from transformers import BertModel self.bert = BertModel.from_pretrained(model_name) self.fc = nn.Linear(self.bert.config.hidden_size, num_tags) self.crf = CRF(num_tags) def forward(self, input_ids, attention_mask, tags=None): outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask) emissions = self.fc(outputs.last_hidden_state) if tags is not None: mask = attention_mask.bool() loss = -self.crf(emissions, tags, mask) return loss else: mask = attention_mask.bool() predictions = self.crf.decode(emissions, mask) return predictions
2.3 序列标注训练与评估
class SequenceLabelingTrainer: def __init__(self, model, optimizer, scheduler): self.model = model self.optimizer = optimizer self.scheduler = scheduler def train_step(self, batch): self.optimizer.zero_grad() input_ids = batch['input_ids'] tags = batch['tags'] mask = batch.get('mask') loss = self.model(input_ids, tags, mask) loss.backward() self.optimizer.step() self.scheduler.step() return loss.item() def evaluate(self, dataloader): self.model.eval() predictions = [] labels = [] with torch.no_grad(): for batch in dataloader: input_ids = batch['input_ids'] tags = batch['tags'] mask = batch.get('mask') preds = self.model(input_ids, mask=mask) for i in range(len(preds)): seq_len = mask[i].sum().item() if mask is not None else len(preds[i]) predictions.extend(preds[i][:seq_len]) labels.extend(tags[i][:seq_len].tolist()) return self._compute_metrics(predictions, labels) def _compute_metrics(self, predictions, labels): from sklearn.metrics import classification_report return classification_report(labels, predictions)
3. 性能对比
3.1 序列标注模型对比
| 模型 | F1分数 | 训练时间 | 推理速度 | 数据需求 |
|---|
| HMM | 75% | 快 | 很快 | 小 |
| CRF | 85% | 中 | 快 | 中 |
| BiLSTM-CRF | 92% | 慢 | 中 | 中 |
| BERT-CRF | 96% | 很慢 | 慢 | 大 |
3.2 不同数据集的表现
| 数据集 | 规模 | CRF | BiLSTM-CRF | BERT-CRF |
|---|
| CoNLL-2003 | 40K | 89% | 94% | 97% |
| OntoNotes | 150K | 92% | 95% | 98% |
| 中文NER | 10K | 85% | 90% | 94% |
3.3 标注格式影响
| 格式 | 边界识别 | 单实体 | 嵌套实体 |
|---|
| BIO | 中 | 好 | 差 |
| BIOES | 好 | 好 | 中 |
| IOB2 | 中 | 好 | 差 |
4. 最佳实践
4.1 序列标注模型选择
def select_sequence_labeler(data_size, language): if data_size < 1000: return CRFModel() elif data_size < 10000: return BiLSTMCRF(vocab_size=10000, embedding_dim=100, hidden_dim=200, num_tags=10) else: model_name = 'bert-base-chinese' if language == 'chinese' else 'bert-base-cased' return BertCRF(model_name, num_tags=10) class SequenceLabelerFactory: @staticmethod def create(config): if config['type'] == 'crf': return CRFModel(**config['params']) elif config['type'] == 'bilstm_crf': return BiLSTMCRF(**config['params']) elif config['type'] == 'bert_crf': return BertCRF(**config['params'])
4.2 序列标注数据处理
class SequenceLabelingDataProcessor: def __init__(self, tokenizer, tag_map): self.tokenizer = tokenizer self.tag_map = tag_map self.num_tags = len(tag_map) def process(self, texts, tags): processed = [] for text, tag_sequence in zip(texts, tags): tokens = self.tokenizer.tokenize(text) input_ids = self.tokenizer.convert_tokens_to_ids(tokens) aligned_tags = [] token_idx = 0 for char, tag in zip(text, tag_sequence): if token_idx >= len(tokens): break if tokens[token_idx].startswith('##'): aligned_tags.append(tag) else: aligned_tags.append(tag) if not tokens[token_idx].startswith('##'): token_idx += 1 aligned_tags += ['O'] * (len(tokens) - len(aligned_tags)) processed.append({ 'input_ids': input_ids, 'tags': [self.tag_map[t] for t in aligned_tags], 'mask': [1] * len(input_ids) }) return processed
5. 总结
序列标注是 NLP 重要任务:
- CRF:经典模型,适合小规模数据
- BiLSTM-CRF:深度学习模型,效果较好
- BERT-CRF:预训练+CRF,效果最佳
- 标注格式:BIOES 比 BIO 更准确
对比数据如下:
- BERT-CRF 比 CRF 提升约 10% F1
- 中文 NER 需要针对分词进行特殊处理
- 推荐在大规模数据上使用 BERT-CRF
- 小数据场景 CRF 可能更稳定