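Build an automatic classifier in Python for the cultural tone of apparel copy. It uses NLP techniques to sort copy into four styles, Guofeng (国风, Chinese-style), French (法式), Japanese (日系), and American (美式), and presents the analysis from a neutral standpoint.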
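I. Practical Application Scenario
In the course "Fashion Industry and Brand Innovation", "brand tone consistency" is a core topic. In copy, it shows up as follows:
- Guofeng copy favors the poetic, evocative imagery, and deliberate emptiness (留白), e.g. "一袭青衣染就江南烟雨" and "新中式美学,淡雅如诗".
- French-style copy favors romance, languor, and refinement, e.g. "Parisian chic, effortless elegance" and "法式慵懒,自带高级感".
- Japanese-style copy favors restraint, functionality, and atmosphere, e.g. "less is more 的日式哲学" and "干净利落的东京街头".
- American-style copy favors confidence, directness, and impact, e.g. "Own the room" and "Bold, unapologetic, iconic".
Brands face a core question:
"I have 100,000 pieces of historical copy. Which of them match the brand's tone? Is the copy for a new product Guofeng or Japanese-style? Can AI classify it automatically?"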
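II. Pain Points
- Judging the tone of copy relies mostly on manual review, which is slow and highly subjective.
- Copy libraries that mix languages (Chinese, English, French, Japanese) lack a unified classification framework.
- There is no explainable basis for a decision: "Why was this piece classified as French-style?" goes unanswered.
⇒ Build a multilingual copy classifier in Python based on keyword weights + TF-IDF + naive Bayes that outputs both the predicted class and the evidence behind it.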
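III. Core Logic
1. Feature engineering for classification
For each tone, extract high-frequency feature words plus markers of language style: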
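Tone | Core keywords | Language-style features
--- | --- | ---
Guofeng (国风) | 诗意、留白、水墨、禅意、青瓷、烟雨、新中式 | Many four-character phrases, neat parallelism, dense imagery
French (法式) | 浪漫、慵懒、chic、effortless、巴黎、左岸 | Many French loanwords, long sentences, stacked adjectives
Japanese (日系) | 极简、克制、wabi-sabi、留白、功能性、东京 | Many short sentences, many particles (の/に/で), a sense of quiet
American (美式) | bold、iconic、fearless、statement、trendsetter | Many exclamation marks, many all-caps words, a short and punchy rhythm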
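The keyword side of this feature set can be sketched as a simple lookup that also records which words matched, which is what makes a decision explainable. This is a minimal sketch, not the script's actual method: the `STYLE_KEYWORDS` lists are shortened, hypothetical samples of the keywords above, and `keyword_hits` and `score_styles` are illustrative names.

```python
from collections import Counter

# Hypothetical, shortened keyword lists taken from the table above (illustration only).
STYLE_KEYWORDS = {
    "国风": ["诗意", "留白", "水墨", "烟雨", "新中式"],
    "法式": ["浪漫", "慵懒", "chic", "effortless", "巴黎"],
    "日系": ["极简", "克制", "wabi-sabi", "留白", "东京"],
    "美式": ["bold", "iconic", "fearless", "statement"],
}


def keyword_hits(text: str) -> dict:
    """Return, per style, the keywords found in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return {
        style: [kw for kw in keywords if kw in lowered]
        for style, keywords in STYLE_KEYWORDS.items()
    }


def score_styles(text: str) -> Counter:
    """Count matched keywords per style; the matched words double as the evidence."""
    return Counter({style: len(found) for style, found in keyword_hits(text).items()})


sample = "法式慵懒,effortless chic 的最高境界"
hits = keyword_hits(sample)
scores = score_styles(sample)
print(hits["法式"])            # the evidence behind the French-style score
print(scores.most_common(1))   # the best-scoring style and its hit count
```

Substring matching is crude (a keyword inside a longer word also counts), and a shared keyword such as 留白 scores for both Guofeng and Japanese-style copy, which is one reason a statistical model is layered on top of the dictionary.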
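2. Algorithm choice: naive Bayes
P(style | copy) ∝ P(style) × Π P(word | style), with the product taken over the words in the copy
predicted style = argmax over styles of P(style | copy)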
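As a worked example of this rule, the sketch below scores a two-word piece of copy against two styles. The priors and word probabilities in `PRIORS` and `WORD_PROBS` are invented for illustration; they are not estimated from any data. Sums of logarithms replace the raw product to avoid underflow on longer texts.

```python
import math

# Hypothetical priors and per-style word probabilities (illustration only).
PRIORS = {"法式": 0.5, "美式": 0.5}
WORD_PROBS = {
    "法式": {"chic": 0.30, "effortless": 0.20, "bold": 0.05},
    "美式": {"chic": 0.05, "effortless": 0.05, "bold": 0.40},
}


def log_posteriors(tokens):
    """Unnormalized log P(style | copy) = log P(style) + sum of log P(word | style)."""
    return {
        style: math.log(prior) + sum(math.log(WORD_PROBS[style][t]) for t in tokens)
        for style, prior in PRIORS.items()
    }


def normalize(log_scores):
    """Turn unnormalized log scores into probabilities that sum to 1."""
    peak = max(log_scores.values())
    exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
    total = sum(exp_scores.values())
    return {k: v / total for k, v in exp_scores.items()}


tokens = ["chic", "effortless"]
posterior = normalize(log_posteriors(tokens))
predicted = max(posterior, key=posterior.get)  # the argmax step
print(predicted, round(posterior[predicted], 2))
```

By hand: the French-style score is 0.5 × 0.30 × 0.20 = 0.03 and the American-style score is 0.5 × 0.05 × 0.05 = 0.00125, so the normalized French-style posterior is 0.03 / 0.03125 = 0.96.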
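Advantages:
- Stable in multi-class settings and fast to train
- Can report how much each feature word contributes to a classification (explainability)
- Performs well on short texts (copy is typically 10-50 words)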
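One way to read the explainability claim is as a per-word log-likelihood ratio between two styles. The sketch below trains a tiny multinomial naive Bayes model with Laplace smoothing on a hypothetical four-sample corpus (`TRAIN`) and reports which words push a text toward one style. The function names and data are assumptions for illustration, not the script's own code.

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training set of pre-tokenized copy (illustration only).
TRAIN = [
    (["parisian", "chic", "effortless"], "法式"),
    (["romantic", "chic", "dress"], "法式"),
    (["bold", "iconic", "jeans"], "美式"),
    (["bold", "fearless", "statement"], "美式"),
]


def train(samples, alpha=1.0):
    """Estimate log P(style) and Laplace-smoothed log P(word | style)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(samples)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab}
    return log_prior, log_like


def explain(tokens, target, other, log_like):
    """Per-token log-likelihood ratio; positive values favour `target` over `other`."""
    return {
        t: log_like[target][t] - log_like[other][t]
        for t in tokens
        if t in log_like[target]  # skip words never seen in training
    }


log_prior, log_like = train(TRAIN)
contrib = explain(["chic", "bold", "dress"], "法式", "美式", log_like)
for token, value in sorted(contrib.items(), key=lambda kv: -kv[1]):
    print(f"{token:>6}: {value:+.3f}")
```

Here "chic" appears twice in French-style copy and never in American-style copy, so after smoothing its ratio is log(3/1); "bold" is the mirror image and gets a negative value. A ranked list of this kind is a possible answer to "why was this classified as French-style?".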
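3. Mixed-language processing strategy
Chinese copy → jieba tokenization → feature extraction
English copy → whitespace tokenization → lowercase normalization
French copy → whitespace tokenization → lowercase normalization
Japanese copy → nagisa tokenization → feature extraction
All languages: stop-word removal → TF-IDF vectorization → naive Bayes classification
Note: the script in Section IV does not call jieba or nagisa. Its tokenize method uses a simplified, dependency-free approximation based on regular expressions and character bigrams.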
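The routing step can be sketched without external libraries. The version below is an assumption-laden illustration: it detects the script with regular expressions, falls back to character bigrams for Chinese and Japanese runs, and accepts optional segmenter callables through a hypothetical `cjk_tokenizers` argument (for instance `{"zh": jieba.lcut}` if jieba is installed).

```python
import re

KANA = re.compile(r"[\u3040-\u30ff]")                  # hiragana + katakana
HAN = re.compile(r"[\u4e00-\u9fff]")                   # Han characters (Chinese and kanji)
CJK_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")
LATIN_WORD = re.compile(r"[a-zà-öø-ÿ]+(?:[-'][a-zà-öø-ÿ]+)*")


def detect_script(text):
    """Kana is checked first because Japanese text also contains Han characters."""
    if KANA.search(text):
        return "ja"
    if HAN.search(text):
        return "zh"
    if LATIN_WORD.search(text.lower()):
        return "latin"
    return "unknown"


def char_bigrams(run):
    """Fallback segmentation: overlapping two-character windows."""
    return [run[i:i + 2] for i in range(len(run) - 1)] or [run]


def tokenize(text, cjk_tokenizers=None):
    """Lowercased Latin words first, then segmented CJK runs.

    `cjk_tokenizers` is a hypothetical hook: a dict mapping "zh" or "ja"
    to a callable that splits a string into a list of tokens.
    """
    cjk_tokenizers = cjk_tokenizers or {}
    segment = cjk_tokenizers.get(detect_script(text))
    tokens = LATIN_WORD.findall(text.lower())
    for run in CJK_RUN.findall(text):
        tokens.extend(segment(run) if segment else char_bigrams(run))
    return tokens


print(detect_script("東京のシャツ"))
print(tokenize("法式慵懒 chic"))
```

Character bigrams are a rough stand-in for real word segmentation, but they keep the example self-contained and still let TF-IDF pick up recurring two-character words such as 法式 or 慵懒.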
IV. Modular Code (text_style_classifier.py)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
text_style_classifier.py
Automatic classifier for the cultural tone of apparel copy.
Recognizes four styles: 国风 (Guofeng), 法式 (French), 日系 (Japanese), 美式 (American).
Dependencies: numpy, pandas, matplotlib, scikit-learn
Install: pip install numpy pandas matplotlib scikit-learn
"""
import re
from collections import Counter
from typing import Dict, List  # needed by the type hints used below

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline

# Font setup so that Chinese labels render in matplotlib
rcParams['font.sans-serif'] = ['Noto Sans CJK SC', 'SimHei', 'Microsoft YaHei']
rcParams['axes.unicode_minus'] = False
# ──────────────────────────────────────────────
# 1. Tone dictionary module
# ──────────────────────────────────────────────
class StyleDictionary:
    """
    Keyword dictionary for the cultural tone of apparel copy.
    Built from each culture's aesthetic traits and the language habits
    of fashion communication.
    """
    # Guofeng (国风) keywords
    GUOFENG = [
        '诗意', '意境', '留白', '水墨', '青瓷', '烟雨', '江南', '新中式',
        '禅意', '东方', '古典', '雅致', '清雅', '国潮', '汉服', '旗袍',
        '刺绣', '织锦', '敦煌', '飞天', '山水', '花鸟', '梅兰竹菊',
        '淡雅', '温润', '古韵', '风雅', '锦绣', '丝竹', '墨色',
        'poetic', 'oriental', 'chinoiserie', 'new chinese'
    ]
    # French-style (法式) keywords (duplicate entries removed)
    FRENCH = [
        '浪漫', '慵懒', 'chic', 'effortless', '巴黎', '左岸', '马赛',
        '法式', '优雅', '精致', '复古', '女人味', '随性',
        'romantic', 'parisian', 'bohème', 'élégant', 'sophistiqué',
        'avant-garde', 'couture', 'très', 'très chic', 'je ne sais quoi',
        'art de vivre', 'savoir-faire'
    ]
    # Japanese-style (日系) keywords
    JAPANESE = [
        '极简', '克制', 'wabi', 'sabi', '留白', '功能性', '东京',
        '日式', '安静', '朴素', '侘寂', '枯山水', '禅', '静寂',
        'minimal', 'tokyo', 'japanese', 'kimono', 'zen', 'serenity',
        'transient', 'imperfect', 'aesthetic', 'harmony', 'muji',
        'less is more', 'clean', 'neat'
    ]
    # American-style (美式) keywords
    AMERICAN = [
        'bold', 'iconic', 'fearless', 'statement', 'trendsetter',
        'confident', 'fierce', 'unapologetic', 'legendary', 'timeless',
        'streetwear', 'hip-hop', 'urban', 'grunge', 'preppy',
        'all-american', 'classic', 'vintage', 'retro', 'denim',
        '自信', '大胆', '无畏', '标志性', '街头', '经典'
    ]
    # Stop words (shared by Chinese and English)
    STOPWORDS = {
        # Chinese stop words
        '的', '了', '在', '是', '我', '有', '和', '就', '不', '人',
        '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去',
        '你', '会', '着', '没有', '看', '好', '自己', '这',
        # English stop words
        'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to',
        'for', 'of', 'with', 'by', 'from', 'is', 'are', 'was', 'were',
        'be', 'been', 'have', 'has', 'had', 'do', 'does', 'did', 'will',
        'would', 'could', 'should', 'may', 'might', 'must', 'shall',
        'this', 'that', 'these', 'those', 'it', 'its', 'they', 'them',
        'we', 'you', 'he', 'she', 'his', 'her', 'our', 'your', 'their',
        'not', 'no', 'so', 'up', 'out', 'if', 'about', 'who', 'get',
        'which', 'go', 'me', 'when', 'make', 'can', 'like', 'time',
        'just', 'very', 'now', 'new', 'because', 'people', 'such',
        'only', 'way', 'thing', 'every', 'after', 'between', 'under',
        'before', 'never', 'always', 'something', 'anything', 'everything',
    }

    @classmethod
    def get_all_dictionaries(cls) -> dict:
        """Return the keyword lists of all tones, keyed by tone name."""
        return {
            '国风': cls.GUOFENG,
            '法式': cls.FRENCH,
            '日系': cls.JAPANESE,
            '美式': cls.AMERICAN
        }
# ──────────────────────────────────────────────
# 2. Synthetic data generation module
# ──────────────────────────────────────────────
class SyntheticTextGenerator:
    """
    Generate a simulated dataset of apparel copy.
    Every piece of copy carries an explicit tone label.
    """
    TEMPLATES = {
        '国风': [
            "一袭{adj}染就{object},{poetic_end}",
            "{object}之上,{adj}如诗,意境悠远",
            "新中式美学,{adj}中见真章,{object}间的微妙平衡",
            "淡雅如{object},{adj}而不张扬,东方韵味自在其中",
            "以{object}为画布,{adj}为笔墨,书写当代国风篇章",
            "青瓷般的{adj},烟雨江南的{object},一袭在身便有了诗意",
            "古韵今风,{adj}与{object}的完美邂逅",
            "禅意入衣,{object}间尽显东方{adj}",
            "水墨晕染的{object},{adj}得恰到好处",
            "温润如玉,{adj}似水,这件{object}藏着江南的秘密",
        ],
        '法式': [
            "Parisian {adj_en}, effortless {object_en} for the modern soul",
            "L'art de vivre: {adj_en} {object_en} that whispers French elegance",
            "Chic à la française: {adj_en} {object_en}, timeless and bold",
            "Savoir-faire meets modern {object_en} — {adj_en}, always",
            "Left Bank vibes: {adj_en} {object_en} with a touch of bohème",
            "Très chic {object_en}, {adj_en} and unapologetically French",
            "浪漫如巴黎{object},{adj}中透着法式慵懒",
            "法式优雅,{adj}的{object},effortless chic 的最高境界",
            "马赛的阳光洒在{object}上,{adj}得刚刚好",
            "左岸咖啡馆里的{object},{adj}得让人心动",
        ],
        '日系': [
            "{object}の美しさは、{adj}さにある",
            "東京の{object}、{adj}な美意識",
            "Less is more: {adj_en} {object_en} with Japanese precision",
            "Wabi-sabi の心で作られた{object}、{adj}で静かな美",
            "枯山水のように、{adj}で{object}に宿る禅",
            "Minimalist {object_en}: {adj_en}, clean, and distinctly Tokyo",
            "侘寂の{object}、{adj}で不完全な美",
            "日式{object}、{adj}の中に見える機能美",
            "静寂の{object}、{adj}で温かい",
            "Clean lines, {adj_en} {object_en} — the Japanese way",
        ],
        '美式': [
            "OWN THE ROOM in this {adj_en} {object_en}!",
            "Bold. Fearless. {adj_en} {object_en} for the iconic you.",
            "Street-ready {object_en} that's {adj_en}, unapologetic",
            "Make a statement: {adj_en} {object_en} = pure confidence",
            "All-American {object_en}: {adj_en}, classic, and absolutely legendary",
            "THIS is how you do {object_en} — {adj_en} and unforgettable!",
            "大胆{object},{adj}到骨子里,穿上就是街头王者",
            "无畏的{object},{adj}且标志性,定义你的风格",
            "自信从{object}开始,{adj}得不费吹灰之力",
            "Iconic {object_en}: {adj_en}, fierce, trendsetting",
        ],
    }
    # Word banks used to fill the template slots
    ADJECTIVES_CN = {
        '国风': ['淡雅', '清雅', '温润', '古韵', '雅致', '素净', '清冷',
               '灵动', '飘逸', '婉约', '隽永', '空灵'],
        '法式': ['慵懒', '浪漫', '精致', '优雅', '随性', '迷人', '细腻',
               '妩媚', '洒脱', '从容', '梦幻'],
        '日系': ['克制', '安静', '朴素', '干净', '纯粹', '素雅', '淡然',
               '内敛', '简洁', '静谧', '温润'],
        '美式': ['大胆', '自信', '无畏', '标志性', '经典', '传奇', '震撼',
               '鲜明', '独特', '耀眼', '霸气'],
    }
    OBJECTS_CN = {
        '国风': ['裙裾', '衣襟', '袖口', '盘扣', '刺绣', '丝帛', '青衫',
               '罗裙', '云肩', '襦裙', '披帛', '马面裙'],
        '法式': ['连衣裙', '套装', '风衣', '衬衫', '半裙', '针织衫', '外套',
               '吊带裙', '阔腿裤', '西装'],
        '日系': ['衬衫', '裤装', '外套', '连衣裙', '针织', '制服', '风衣',
               '裙装', '马甲', '工装裤'],
        '美式': ['牛仔裤', 'T恤', '夹克', '球鞋', '卫衣', '短裙', '西装',
               '大衣', '背心', '工装'],
    }
    # The Guofeng templates have no English slots, so it has no English word bank
    ADJECTIVES_EN = {
        '法式': ['romantic', 'elegant', 'chic', 'effortless', 'timeless',
               'sophisticated', 'bohème', 'iconic', 'bold', 'graceful'],
        '日系': ['minimal', 'serene', 'tranquil', 'pure', 'clean', 'quiet',
               'refined', 'subtle', 'austere', 'harmonious'],
        '美式': ['bold', 'fierce', 'iconic', 'fearless', 'legendary',
               'unapologetic', 'timeless', 'confident', 'statement', 'powerful'],
    }
    OBJECTS_EN = {
        '法式': ['dress', 'ensemble', 'trench', 'blouse', 'skirt', 'sweater',
               'coat', 'slip dress', 'wide-leg pants', 'suit'],
        '日系': ['shirt', 'trousers', 'jacket', 'dress', 'knit', 'uniform',
               'trench', 'skirt', 'vest', 'cargo pants'],
        '美式': ['jeans', 'tee', 'jacket', 'sneakers', 'hoodie', 'mini skirt',
               'suit', 'coat', 'tank', 'workwear'],
    }

    @classmethod
    def generate(cls, n_per_class: int = 200, seed: int = 42) -> pd.DataFrame:
        """
        Generate a labeled copy dataset.
        Returns a DataFrame with columns ['text', 'style', 'style_id'].
        """
        np.random.seed(seed)
        rows = []
        style_id_map = {'国风': 0, '法式': 1, '日系': 2, '美式': 3}
        for style, templates in cls.TEMPLATES.items():
            for _ in range(n_per_class):
                template = np.random.choice(templates)
                # Chinese slot fillers
                adj = np.random.choice(cls.ADJECTIVES_CN[style])
                obj = np.random.choice(cls.OBJECTS_CN[style])
                # English slot fillers (fallback values for styles without an English bank)
                adj_en = np.random.choice(cls.ADJECTIVES_EN.get(style, ['beautiful']))
                obj_en = np.random.choice(cls.OBJECTS_EN.get(style, ['dress']))
                # Poetic ending (only used by one Guofeng template)
                poetic_ends = ['诗意盎然', '韵味悠长', '意境天成', '风雅自来',
                               '美不胜收', '恰到好处']
                poetic_end = np.random.choice(poetic_ends)
                # str.format ignores keyword arguments the template does not use
                text = template.format(
                    adj=adj, object=obj,
                    adj_en=adj_en, object_en=obj_en,
                    poetic_end=poetic_end
                )
                rows.append({
                    'text': text,
                    'style': style,
                    'style_id': style_id_map[style]
                })
        df = pd.DataFrame(rows)
        # Shuffle the rows
        df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
        return df
# ──────────────────────────────────────────────
# 3. Text preprocessing module
# ──────────────────────────────────────────────
class TextPreprocessor:
    """Text preprocessing: cleaning + tokenization + stop-word removal."""
    # Han characters (Chinese, but also the kanji in Japanese text)
    CN_PATTERN = re.compile(r'[\u4e00-\u9fff]+')
    # Japanese kana (hiragana + katakana)
    JP_PATTERN = re.compile(r'[\u3040-\u309f\u30a0-\u30ff]+')
    # Latin letters
    EN_PATTERN = re.compile(r'[a-zA-Z]+')

    @classmethod
    def detect_language(cls, text: str) -> str:
        """Detect the dominant language of a text (ratios count matched runs, not characters)."""
        cn_count = len(cls.CN_PATTERN.findall(text))
        jp_count = len(cls.JP_PATTERN.findall(text))
        en_count = len(cls.EN_PATTERN.findall(text))
        total = cn_count + jp_count + en_count
        if total == 0:
            return 'unknown'
        # Test kana before Han: Japanese copy also contains kanji, which
        # CN_PATTERN matches, so testing 'zh' first would mislabel it.
        if jp_count / total > 0.3:
            return 'jp'
        elif cn_count / total > 0.3:
            return 'zh'
        elif en_count / total > 0.5:
            return 'en'
        else:
            return 'mixed'
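
    @classmethod
    def tokenize(cls, text: str) -> List[str]:
        """
        Simple tokenization with no external libraries.
        Chinese: each run of Han characters plus its character bigrams
        Japanese: split on whitespace and common particles
        English: lowercase words of three or more letters
        """
        tokens = []
        # Chinese: every Han run plus its character bigrams
        cn_chars = cls.CN_PATTERN.findall(text)
        for word in cn_chars:
            tokens.append(word)
            if len(word) >= 2:
                for i in range(len(word) - 1):
                    tokens.append(word[i:i+2])
        # Japanese: split on whitespace and particles, only when kana is present.
        # The original pattern r'[\\s...]' matched a literal backslash and the
        # letter "s" instead of whitespace; r'[\s...]' is the intended class.
        if cls.JP_PATTERN.search(text):
            jp_words = re.split(r'[\sのにでがをは]', text)
            tokens.extend([w for w in jp_words if len(w) > 1])
        # English: extract lowercase words
        en_words = cls.EN_PATTERN.findall(text.lower())
        tokens.extend([w for w in en_words if len(w) > 2])
        # Remove stop words and single-character tokens
        stopwords = StyleDictionary.STOPWORDS
        tokens = [t for t in tokens if t not in stopwords and len(t) > 1]
        return tokens

    @classmethod
    def preprocess(cls, texts: pd.Series) -> pd.Series:
        """Tokenize every text and join the tokens with spaces for the vectorizer."""
        return texts.apply(lambda t: ' '.join(cls.tokenize(t)))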
# ──────────────────────────────────────────────
# 4. Classifier training and evaluation module
# ──────────────────────────────────────────────
class StyleClassifier:
    """Tone classifier for apparel copy."""
    STYLE_NAMES = {0: '国风', 1: '法式', 2: '日系', 3: '美式'}
    STYLE_COLORS = {0: '#E74C3C', 1: '#3498DB', 2: '#2ECC71', 3: '#F39C12'}

    def __init__(self):
        self.vectorizer = TfidfVectorizer(
            max_features=2000,
            ngram_range=(1, 2),
            min_df=2,
            max_df=0.8
        )
        self.classifier = MultinomialNB(alpha=0.1)
        self.pipeline = Pipeline([
            ('tfidf', self.vectorizer),
            ('nb', self.classifier)
        ])
        self.is_trained = False
        self.feature_importance = {}

    def train(self, X_train: pd.Series, y_train: pd.Series):
        """Train the classifier."""
        X_processed = TextPreprocessor.preprocess(X_train)
        self.pipeline.fit(X_processed, y_train)
        self.is_trained = True
        # Record the most probable features per class on the training set
        feature_names = self.vectorizer.get_feature_names_out()
        log_prob = self.classifier.feature_log_prob_
        self.feature_importance = {}
        for i, style_id in enumerate(self.classifier.classes_):
            style_name = self.STYLE_NAMES.get(style_id, str(style_id))
            # Take the 20 features with the highest log P(feature | class)
            top_indices = np.argsort(log_prob[i])[-20:]
            self.feature_importance[style_name] = [
                (feature_names[idx], round(np.exp(log_prob[i][idx]), 4))
                for idx in reversed(top_indices)
            ]

    def predict(self, texts: pd.Series) -> np.ndarray:
        """Predict the tone class of each text."""
        if not self.is_trained:
            raise RuntimeError("分类器未训练,请先调用 train()")
        X_processed = TextPreprocessor.preprocess(texts)
        return self.pipeline.predict(X_processed)

    def predict_proba(self, texts: pd.Series) -> np.ndarray:
        """Predict the probability of each class for each text."""
        if not self.is_trained:
            raise RuntimeError("分类器未训练")
        X_processed = TextPreprocessor.preprocess(texts)
        return self.pipeline.predict_proba(X_processed)

    def evaluate(self, X_test: pd.Series, y_test: pd.Series) -> Dict:
        """Evaluate model performance on a held-out set."""
        y_pred = self.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        # Pass labels explicitly so the report and the 4x4 matrix stay valid
        # even if a class is missing from the test split.
        labels = sorted(self.STYLE_NAMES)
        report = classification_report(y_test, y_pred,
                                       labels=labels,
                                       target_names=[self.STYLE_NAMES[i] for i in labels],
                                       output_dict=True,
                                       zero_division=0)
        cm = confusion_matrix(y_test, y_pred, labels=labels)
        return {
            'accuracy': accuracy,
            'report': report,
            'confusion_matrix': cm,
            'y_pred': y_pred
        }

    def analyze_misclassifications(self,
                                   X_test: pd.Series,
                                   y_test: pd.Series,
                                   texts: pd.Series) -> pd.DataFrame:
        """Collect the misclassified samples; `texts` must be aligned row by row with `X_test`."""
        y_pred = self.predict(X_test)
        # Positional boolean mask, independent of the Series index
        mask = y_test.to_numpy() != y_pred
        misclassified = texts[mask].reset_index(drop=True)
        true_labels = y_test[mask].reset_index(drop=True)
        pred_labels = pd.Series(y_pred[mask])
        rows = []
        for i in range(len(misclassified)):
            rows.append({
                'text': misclassified.iloc[i],
                'true_style': self.STYLE_NAMES.get(true_labels.iloc[i], '?'),
                'pred_style': self.STYLE_NAMES.get(pred_labels.iloc[i], '?')
            })
        return pd.DataFrame(rows)
# ──────────────────────────────────────────────
# 5. Visualization dashboard module
# ──────────────────────────────────────────────
class Dashboard:
    """Visualization dashboard for the classification results."""
    STYLE_COLORS = {'国风': '#E74C3C', '法式': '#3498DB',
                    '日系': '#2ECC71', '美式': '#F39C12'}

    @classmethod
    def plot_dashboard(cls,
                       classifier: StyleClassifier,
                       eval_result: Dict,
                       test_texts: pd.Series,
                       train_df: pd.DataFrame,
                       misclass_df: pd.DataFrame,
                       filename: str = "style_classification_dashboard.png"):
        fig = plt.figure(figsize=(22, 18))
        fig.suptitle('服饰文案文化调性自动分类 — 分析仪表盘',
                     fontsize=20, fontweight='bold', y=0.99)
        # ── Figure 1: confusion matrix ──
        ax1 = fig.add_subplot(2, 3, 1)
        cls._plot_confusion_matrix(ax1, eval_result['confusion_matrix'])
        # ── Figure 2: per-class precision / recall / F1 ──
        ax2 = fig.add_subplot(2, 3, 2)
        cls._plot_per_class_accuracy(ax2, eval_result['report'])
        # ── Figure 3: feature importance (horizontal bars in place of a word cloud) ──
        ax3 = fig.add_subplot(2, 3, 3)
        cls._plot_feature_importance(ax3, classifier.feature_importance)
        # ── Figure 4: language distribution ──
        ax4 = fig.add_subplot(2, 3, 4)
        cls._plot_language_distribution(ax4, train_df)
        # ── Figure 5: misclassification analysis ──
        ax5 = fig.add_subplot(2, 3, 5)
        cls._plot_misclassification(ax5, misclass_df)
        # ── Figure 6: distribution of predicted probabilities ──
        ax6 = fig.add_subplot(2, 3, 6)
        cls._plot_probability_distribution(ax6, classifier, test_texts)
        plt.tight_layout(rect=[0, 0, 1, 0.96])
        plt.savefig(filename, dpi=150, bbox_inches='tight')
        plt.show()
        print(f"[INFO] 仪表盘已保存: {filename}")

    @classmethod
    def _plot_confusion_matrix(cls, ax, cm: np.ndarray):
        """Confusion-matrix heat map."""
        im = ax.imshow(cm, cmap='YlOrRd', aspect='auto')
        labels = ['国风', '法式', '日系', '美式']
        ax.set_xticks(range(4))
        ax.set_yticks(range(4))
        ax.set_xticklabels(labels, fontsize=9)
        ax.set_yticklabels(labels, fontsize=9)
        ax.set_xlabel('预测类别')
        ax.set_ylabel('真实类别')
        ax.set_title('混淆矩阵', fontsize=13, fontweight='bold')
        for i in range(4):
            for j in range(4):
                # White text on dark cells, black text on light cells
                color = 'white' if cm[i, j] > cm.max() * 0.5 else 'black'
                ax.text(j, i, str(cm[i, j]), ha='center', va='center',
                        color=color, fontsize=11, fontweight='bold')
        plt.colorbar(im, ax=ax, shrink=0.8)

    @classmethod
    def _plot_per_class_accuracy(cls, ax, report: Dict):
        """Grouped bars of precision / recall / F1 per class."""
        styles = ['国风', '法式', '日系', '美式']
        metrics = ['precision', 'recall', 'f1-score']
        x = np.arange(len(styles))
        width = 0.25
        for i, metric in enumerate(metrics):
            values = [report[s][metric] for s in styles]
            ax.bar(x + i * width, values, width,
                   label=metric, color=['#3498db', '#e74c3c', '#2ecc71'][i])
        ax.set_xticks(x + width)
        ax.set_xticklabels(styles, fontsize=9)
        ax.set_ylabel('Score')
        ax.set_title('各类别分类性能', fontsize=13, fontweight='bold')
        ax.legend(fontsize=8)
        ax.grid(axis='y', alpha=0.3)
        ax.set_ylim(0, 1.15)

    @classmethod
    def _plot_feature_importance(cls, ax, feat_imp: Dict):
        """Top feature words of each tone."""
        styles = list(feat_imp.keys())[:4]
        n_words = 10
        y_pos = np.arange(n_words)
        colors = ['#E74C3C', '#3498DB', '#2ECC71', '#F39C12']
        for i, (style, words) in enumerate(feat_imp.items()):
            if i >= 4:
                break
            top_words = words[:n_words]
            scores = [w[1] for w in top_words][::-1]
            labels = [w[0] for w in top_words][::-1]
            # Shift each tone's block of bars so the four blocks do not overlap
            offset = (i - 1.5) * (n_words + 0.5)
            ax.barh(y_pos + offset, scores, height=0.7,
                    color=colors[i], alpha=0.7, label=style)
        ax.set_yticks(y_pos)
        all_labels = []
        for style in styles:
            all_labels.extend([w[0] for w in feat_imp[style][:n_