news 2026/4/15 15:47:19

Must-Read Frontiers of Multimodal Large Models (Highly Practical): In-Depth Readings of Selected 2025 Papers


张小明

Front-End Development Engineer


From the 315 articles published between 2025-12-12 and 2025-12-19, we selected 10 outstanding works to share with readers. The main research directions include: the emotion gap in emotion recognition; long-context understanding under vision-text compression; construction of a multimodal dataset for full-hand tactile sensing; interactive intelligent digital humans; the impact of speech-modality integration on large language models; knowledge-driven reasoning in city navigation; long-duration, high-resolution audio-driven avatar video generation; complex mathematical expression recognition; multimodal video generation and editing; and self-supervised visual learning in multimodal large language models.

1.Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors

Authors: Kejun Liu, Yuanyuan Liu, Lin Wei, Chang Tang, Yibing Zhan, Zijing Chen, Zhe Chen

Affiliations: China University of Geosciences (Wuhan); Huazhong University of Science and Technology; Wuhan University; La Trobe University

https://arxiv.org/abs/2512.16485

Abstract

Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field heavily relies on facial expression recognition (FER) because the visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, a spontaneous emotion induction paradigm is exploited with stimulus material, during which non-invasive eye behavior data, like eye movement sequences and eye fixation maps, is captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for multimodal ER and FER are separately annotated. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that the EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at https://github.com/kejun1/EMER.

Brief review: This paper is motivated by the emotion gap that traditional facial expression recognition (FER) methods face in emotion recognition (ER). The authors build the novel Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset, combining facial expression videos, eye-movement sequences, and eye-fixation maps; genuine emotional data are collected using a stimulus-based induction paradigm that elicits spontaneous emotional responses. The authors further design an effective EMER Transformer (EMERT) that extracts emotional features through adversarial learning and a multitask Transformer. Experiments show that EMERT significantly outperforms other state-of-the-art methods across multimodal benchmarks, confirming the importance of eye behaviors for robust emotion recognition.

2.VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

Authors: Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang

Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, CAS; Tencent Hunyuan Team

https://arxiv.org/abs/2512.15649

Abstract

The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model’s ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.

Brief review: This paper examines how vision-text compression (VTC) affects the long-context understanding of vision-language models (VLMs) and proposes a new benchmark framework, VTCBench. The motivation lies in the limitations of existing long-context methods, especially the performance drop as context length grows. Through tasks including VTC-Retrieval, VTC-Reasoning, and VTC-Memory, the authors comprehensively evaluate a range of VLMs. The results show that under vision-text compression, current VLMs understand long contexts poorly, with a significant gap versus text models on complex reasoning and memory tasks. The study offers important guidance for designing future long-context VLMs.
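
The 3x-20x token-compression ratio quoted in the abstract is easy to sanity-check with back-of-the-envelope arithmetic. All constants below (text tokens per page, vision tokens emitted per page) are illustrative assumptions, not values from the paper:

```python
# Back-of-the-envelope look at vision-text compression: a long text is
# rendered onto page images, and the vision encoder represents each page
# with far fewer tokens than the BPE tokenizer would need for the raw text.
# Every constant here is an illustrative assumption.

pages = 10
text_tokens_per_page = 1_500    # dense page of prose (assumed)
vision_tokens_per_page = 256    # tokens the vision encoder emits per page (assumed)

text_tokens = pages * text_tokens_per_page
vision_tokens = pages * vision_tokens_per_page

ratio = text_tokens / vision_tokens
print(f"compression: {ratio:.1f}x")  # → compression: 5.9x
```

Under these assumptions the ratio lands at roughly 5.9x, inside the 3x-20x range the abstract cites; pushing the renderer to pack more text per page raises the ratio, which is exactly the trade-off VTCBench probes.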

3.OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

Authors: Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou, Rishi Shiv, Yaqi Li, Haoyu Xiong, Crystal Elaine Owens, Yilun Du, Yiyue Luo, Xianyi Cheng, Antonio Torralba, Wojciech Matusik, Paul Pu Liang

Affiliations: MIT; Duke University; Brown University; University of Washington; Harvard University

https://arxiv.org/abs/2512.16842

Abstract

The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

Brief review: This paper introduces OpenTouch, the first full-hand tactile dataset captured in the wild, aiming to fill the gap between visual perception and physical interaction. The dataset contains 5.1 hours of synchronized video, touch, and hand-pose data across diverse everyday environments and objects. With low-cost sensors and a practical data-collection pipeline, the authors show that tactile signals are important for grasp understanding and for strengthening cross-modal alignment. Experiments show that multimodal methods combining touch, video, and pose significantly improve retrieval and classification accuracy, underscoring the key role of touch in natural manipulation.

4.Towards Interactive Intelligence for Digital Humans

Authors: Yiyi Cai, Xuangeng Chu, Xiwei Gao, Sitong Gong, Yifei Huang, Caixin Kang, Kunhang Li, Haiyang Liu, Ruicong Liu, Yun Liu, Dianwen Ng, Zixiong Su, Erwin Wu, Yuhan Wu, Dingkun Yan, Tianyu Yan, Chang Zeng, Bo Zheng, You Z

Affiliations: Shanda AI Research; The University of Tokyo; Institute of Science Tokyo; National Institute of Informatics

https://arxiv.org/abs/2512.13674

Abstract

We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.

Brief review: This paper presents "Mio," an interactive-intelligence digital-human framework aimed at personality-aligned expression, adaptive interaction, and self-evolution. By integrating five specialized modules, Mio achieves fluid and consistent multimodal interaction. Experiments show the framework outperforms existing methods across all evaluated dimensions, marking a shift for digital humans from surface imitation toward intelligent interaction.

5.Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Authors: Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle

Affiliations: Fondazione Bruno Kessler; Barcelona Supercomputing Center; University of Zurich; ETH Zurich; Universitat Politècnica de Catalunya; Universitat Politècnica de València; AI-Bio Convergence Research Institute; Charles University; KIT

https://arxiv.org/abs/2512.16378

Abstract

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

Brief review: This paper studies the effectiveness of integrating the speech modality into large language models (LLMs), presenting the "Hearing to Translate" test suite and systematically comparing five state-of-the-art SpeechLLMs against 16 strong direct and cascaded systems. The results show that although SpeechLLMs match cascaded systems in certain settings, cascaded systems remain the most reliable overall, underscoring the importance of integrating an LLM for high-quality speech translation.

6.City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs

Authors: Dwip Dalal, Utkarsh Mishra, Narendra Ahuja, Nebojsa Jojic

Affiliations: University of Illinois Urbana-Champaign; Texas A&M University; Microsoft Research, Redmond

https://arxiv.org/abs/2512.15933

Abstract

Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs and standard reasoning techniques (e.g., Chain-of-Thought, Reflection) significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent’s internal reasoning by probing an explicit cognitive map (key landmarks and directions toward the destination) from the MLLMs, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/

Brief review: This paper proposes a novel city-navigation evaluation task, Sparsely Grounded Visual Navigation, designed to assess the reasoning abilities of multimodal large language models (MLLMs) in complex urban environments. The authors build the CityNav benchmark, spanning four global cities, to test whether MLLMs can navigate a city from visual input and internal reasoning alone, without additional environmental annotations. Experiments show that existing reasoning techniques underperform on this task, while the proposed Verbalization of Path (VoP) method markedly improves navigation success, revealing both the potential and the limits of MLLMs on dynamic, knowledge-intensive decision-making tasks.

7.KlingAvatar 2.0 Technical Report

Authors: Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Yuan Gao, Kang He, Jingyun Hua, Boyuan Jiang, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Yuan Liu, Shun Lu, Yongsen Mao, Yingchao Shao, Huafeng Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Chao Wang, Xuebo Wang, et al. (3 additional authors not shown)

Affiliations: Kuaishou Technology

https://arxiv.org/abs/2512.13313

Abstract

Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.

Brief review: This paper presents KlingAvatar 2.0, a unified framework for generating long-duration, high-resolution, audio-driven avatar videos. The motivation is that existing methods are inefficient at generating long videos and often suffer temporal drifting and quality degradation. To address this, the authors propose a spatio-temporal cascade framework that first generates low-resolution blueprint keyframes and then progressively refines them into high-resolution, temporally coherent sub-clips. They also introduce a multimodal Co-Reasoning Director that uses large language models for user-intent inference and instruction alignment. Experiments show strong visual clarity, identity preservation, and multimodal instruction following, substantially improving both the quality and the efficiency of long-form video generation.

8.Complex Mathematical Expression Recognition: Benchmark, Large-Scale Dataset and Strong Baseline

Authors: Weikang Bai, Yongkun Du, Yuchen Su, Yazhen Xie, Zhineng Chen

Affiliations: Fudan University

https://arxiv.org/abs/2512.13731

Abstract

Mathematical Expression Recognition (MER) has made significant progress in recognizing simple expressions, but the robust recognition of complex mathematical expressions with many tokens and multiple lines remains a formidable challenge. In this paper, we first introduce CMER-Bench, a carefully constructed benchmark that categorizes expressions into three difficulty levels: easy, moderate, and complex. Leveraging CMER-Bench, we conduct a comprehensive evaluation of existing MER models and general-purpose multimodal large language models (MLLMs). The results reveal that while current methods perform well on easy and moderate expressions, their performance degrades significantly when handling complex mathematical expressions, mainly because existing public training datasets are primarily composed of simple samples. In response, we propose MER-17M and CMER-3M that are large-scale datasets emphasizing the recognition of complex mathematical expressions. The datasets provide rich and diverse samples to support the development of accurate and robust complex MER models. Furthermore, to address the challenges posed by the complicated spatial layout of complex expressions, we introduce a novel expression tokenizer, and a new representation called Structured Mathematical Language, which explicitly models the hierarchical and spatial structure of expressions beyond LaTeX format. Based on these, we propose a specialized model named CMERNet, built upon an encoder-decoder architecture and trained on CMER-3M. Experimental results show that CMERNet, with only 125 million parameters, significantly outperforms existing MER models and MLLMs on CMER-Bench.

Brief review: This paper tackles the challenges of complex mathematical expression recognition (MER), in particular the weak performance of existing models on multi-line expressions containing many symbols. The authors propose the CMER-Bench benchmark and two new datasets, MER-17M and CMER-3M, to support training and evaluation on more complex expressions. By introducing a new expression tokenizer and a Structured Mathematical Language representation, they build a specialized model named CMERNet. Experiments show that CMERNet performs strongly on CMER-Bench, significantly outperforming existing MER models and multimodal large language models and demonstrating its effectiveness on complex MER tasks.

9.Kling-Omni Technical Report

Authors: Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, et al. (43 additional authors not shown)

Affiliations: Kuaishou Technology

https://arxiv.org/abs/2512.16776

Abstract

We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.

Brief review: This paper proposes Kling-Omni, a generalist generative framework that synthesizes high-fidelity video directly from multimodal visual-language inputs. Motivated by the fragmentation of today's video-generation landscape, Kling-Omni unifies video generation, editing, and intelligent reasoning in a single architecture and supports diverse forms of user input. Backed by a comprehensive data system and efficient large-scale pre-training strategies, the framework shows excellent in-context generation, reasoning-based editing, and multimodal instruction following. The results suggest Kling-Omni is not only a content-creation tool but also a meaningful step toward multimodal world simulators that can perceive, reason, generate, and interact with complex dynamic worlds.

10.Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models

Authors: Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi, Shaghayegh Roohi, Mark Granroth-Wilding, Rita Cucchiara

Affiliations: University of Modena and Reggio Emilia; AMD Silo AI

https://arxiv.org/abs/2512.15885

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.

Brief review: This paper proposes JARVIS, a framework for strengthening the visual perception of multimodal large language models (MLLMs). The motivation is that current MLLMs fall short on basic visual reasoning tasks, largely because they learn vision mainly from text descriptions, which are subjective and inherently incomplete. To address this, the authors integrate the I-JEPA self-supervised learning paradigm into the standard vision-language alignment pipeline, letting the model learn the structural and semantic regularities of images without relying solely on language supervision. Experiments show that JARVIS significantly improves performance on several vision-centric benchmarks while preserving multimodal reasoning ability.
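
The JEPA-style objective described above (a trainable predictor matched against a frozen target encoder's embeddings) can be sketched in miniature. This is a hand-rolled toy in plain Python, not the paper's implementation; every vector, dimension, and learning rate is invented:

```python
# Toy JEPA-style objective: a linear predictor maps a (frozen) context
# embedding to a predicted target embedding; the loss is the mean squared
# error against a (frozen) target-encoder embedding. One hand-computed
# gradient step shows the predictor moving toward the target.

def predict(W, ctx):
    """Linear predictor: one output per row of W (dot product with ctx)."""
    return [sum(w_i * c_i for w_i, c_i in zip(row, ctx)) for row in W]

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

ctx = [0.5, -1.0, 0.25]                   # frozen context-encoder output (toy)
target = [1.0, 0.0]                       # frozen target-encoder output (toy)
W = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]    # trainable predictor weights

loss_before = mse(predict(W, ctx), target)

lr = 0.1
pred = predict(W, ctx)
for i in range(len(W)):
    for j in range(len(ctx)):
        # d(mse)/dW[i][j] = (2/n) * (pred_i - target_i) * ctx_j
        grad = 2 * (pred[i] - target[i]) * ctx[j] / len(pred)
        W[i][j] -= lr * grad

loss_after = mse(predict(W, ctx), target)
print(loss_before > loss_after)  # the predictor moved toward the target
```

In JARVIS the predictor is realized by the early layers of an LLM and the encoders are vision foundation models; the toy above only illustrates the shape of the loss.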

How to Learn Large-Model AI?

Because the new roles are more productive than the roles they replace, society's overall productivity actually rises.

For any individual, though, all one can really say is:

"Those who master AI first will have a competitive advantage over those who master it later."

The same statement held true at the dawn of the computer, the internet, and the mobile internet.

In more than a decade at first-tier internet companies, I have mentored many junior colleagues and helped a lot of people learn and grow.

I realized that I have much experience and knowledge worth sharing, and that our skills and experience can resolve many of the confusions people run into while learning AI, so I keep organizing and sharing material despite a busy job. But the channels for spreading knowledge are limited, and many peers in the internet industry cannot get hold of the right materials to improve. For that reason I am sharing the essential large-model AI resources for free, including a beginner's mind map for large-model AI, curated books and handbooks, video tutorials, and recorded hands-on courses.

Stage 1 (10 days): Entry-Level Applications

This stage gives you a front-line understanding of large-model AI that puts you ahead of 95% of people: you can offer informed, independent, grounded opinions in discussions, and while others can only chat with an AI, you can steer it and connect large models to business systems with code.

  • What can large-model AI do?
  • How do large models acquire "intelligence"?
  • Core principles for using AI well
  • Business architecture for large-model applications
  • Technical architecture for large-model applications
  • Code example: feeding new knowledge into GPT-3.5
  • The purpose and core ideas of prompt engineering
  • The typical anatomy of a prompt
  • Instruction-tuning methodology
  • Chain of thought and tree of thought
  • Prompt attacks and defenses
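
The "typical anatomy of a prompt" item above can be made concrete with a small sketch. The section layout (role, instruction, context, few-shot examples, user input) is a common convention rather than a fixed standard, and all strings here are invented:

```python
# A minimal sketch of one common prompt anatomy: role definition, task
# instruction, supporting context, few-shot examples, and the user's input,
# assembled into a single string for a completion-style model.

def build_prompt(role, instruction, context, examples, user_input):
    few_shot = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return (
        f"{role}\n\n"
        f"Instruction: {instruction}\n\n"
        f"Context:\n{context}\n\n"
        f"Examples:\n{few_shot}\n\n"
        f"Q: {user_input}\nA:"
    )

prompt = build_prompt(
    role="You are a customer-support assistant.",
    instruction="Answer using only the provided context.",
    context="Refunds are processed within 7 business days.",
    examples=[("How long do refunds take?", "Within 7 business days.")],
    user_input="Can I get my money back quickly?",
)
print(prompt)
```

Chat-style APIs split the same pieces across system and user messages instead of one string, but the anatomy is the same.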

Stage 2 (30 days): Advanced Applications

In this stage we move into hands-on advanced work with large-model AI: learn to build a private knowledge base to extend what the AI can do, quickly develop a complete agent-based chatbot, and master the most capable large-model development frameworks while tracking the latest advances. Suited to Python and JavaScript programmers.

  • Why RAG?
  • Building a simple ChatPDF
  • Retrieval fundamentals
  • What are vector representations (embeddings)?
  • Vector databases and vector retrieval
  • RAG based on vector retrieval
  • Further topics for building RAG systems
  • An introduction to hybrid retrieval and RAG-Fusion
  • Local deployment of embedding models
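
The vector-retrieval items above boil down to one operation: rank stored chunks by similarity to the query embedding. A minimal sketch, assuming cosine similarity and made-up 4-dimensional vectors standing in for real embedding-model outputs:

```python
# Core of vector retrieval in a RAG pipeline: embed the query, score every
# stored chunk by cosine similarity, return the best match. The embeddings
# here are invented toy vectors, not outputs of a real model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

chunks = {
    "refund policy": [0.9, 0.1, 0.0, 0.2],
    "shipping times": [0.1, 0.8, 0.3, 0.0],
    "privacy notice": [0.0, 0.2, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1, 0.1]  # pretend this is embed("how do refunds work?")

top = max(chunks, key=lambda name: cosine(query, chunks[name]))
print(top)  # → refund policy
```

A vector database does exactly this at scale, with approximate-nearest-neighbor indexes replacing the brute-force `max` over all chunks.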

Stage 3 (30 days): Model Training

Congratulations: if you make it this far, you can basically land a job in large-model AI, and you can train GPT-style models yourself! Through fine-tuning you can train your own vertical large model, independently train open-source multimodal large models, and master more of the technical toolbox.

By this point roughly two months have passed and you have become an "AI whiz." Ready to keep exploring?

  • What is a model?
  • What is model training?
  • A brief look at solvers & loss functions
  • Mini-experiment 2: hand-write a simple neural network and train it
  • What are training / pre-training / fine-tuning / parameter-efficient fine-tuning?
  • A brief look at the Transformer architecture
  • Parameter-efficient fine-tuning
  • Building experimental datasets
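
The "hand-write a simple neural network and train it" item above fits in a dozen lines of plain Python: a single neuron fitted to y = 2x by stochastic gradient descent. The data, learning rate, and epoch count are arbitrary choices for illustration:

```python
# Smallest possible "neural network": one neuron (w, b) trained with SGD on
# a squared-error loss to fit y = 2x. No framework; gradients by hand.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, b, lr = 0.0, 0.0, 0.05

for _ in range(200):                 # 200 passes over the toy dataset
    for x, y in data:
        pred = w * x + b
        err = pred - y               # d(0.5 * err^2)/d(pred)
        w -= lr * err * x            # chain rule: d(pred)/dw = x
        b -= lr * err                # chain rule: d(pred)/db = 1

print(round(w, 2), round(b, 2))      # w approaches 2, b approaches 0
```

Real training replaces the hand-derived gradients with automatic differentiation and the single neuron with millions of parameters, but the loop structure (forward pass, loss, gradient, update) is the same.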

Stage 4 (20 days): Closing the Business Loop

Build a working view of large models worldwide in terms of performance, throughput, and cost; learn to deploy large models in the cloud, locally, and in other environments; find a project or startup direction that suits you; and become a product manager armed with AI.

  • Hardware selection
  • A tour of large models around the world
  • Using Chinese large-model services
  • Setting up an OpenAI proxy
  • Warm-up: deploying Stable Diffusion on Alibaba Cloud PAI
  • Running large models on a local machine
  • Private deployment of large models
  • Serving large models with vLLM
  • Case study: an elegant private deployment of an open-source large model on Alibaba Cloud
  • Deploying an open-source LLM project end to end
  • Content safety
  • Algorithm filing for internet information services (a Chinese regulatory requirement)

Learning is a process, and wherever there is learning there are challenges. Hard work pays off: the more effort you put in, the better the version of yourself you become.

If you can finish every task within 15 days, you are a prodigy. If you can complete 60-70% of the material, though, you are already showing the right traits of a large-model AI engineer.

This complete set of large-model AI learning materials has been uploaded to CSDN; if you need it, scan the official CSDN QR code below with WeChat to get it for free (100% free, guaranteed).
