2025-02-27 11:58:26
DeepSeek Open Source Week: Summary and Reflections [Updated Through Day 3] - Zhihu
TAGs: processors, heterogeneous computing, hardware-software co-design, large models
Summary: This is a summary of a blog post reflecting on DeepSeek's Open Source Week and its open-source projects, which optimize AI performance on specific hardware. The author shares their feelings about the challenges of completing such work at foreign AI companies due to hardware restrictions. They highlight three projects, FlashMLA, DeepEP, and DeepGEMM, and their contributions to improving AI performance on limited hardware. The author stresses that achieving the best performance requires understanding both the AI model and the hardware, and notes the potential impact of DeepSeek's releases on the industry.
2025-06-16 18:53:03
Why LLM Inference Optimization Looks So Much Like an Operating System: Understanding the Design Philosophy Behind Page Attention and vLLM - Zhihu
TAGs: large models
Summary: This article discusses the similarities between the design philosophy of Page Attention in vLLM and operating-system page management. The author explains that LLM inference involves resource-scheduling problems much like those an operating system faces. vLLM's Page Attention mechanism draws on OS paging to address memory fragmentation and the efficiency of context-cache reuse during LLM inference. The author also highlights the benefits of PagedAttention, such as eliminating external fragmentation, reducing internal fragmentation, and supporting sharing and copying of the KV Cache between requests. The article further connects the OS concepts of lazy and on-demand allocation to the strategy vLLM adopts. The author concludes that by viewing a model as "a service system," one can draw inspiration from operating-system experience. The article is written by YixinMLSys.
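The OS analogy above maps onto a small amount of bookkeeping code. Below is a minimal Python sketch of the paged-KV-cache idea described in the article: fixed-size blocks, a per-request block table, on-demand allocation, and reference-counted sharing. The class and method names are illustrative assumptions for this note, not vLLM's actual API.

```python
# Minimal sketch of the paged KV-cache idea behind PagedAttention (not vLLM's
# actual implementation); class and method names here are illustrative only.

BLOCK_SIZE = 16  # tokens per KV-cache block, analogous to an OS page size


class BlockAllocator:
    """Hands out fixed-size cache blocks from a free list, with ref counts."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block: int) -> int:
        # Sharing a prefix between requests is just bumping a reference count.
        self.ref_count[block] += 1
        return block

    def free(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free_blocks.append(block)


class Sequence:
    """A request's logical KV cache: a block table mapping positions to blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Lazy, on-demand allocation: grab a new block only when the last one fills.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):           # generate 40 tokens
    seq.append_token()
print(seq.block_table)        # 3 blocks = ceil(40 / 16), no large contiguous reservation
```

Because a sequence only ever wastes part of its last block, internal fragmentation is bounded by one block per request, and bumping a block's reference count is all it takes to share a common prefix between requests.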
2025-02-28 18:20:02
deepseek-ai/DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
TAGs: large models
Summary: DeepGEMM is a library for clean and efficient FP8 General Matrix Multiplications (GEMMs) with fine-grained scaling, targeting Hopper-architecture GPUs and requiring CUDA 12.3 or above. It supports both normal and Mixture-of-Experts (MoE) grouped GEMMs, with optimizations such as warp specialization, use of TMA features, and a fully JIT design. The library also provides utility functions and environment variables.
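As a rough illustration of what "fine-grained scaling" means for an FP8 GEMM, here is a numpy sketch that quantizes inputs block-wise along the K dimension with one scale per block and applies each block's scales while accumulating partial products. It only simulates the FP8 cast by rounding and clamping to the e4m3 range, and the 128-element block size is an assumption made for this sketch; it is not DeepGEMM's Hopper/CUDA implementation.

```python
# Block-wise (fine-grained) scaling for a simulated FP8 GEMM, in numpy.
import numpy as np

FP8_E4M3_MAX = 448.0
BLOCK = 128  # scaling granularity along K (an assumption for this sketch)


def quantize_blockwise(x):
    """Quantize each 1 x BLOCK slice of x with its own scale (simulated FP8)."""
    m, k = x.shape
    assert k % BLOCK == 0
    xq = np.empty_like(x)
    scales = np.empty((m, k // BLOCK), dtype=np.float32)
    for j in range(k // BLOCK):
        blk = x[:, j * BLOCK:(j + 1) * BLOCK]
        s = np.maximum(np.abs(blk).max(axis=1, keepdims=True) / FP8_E4M3_MAX, 1e-12)
        scales[:, j] = s[:, 0]
        # round + clamp stands in for the cast to FP8 e4m3
        xq[:, j * BLOCK:(j + 1) * BLOCK] = np.clip(np.round(blk / s), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return xq, scales


def gemm_blockwise(aq, a_scales, bq, b_scales):
    """Accumulate per-K-block partial products, applying each block's scales."""
    m, k = aq.shape
    n = bq.shape[1]
    out = np.zeros((m, n), dtype=np.float32)
    for j in range(k // BLOCK):
        a_blk = aq[:, j * BLOCK:(j + 1) * BLOCK] * a_scales[:, j:j + 1]
        b_blk = bq[j * BLOCK:(j + 1) * BLOCK, :] * b_scales[j:j + 1, :]
        out += a_blk @ b_blk
    return out


a = np.random.randn(64, 256).astype(np.float32)
b = np.random.randn(256, 64).astype(np.float32)
aq, a_s = quantize_blockwise(a)
bq_t, b_s_t = quantize_blockwise(b.T)            # quantize B along its K dimension
out = gemm_blockwise(aq, a_s, bq_t.T, b_s_t.T)
print(np.abs(out - a @ b).max())                 # small error vs. the FP32 reference
```

Per-block scales keep the quantization error local to each block, which is the point of fine-grained scaling as opposed to one scale per whole tensor.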
2025-06-06 16:12:28
In Plain Language: What Is a Token?
TAGs: large models
Summary: This article explains the concept of a Token in the context of large models, specifically in natural language processing. A Token is the unit a model uses to understand and process human language; it can be a single character, a word, or even a punctuation mark. The length and definition of a Token depend on the tokenizer the model uses. In English, the spaces between words make tokenization easier, whereas in Chinese, where words have no inherent spaces, tokenization is more complex. Modern large models use subword tokenization, which breaks words down into smaller units such as prefixes, suffixes, or common combinations, to balance expressiveness and model efficiency. This lets the model understand word meanings through the relationships between subwords. For example, the word "unhappiness" can be broken down into "un," "happy," and "ness," so even if the model has never seen the word before, it can infer its meaning from the subwords. In summary, Tokens are essential for large models to understand and process human language, converting complex linguistic information into standardized units the model can handle.
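To make the subword idea concrete, here is a toy greedy longest-match tokenizer in Python. The vocabulary below is invented for this example; real models use vocabularies learned by algorithms such as BPE or WordPiece, so their actual splits will differ.

```python
# Toy greedy longest-match subword tokenizer, just to illustrate the idea in the
# article. The vocabulary is made up for this example; learned vocabularies
# (BPE, WordPiece, ...) produce different splits in practice.

VOCAB = {"un", "happi", "happy", "ness", "token", "iz", "ation"}


def tokenize(word):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):       # try the longest piece first
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])              # fall back to a single character
            i += 1
    return pieces


print(tokenize("unhappiness"))   # ['un', 'happi', 'ness']
print(tokenize("tokenization"))  # ['token', 'iz', 'ation']
```

Greedy longest-match is only one possible strategy; the point is that any word, even one the model has never seen, decomposes into pieces it already knows.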
2025-06-06 20:05:18
In Plain Language: What Is a Transformer?
TAGs: large models
Summary: The Transformer is the technology behind AI models such as ChatGPT, allowing machines to understand and process text about as effectively as humans do. It uses attention mechanisms to help the machine focus on the important information and relationships within a text, improving reading efficiency and resolving the long-distance dependency problem. The attention mechanism works by letting each word in a sentence "look back" at the other words and determine its relationship to them. This helps the model capture dependencies between words even when they are far apart in the sentence. The Transformer also uses self-attention and multi-head attention mechanisms to further strengthen these capabilities.
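The "look back at every other word" step is a scaled dot-product between queries and keys. Below is a minimal single-head self-attention sketch in numpy with random placeholder weights; a real Transformer learns these projections and runs many such heads in parallel (multi-head attention) inside each layer.

```python
# Minimal single-head scaled dot-product self-attention in numpy. The weights
# are random placeholders; a real Transformer learns W_q/W_k/W_v and stacks
# multi-head attention with feed-forward layers.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) token embeddings for one sentence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # each token emits a query, key, value
    d_k = q.shape[-1]
    # Every token's query is compared against every token's key: this is the
    # "look back at the other words" step, however far apart the words are.
    scores = q @ k.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)        # attention weights, rows sum to 1
    return weights @ v                        # weighted mix of the value vectors


rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8           # a 5-token toy "sentence"
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)                              # (5, 8): one context-aware vector per token
```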