
CANN/catlass GEMM Kernel Development Explained

GEMM Kernel Code Explained

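[Free download link] catlass: This project is CANN's operator template library. It provides template samples for high-performance matrix multiplication and related fused operators on NPUs. Project address: https://gitcode.com/cann/catlass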

1. Kernel Code Structure Overview

The GEMM Kernel in the CATLASS template library adopts a highly modular design. It assembles different components through template parameters to implement various matrix multiplication operations. This document uses BasicMatmul as an example to break down the core structure and key components of the Kernel code.

2. Template Assembly Mechanism

All GEMM Kernels are defined as class templates, which assemble different functional components through template parameters. Take BasicMatmul as an example:

    template <
        class BlockMmad_,
        class BlockEpilogue_,
        class BlockScheduler_
    >
    class BasicMatmul {
    public:
        using BlockMmad = BlockMmad_;
        using ArchTag = typename BlockMmad::ArchTag;
        using L1TileShape = typename BlockMmad::L1TileShape;
        using ElementA = typename BlockMmad::ElementA;
        using LayoutA = typename BlockMmad::LayoutA;
        using ElementB = typename BlockMmad::ElementB;
        using LayoutB = typename BlockMmad::LayoutB;
        using ElementC = typename BlockMmad::ElementC;
        using LayoutC = typename BlockMmad::LayoutC;
        using ElementAccumulator = typename BlockMmad::ElementAccumulator;

        using BlockScheduler = BlockScheduler_;

        // ...
    };

2.1 Core Template Parameters

Parameter       | Description
BlockMmad_      | The core computation component, responsible for the matrix multiplication itself
BlockEpilogue_  | Post-processes the computation results (e.g., activation functions, quantization)
BlockScheduler_ | Schedules and distributes computational tasks across compute cores

2.2 Type Export

The types that BasicMatmul re-exports from its template parameters form the Kernel's core type system, which includes:

  • Architecture tag (ArchTag)
  • L1 cache tile shape (L1TileShape)
  • Data types (ElementA/B/C/Accumulator)
  • Data layouts (LayoutA/B/C)
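The type-export pattern can be illustrated with a minimal sketch. Every name below (`MockArch`, `RowMajor`, `MockTileShape`, `MockBlockMmad`, `KernelSketch`) is a hypothetical stand-in rather than a real CATLASS type, and the tile sizes are arbitrary. The point is only that the kernel forwards whatever its block-level component publishes.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins; none of these are real CATLASS types.
struct MockArch {};
struct RowMajor {};

template <uint32_t M_, uint32_t N_, uint32_t K_>
struct MockTileShape {
    static constexpr uint32_t M = M_;
    static constexpr uint32_t N = N_;
    static constexpr uint32_t K = K_;
};

// A mock block-level component that publishes its configuration as member types.
struct MockBlockMmad {
    using ArchTag = MockArch;
    using L1TileShape = MockTileShape<128, 256, 64>;  // arbitrary example sizes
    using ElementA = int8_t;
    using LayoutA = RowMajor;
    using ElementAccumulator = int32_t;
};

// The kernel re-exports the types of whichever component it is assembled with.
template <class BlockMmad_>
struct KernelSketch {
    using BlockMmad = BlockMmad_;
    using ArchTag = typename BlockMmad::ArchTag;
    using L1TileShape = typename BlockMmad::L1TileShape;
    using ElementA = typename BlockMmad::ElementA;
    using LayoutA = typename BlockMmad::LayoutA;
    using ElementAccumulator = typename BlockMmad::ElementAccumulator;
};

using Kernel = KernelSketch<MockBlockMmad>;

// Compile-time checks that the exported types match the component's types.
static_assert(std::is_same<Kernel::ElementA, int8_t>::value, "ElementA is forwarded");
static_assert(Kernel::L1TileShape::M == 128, "tile shape is forwarded");
```

Swapping `MockBlockMmad` for a component with different element types or tile sizes changes every exported alias without touching `KernelSketch` itself.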

3. Parameter Passing Mechanism

The Kernel uses a two-layer parameter structure: Arguments (user interface layer) and Params (kernel execution layer).

3.1 Arguments

Arguments is the parameter structure that users fill in directly. It contains only the basic input and output information:

    struct Arguments {
        GemmCoord problemShape;
        GM_ADDR ptrA;
        GM_ADDR ptrB;
        GM_ADDR ptrC;
    };

3.2 Params

Params is the parameter structure used during actual kernel execution. In addition to the fields of Arguments, it carries a layout for each matrix:

    struct Params {
        // Data members
        GemmCoord problemShape;
        GM_ADDR ptrA;
        LayoutA layoutA;
        GM_ADDR ptrB;
        LayoutB layoutB;
        GM_ADDR ptrC;
        LayoutC layoutC;

        // Methods
        CATLASS_HOST_DEVICE
        Params() {}

        CATLASS_HOST_DEVICE
        Params(GemmCoord const &problemShape_,
               GM_ADDR ptrA_, LayoutA layoutA_,
               GM_ADDR ptrB_, LayoutB layoutB_,
               GM_ADDR ptrC_, LayoutC layoutC_)
            : problemShape(problemShape_),
              ptrA(ptrA_), layoutA(layoutA_),
              ptrB(ptrB_), layoutB(layoutB_),
              ptrC(ptrC_), layoutC(layoutC_) {}
    };

3.3 Parameter Conversion

The ToUnderlyingArguments function converts Arguments to Params. It derives the three layouts from the problem shape: A is m x k, B is k x n, and C is m x n.

    static Params ToUnderlyingArguments(const Arguments &args, uint8_t *workspace)
    {
        LayoutA layoutA{args.problemShape.m(), args.problemShape.k()};
        LayoutB layoutB{args.problemShape.k(), args.problemShape.n()};
        LayoutC layoutC{args.problemShape.m(), args.problemShape.n()};
        Params params{args.problemShape,
                      args.ptrA, layoutA,
                      args.ptrB, layoutB,
                      args.ptrC, layoutC};
        return params;
    }
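The conversion above can be mimicked on the host with a minimal sketch. The types `ShapeSketch`, `RowMajorSketch`, `ArgumentsSketch`, `ParamsSketch` and the function `ToParamsSketch` are hypothetical stand-ins, and the layout is assumed to be dense row-major with a two-argument `GetOffset`; the real CATLASS layouts may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for GemmCoord.
struct ShapeSketch {
    uint32_t m, n, k;
};

// Hypothetical dense row-major layout (assumption: row stride equals cols).
struct RowMajorSketch {
    uint32_t rows;
    uint32_t cols;

    // Linear element offset of the logical coordinate (row, col).
    int64_t GetOffset(uint32_t row, uint32_t col) const {
        return static_cast<int64_t>(row) * cols + col;
    }
};

// User-facing parameters: shape and raw pointers only.
struct ArgumentsSketch {
    ShapeSketch problemShape;
    const float *ptrA;
    const float *ptrB;
    float *ptrC;
};

// Execution parameters: the same data plus one layout per matrix.
struct ParamsSketch {
    ShapeSketch problemShape;
    const float *ptrA;
    RowMajorSketch layoutA;
    const float *ptrB;
    RowMajorSketch layoutB;
    float *ptrC;
    RowMajorSketch layoutC;
};

// Derives the layouts from the problem shape: A is m x k, B is k x n, C is m x n.
inline ParamsSketch ToParamsSketch(const ArgumentsSketch &args) {
    RowMajorSketch layoutA{args.problemShape.m, args.problemShape.k};
    RowMajorSketch layoutB{args.problemShape.k, args.problemShape.n};
    RowMajorSketch layoutC{args.problemShape.m, args.problemShape.n};
    return ParamsSketch{args.problemShape,
                        args.ptrA, layoutA,
                        args.ptrB, layoutB,
                        args.ptrC, layoutC};
}
```

Keeping the layouts out of the user-facing structure means callers only supply what they already know, while the kernel receives everything it needs to compute offsets.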

4. Key Functions

4.1 CanImplement

Reports whether the Kernel can handle the given arguments on the current hardware and environment. BasicMatmul accepts every input:

    static bool CanImplement(const Arguments &args)
    {
        return true;
    }

4.2 GetWorkspaceSize

Returns the workspace size, in bytes, required for Kernel execution. BasicMatmul needs no workspace:

    static size_t GetWorkspaceSize(const Arguments &args)
    {
        return 0;
    }
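One plausible way a host-side caller could chain these hooks is sketched below. `MockKernel` and `PrepareKernel` are hypothetical, the checks and workspace size inside `MockKernel` are invented for illustration, and a `std::vector` stands in for a device allocation. This is not the CATLASS host adapter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical kernel exposing the same three static hooks as BasicMatmul.
struct MockKernel {
    struct Arguments {
        uint32_t m, n, k;
    };
    struct Params {
        uint32_t m, n, k;
        uint8_t *workspace;
    };

    static bool CanImplement(const Arguments &args) {
        // Illustrative check only: reject empty problems.
        return args.m > 0 && args.n > 0 && args.k > 0;
    }

    static std::size_t GetWorkspaceSize(const Arguments &args) {
        // Illustrative only: pretend one float of scratch per output element.
        return static_cast<std::size_t>(args.m) * args.n * sizeof(float);
    }

    static Params ToUnderlyingArguments(const Arguments &args, uint8_t *workspace) {
        return Params{args.m, args.n, args.k, workspace};
    }
};

// Generic host-side preparation: validate, size the workspace, then convert.
template <class Kernel>
bool PrepareKernel(const typename Kernel::Arguments &args,
                   std::vector<uint8_t> &workspace,
                   typename Kernel::Params &params) {
    if (!Kernel::CanImplement(args)) {
        return false;  // unsupported configuration: params is left untouched
    }
    workspace.resize(Kernel::GetWorkspaceSize(args));
    params = Kernel::ToUnderlyingArguments(args, workspace.data());
    return true;
}
```

Because the three hooks are static and share one signature convention, the same preparation code works for any kernel that provides them.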

4.3 operator()

This is the Kernel's core execution function. It is specialized per core type (such as AIC and AIV) through the CORE_TYPE template parameter. The AIC specialization is shown here:

    template <int32_t CORE_TYPE = g_coreType>
    CATLASS_DEVICE
    void operator()(Params const &params);

    /// Executes one Matmul
    template <>
    CATLASS_DEVICE
    void operator()<AscendC::AIC>(Params const &params)
    {
        BlockScheduler matmulBlockScheduler(params.problemShape, MakeCoord(L1TileShape::M, L1TileShape::N));
        uint32_t coreLoops = matmulBlockScheduler.GetCoreLoops();

        Arch::Resource<ArchTag> resource;
        BlockMmad blockMmad(resource);

        // Represent the full gm
        AscendC::GlobalTensor<ElementA> gmA;
        gmA.SetGlobalBuffer((__gm__ ElementA *)params.ptrA);
        AscendC::GlobalTensor<ElementB> gmB;
        gmB.SetGlobalBuffer((__gm__ ElementB *)params.ptrB);
        AscendC::GlobalTensor<ElementC> gmC;
        gmC.SetGlobalBuffer((__gm__ ElementC *)params.ptrC);

        for (uint32_t loopIdx = AscendC::GetBlockIdx(); loopIdx < coreLoops; loopIdx += AscendC::GetBlockNum()) {
            // Compute block location
            GemmCoord blockCoord = matmulBlockScheduler.GetBlockCoord(loopIdx);
            GemmCoord actualBlockShape = matmulBlockScheduler.GetActualBlockShape(blockCoord);

            // Compute initial location in logical coordinates
            MatrixCoord offsetA{blockCoord.m() * L1TileShape::M, blockCoord.k() * L1TileShape::K};
            MatrixCoord offsetB{blockCoord.k() * L1TileShape::K, blockCoord.n() * L1TileShape::N};
            MatrixCoord offsetC{blockCoord.m() * L1TileShape::M, blockCoord.n() * L1TileShape::N};
            int64_t gmOffsetA = params.layoutA.GetOffset(offsetA);
            int64_t gmOffsetB = params.layoutB.GetOffset(offsetB);
            int64_t gmOffsetC = params.layoutC.GetOffset(offsetC);

            // Compute block-scoped matrix multiply-add
            blockMmad(gmA[gmOffsetA], params.layoutA,
                      gmB[gmOffsetB], params.layoutB,
                      gmC[gmOffsetC], params.layoutC,
                      actualBlockShape);
        }

        AscendC::PipeBarrier<PIPE_ALL>();
    }
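The idea of selecting a code path by core type at compile time can be tried on the host with a minimal sketch. It uses tag dispatch on `std::integral_constant` instead of the member-template specialization shown above, so that it compiles with ordinary host compilers. `CORE_AIC`, `CORE_AIV` and `DispatchSketch` are hypothetical names with arbitrary values, and the empty vector-core path is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical core-type constants; the real values are defined by AscendC.
constexpr int32_t CORE_AIC = 1;  // cube core
constexpr int32_t CORE_AIV = 2;  // vector core

struct DispatchSketch {
    int cubeCalls = 0;
    int vectorCalls = 0;

    // Public entry point: the core type is a compile-time template argument,
    // so only the matching overload is instantiated for each call site.
    template <int32_t CORE_TYPE>
    void Run() {
        RunImpl(std::integral_constant<int32_t, CORE_TYPE>{});
    }

private:
    // Cube path: this is where the matrix multiplication would run.
    void RunImpl(std::integral_constant<int32_t, CORE_AIC>) { ++cubeCalls; }

    // Vector path: left empty in this sketch (assumption), so nothing is counted.
    void RunImpl(std::integral_constant<int32_t, CORE_AIV>) {}
};
```

Calling `Run` with a core type that has no matching `RunImpl` overload fails at compile time, which mirrors the effect of declaring the primary template without a definition.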

5. Execution Flow Analysis

The Kernel's execution flow consists of the following steps:

5.1 Initializing the Scheduler

    BlockScheduler matmulBlockScheduler(params.problemShape, MakeCoord(L1TileShape::M, L1TileShape::N));
    uint32_t coreLoops = matmulBlockScheduler.GetCoreLoops();

5.2 Initializing Resources and Compute Components

    Arch::Resource<ArchTag> resource;
    BlockMmad blockMmad(resource);

5.3 Setting Global Memory Tensors

    AscendC::GlobalTensor<ElementA> gmA;
    gmA.SetGlobalBuffer((__gm__ ElementA *)params.ptrA);
    // Set gmB and gmC...

5.4 Looping Through Each Compute Block

    for (uint32_t loopIdx = AscendC::GetBlockIdx(); loopIdx < coreLoops; loopIdx += AscendC::GetBlockNum()) {
        // 1. Compute block coordinates.
        GemmCoord blockCoord = matmulBlockScheduler.GetBlockCoord(loopIdx);
        GemmCoord actualBlockShape = matmulBlockScheduler.GetActualBlockShape(blockCoord);

        // 2. Compute memory offsets.
        MatrixCoord offsetA{blockCoord.m() * L1TileShape::M, blockCoord.k() * L1TileShape::K};
        // Compute offsetB and offsetC...
        int64_t gmOffsetA = params.layoutA.GetOffset(offsetA);
        // Compute gmOffsetB and gmOffsetC...

        // 3. Execute block-level matrix multiplication.
        blockMmad(gmA[gmOffsetA], params.layoutA,
                  gmB[gmOffsetB], params.layoutB,
                  gmC[gmOffsetC], params.layoutC,
                  actualBlockShape);
    }
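The loop starts at the core's own index and strides by the number of cores, so blocks are dealt out round-robin. The sketch below reproduces that arithmetic on the host. `CeilDiv`, `BlockSketch` and `BlocksForCore` are hypothetical helpers, and the row-major ordering of tiles is an assumption; the real BlockScheduler may traverse tiles in a different order.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: integer division rounded up.
inline uint32_t CeilDiv(uint32_t a, uint32_t b) {
    return (a + b - 1) / b;
}

struct BlockSketch {
    uint32_t mIdx, nIdx;  // block coordinates in the tile grid
    uint32_t rows, cols;  // actual block shape (smaller on tail blocks)
};

// Lists the blocks handled by core `blockIdx` out of `blockNum` cores,
// assuming tiles are numbered in row-major order.
inline std::vector<BlockSketch> BlocksForCore(uint32_t m, uint32_t n,
                                              uint32_t tileM, uint32_t tileN,
                                              uint32_t blockIdx, uint32_t blockNum) {
    const uint32_t mTiles = CeilDiv(m, tileM);
    const uint32_t nTiles = CeilDiv(n, tileN);
    const uint32_t coreLoops = mTiles * nTiles;  // total number of output tiles

    std::vector<BlockSketch> blocks;
    // Same loop shape as the kernel: start at this core's index, step by core count.
    for (uint32_t loopIdx = blockIdx; loopIdx < coreLoops; loopIdx += blockNum) {
        const uint32_t mIdx = loopIdx / nTiles;
        const uint32_t nIdx = loopIdx % nTiles;
        // Tiles on the last row or column may be partial.
        const uint32_t rows = (mIdx == mTiles - 1) ? m - mIdx * tileM : tileM;
        const uint32_t cols = (nIdx == nTiles - 1) ? n - nIdx * tileN : tileN;
        blocks.push_back(BlockSketch{mIdx, nIdx, rows, cols});
    }
    return blocks;
}
```

For a 300 x 500 output with 128 x 256 tiles there are 3 x 2 = 6 tiles. With 4 cores, core 1 handles tiles 1 and 5, both of which are 244 columns wide because they sit on the right edge.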

5.5 Synchronization

AscendC::PipeBarrier<PIPE_ALL>();

6. Extensions and Differences Among Kernels

Comparing BasicMatmul, BatchedMatmul, QuantMatmul, and OptimizedMatmul shows what they share in the base structure and where each one extends it:

6.1 BatchedMatmul Extension

BatchedMatmul adds batch processing support to BasicMatmul:

    struct Params {
        // Data members
        uint32_t batchCount;  // Added batch count
        GemmCoord problemShape;
        GM_ADDR ptrA;
        LayoutA layoutA;
        int64_t strideA;      // Added batch stride for matrix A
        GM_ADDR ptrB;
        LayoutB layoutB;
        int64_t strideB;      // Added batch stride for matrix B
        GM_ADDR ptrC;
        LayoutC layoutC;
        int64_t strideC;      // Added batch stride for matrix C
        // ...
    };
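How a batch stride selects one matrix out of a batch can be shown with a host-side reference sketch. `BatchedMatmulReference` is a hypothetical function; it assumes dense row-major storage and strides measured in elements, which may not match how the CATLASS kernel interprets them.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reference batched matmul on the host: C[b] = A[b] * B[b] for each batch b.
// Assumptions: row-major storage; strides count elements between the start
// of consecutive matrices in the batch.
inline void BatchedMatmulReference(uint32_t batchCount,
                                   uint32_t m, uint32_t n, uint32_t k,
                                   const float *a, int64_t strideA,
                                   const float *b, int64_t strideB,
                                   float *c, int64_t strideC) {
    for (uint32_t batch = 0; batch < batchCount; ++batch) {
        // The batch stride moves each base pointer to this batch's matrix.
        const float *aBatch = a + batch * strideA;
        const float *bBatch = b + batch * strideB;
        float *cBatch = c + batch * strideC;

        for (uint32_t i = 0; i < m; ++i) {
            for (uint32_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (uint32_t p = 0; p < k; ++p) {
                    acc += aBatch[i * k + p] * bBatch[p * n + j];
                }
                cBatch[i * n + j] = acc;
            }
        }
    }
}
```

With densely packed batches the strides are simply m * k, k * n, and m * n elements, but carrying them as separate fields also allows batches that are spaced further apart in memory.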

6.2 QuantMatmul Extension

QuantMatmul adds quantization-related parameters and processing:

    struct Params {
        // Data members
        GemmCoord problemShape;
        __gm__ ElementA *ptrA;
        LayoutA layoutA;
        __gm__ ElementB *ptrB;
        LayoutB layoutB;
        __gm__ ElementScale *ptrScale;                  // Added scale parameters
        LayoutScale layoutScale;
        __gm__ ElementPerTokenScale *ptrPerTokenScale;  // Added per-token scale parameters
        LayoutPerTokenScale layoutPerTokenScale;
        __gm__ ElementD *ptrD;                          // Added output matrix D
        LayoutD layoutD;
        GM_ADDR ptrWorkspace;                           // Added workspace
        // ...
    };
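The source does not spell out how the two scale tensors are applied. A common convention, assumed here, is that an integer accumulator is multiplied by one scale per output column and one scale per output row (token). The sketch below implements that assumed formula on the host; `DequantizeSketch` is a hypothetical function, not the CATLASS epilogue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed dequantization: D[i][j] = acc[i][j] * scale[j] * perTokenScale[i].
// acc is an m x n row-major matrix of int32 accumulators,
// scale holds n per-column factors, perTokenScale holds m per-row factors.
inline std::vector<float> DequantizeSketch(const std::vector<int32_t> &acc,
                                           uint32_t m, uint32_t n,
                                           const std::vector<float> &scale,
                                           const std::vector<float> &perTokenScale) {
    std::vector<float> d(static_cast<std::size_t>(m) * n);
    for (uint32_t i = 0; i < m; ++i) {
        for (uint32_t j = 0; j < n; ++j) {
            const std::size_t idx = static_cast<std::size_t>(i) * n + j;
            d[idx] = static_cast<float>(acc[idx]) * scale[j] * perTokenScale[i];
        }
    }
    return d;
}
```

Under this assumption the matrix multiplication itself stays in integer arithmetic, and the floating-point output D is produced in a separate post-processing step, which would explain the extra output pointer and workspace in the structure above.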

6.3 OptimizedMatmul Extension

OptimizedMatmul adds prologue processing and a more complex parameter structure:

    template <
        class PrologueA,  // Added prologue for matrix A
        class PrologueB,  // Added prologue for matrix B
        class BlockMmad_,
        class BlockEpilogue_,
        class BlockScheduler_
    >
    class OptimizedMatmul {
        // ...
        template <bool IsPaddingA = true, bool IsPaddingB = true>
        struct KernelParams : public ParamsBase {
            // Added padding-related parameters
            GM_ADDR ptrWA;
            LayoutWA layoutWA;
            GM_ADDR ptrWB;
            LayoutWB layoutWB;
            // ...
        };
        // ...
    };
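The source only says that ptrWA and ptrWB are padding-related. One plausible reading, assumed here, is that a prologue copies each input into a workspace buffer whose row length is rounded up to an alignment boundary. The sketch below shows that idea on the host; `RoundUp` and `PadRowsSketch` are hypothetical helpers, and the alignment value in the example is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: rounds value up to the next multiple of align.
inline uint32_t RoundUp(uint32_t value, uint32_t align) {
    return (value + align - 1) / align * align;
}

// Copies a rows x cols row-major matrix into a buffer whose row length is
// padded to a multiple of `align`; the padding elements are zero-filled.
inline std::vector<float> PadRowsSketch(const std::vector<float> &src,
                                        uint32_t rows, uint32_t cols,
                                        uint32_t align) {
    const uint32_t paddedCols = RoundUp(cols, align);
    std::vector<float> dst(static_cast<std::size_t>(rows) * paddedCols, 0.0f);
    for (uint32_t r = 0; r < rows; ++r) {
        for (uint32_t c = 0; c < cols; ++c) {
            dst[static_cast<std::size_t>(r) * paddedCols + c] =
                src[static_cast<std::size_t>(r) * cols + c];
        }
    }
    return dst;
}
```

In this reading, the IsPaddingA and IsPaddingB flags would let the kernel skip the copy for an input that is already aligned.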

7. Summary

The CATLASS GEMM Kernel adopts a highly modular and template-based design with the following characteristics:

  1. Template assembly: Flexibly assembles different functional components through template parameters, enabling code reuse and function extension.
  2. Layered parameters: Uses Arguments and Params to separate the user interface from kernel execution parameters.
  3. Unified execution process: All Kernels follow a similar execution flow, including initialization, scheduling, computation, and synchronization.
  4. Scalability: By extending the base structure, developers can easily implement advanced features such as batch processing, quantization, and optimization.

This design allows the CATLASS template library to efficiently support a wide range of GEMM operations while ensuring code maintainability and extensibility.


Authoring statement: Parts of this article were generated with AI assistance (AIGC) and are for reference only.
