
950 Basic Matmul TLA Example Readme

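catlass is the CANN operator template library. It provides high-performance matrix multiplication and related fused operator template samples for NPUs. Project address: https://gitcode.com/cann/catlass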

Note: The community package does not currently support 950 capabilities. Stay tuned for a future supported version.

Code Organization

├── 43_ascend950_basic_matmul
│   ├── CMakeLists.txt         # CMake build file
│   ├── README.md
│   └── basic_matmul_tla.cpp   # Main file

Usage Example

  • After obtaining the code, build the corresponding operator executable. See Template Library Quick Start. Because this case is a 950 operator, `-DCATLASS_ARCH=3510` must be added during the build.
  • Run the operator.

# Build the specified case
bash scripts/build.sh 43_ascend950_basic_matmul -DCATLASS_ARCH=3510
cd output/bin
# Executable file name | matrix m axis | n axis | k axis | Device ID
# Device ID is optional and defaults to 0
./43_ascend950_basic_matmul 256 512 1024 0

The execution result is as follows, indicating that the precision comparison succeeds.

Compare success.

Usage Notes

The DispatchPolicy `MmadPingpong`, which `BasicMatmul` uses by default, supports the following template parameters:

| Template Parameter | Default Value | Parameter Description |
| --- | --- | --- |
| `ArchTag` | None | Specifies the architecture model |
| `enableUnitFlag` | false | Specifies whether to enable UnitFlag. It must be set to false when L0C multi-buffering is enabled |
| `useHF32` | false | Specifies whether to enable HF32. Only the float type is supported |
| `l0CStages` | 1 | Specifies the number of L0C buffers. Set it to 2 to enable L0C double buffering |
| `enableL1Resident` | false | Specifies whether to enable L1 residency |
| `l1AStages` | 2 | Number of buffers for loading matrix A on L1 |
| `l1BStages` | 2 | Number of buffers for loading matrix B on L1 |
| `l0AStages` | 2 | Number of buffers for loading matrix A on L0 |
| `l0BStages` | 2 | Number of buffers for loading matrix B on L0 |
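The constraint between `enableUnitFlag` and `l0CStages` in the table can be checked at compile time. The following is a minimal sketch under stated assumptions: `DemoArchTag` and `DemoMmadPingpong` are hypothetical stand-ins written for this illustration, not the catlass definitions, and the parameter order simply follows the table rather than the library source.

```cpp
#include <cstdint>

// Placeholder architecture tag (hypothetical, not a catlass type).
struct DemoArchTag {};

// Illustrative mirror of the parameters listed in the table.
// Name, parameter order, and member names are assumptions.
template <class ArchTag,
          bool enableUnitFlag = false,
          bool useHF32 = false,
          std::uint32_t l0CStages = 1,
          bool enableL1Resident = false,
          std::uint32_t l1AStages = 2,
          std::uint32_t l1BStages = 2,
          std::uint32_t l0AStages = 2,
          std::uint32_t l0BStages = 2>
struct DemoMmadPingpong {
    // Rule from the table: UnitFlag must be off when L0C is multi-buffered.
    static_assert(!(enableUnitFlag && l0CStages > 1),
                  "enableUnitFlag must be false when l0CStages > 1");

    static constexpr bool ENABLE_UNIT_FLAG = enableUnitFlag;
    static constexpr bool USE_HF32 = useHF32;
    static constexpr std::uint32_t L0C_STAGES = l0CStages;
    static constexpr bool ENABLE_L1_RESIDENT = enableL1Resident;
    static constexpr std::uint32_t L1A_STAGES = l1AStages;
    static constexpr std::uint32_t L1B_STAGES = l1BStages;
    static constexpr std::uint32_t L0A_STAGES = l0AStages;
    static constexpr std::uint32_t L0B_STAGES = l0BStages;
};

// All defaults: single L0C buffer, UnitFlag off.
using DefaultPolicy = DemoMmadPingpong<DemoArchTag>;

// L0C double buffering: l0CStages = 2 with enableUnitFlag left at false.
using DoubleBufferedL0C = DemoMmadPingpong<DemoArchTag, false, false, 2>;
```

In this sketch, instantiating `DemoMmadPingpong<DemoArchTag, true, false, 2>` would fail to compile, which is one way to surface the documented restriction early.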

Assume the matrix shape is `M N K`, the tile size on L1 is `m1 n1 k1`, the number of tiles in the M direction is `mTiles = CeilDiv(M, m1)`, the number of tiles in the N direction is `nTiles = CeilDiv(N, n1)`, and the total number of tasks is `taskBlocks = mTiles * nTiles`. `enableL1Resident` can be enabled in the following two cases:

  1. `mTiles = 1`, `nTiles > CoreNum`, and `K < 2 * k1`. In this case, `l0CStages=2` can also be set (`enableUnitFlag` must be disabled). If there is not enough space to set `l0CStages=2`, set `n1` to half of its original value.

  2. `nTiles = 1`, `mTiles > CoreNum`, and `K < 2 * k1`. In this case, `l0CStages=2` can also be set (`enableUnitFlag` must be disabled). If there is not enough space to set `l0CStages=2`, set `m1` to half of its original value.
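The two cases above can be expressed as a small host-side check. This is a sketch only: `Shape` and `CanEnableL1Resident` are hypothetical helpers introduced for illustration, and the core count is passed in as a plain argument rather than queried from a device.

```cpp
#include <cstdint>

// Ceiling division as used in the text, e.g. CeilDiv(M, m1).
// Precondition: b > 0.
constexpr std::uint32_t CeilDiv(std::uint32_t a, std::uint32_t b) {
    return (a + b - 1) / b;
}

// Holds either a problem shape (M, N, K) or an L1 tile (m1, n1, k1).
struct Shape {
    std::uint32_t m;
    std::uint32_t n;
    std::uint32_t k;
};

// Hypothetical helper: returns true when one of the two documented
// conditions for enableL1Resident holds.
constexpr bool CanEnableL1Resident(Shape problem, Shape l1Tile,
                                   std::uint32_t coreNum) {
    const std::uint32_t mTiles = CeilDiv(problem.m, l1Tile.m);
    const std::uint32_t nTiles = CeilDiv(problem.n, l1Tile.n);
    const bool shortK = problem.k < 2 * l1Tile.k;

    // Case 1: a single tile along M and more N tiles than cores.
    const bool case1 = (mTiles == 1) && (nTiles > coreNum) && shortK;
    // Case 2: a single tile along N and more M tiles than cores.
    const bool case2 = (nTiles == 1) && (mTiles > coreNum) && shortK;
    return case1 || case2;
}
```

For example, with an assumed L1 tile of 128 x 256 x 256 and 24 cores, the shape 128 x 8192 x 256 gives `mTiles = 1`, `nTiles = 32`, and `K < 2 * k1`, so the check passes. The shape 256 x 512 x 1024 from the usage example gives `mTiles = 2` and `nTiles = 2`, so it does not. The tile size and core count here are illustrative values, not figures from the source.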

`BasicMatmul` also supports the DispatchPolicy `MmadPreloadAsyncWithCallback`, which supports the following template parameters:

| Template Parameter | Default Value | Parameter Description |
| --- | --- | --- |
| `ArchTag` | None | Specifies the architecture model |
| `preloadStages` | None | Specifies the number of preloads |
| `l1AStages` | 2 | Number of buffers for loading matrix A on L1 |
| `l1BStages` | 2 | Number of buffers for loading matrix B on L1 |
| `l0AStages` | 2 | Number of buffers for loading matrix A on L0 |
| `l0BStages` | 2 | Number of buffers for loading matrix B on L0 |
| `l0CStages` | 1 | Specifies the number of L0C buffers. Set it to 2 to enable L0C double buffering |
| `enableUnitFlag` | false | Specifies whether to enable UnitFlag. It must be set to false when L0C multi-buffering is enabled |
| `enableShuffleK` | false | Specifies whether to enable K-direction staggered reading |
| `useHF32` | false | Specifies whether to enable HF32. Only the float type is supported |
| `enableL1Resident` | false | Specifies whether to enable L1 residency |

Compared with `MmadPingpong`, `MmadPreloadAsyncWithCallback` has two more template parameters. The first is `preloadStages`, which specifies the number of preloads and is usually set to 1. When it is set to 1, the first loop only loads data and performs no Matmul computation. The second loop first loads its own data and then completes the Matmul computation of the previous loop, and so on. After the final loop ends, one additional Matmul computation is performed. The benefit is that the data required for the current Matmul computation was already moved in during the previous loop, so instructions can be issued earlier, which reduces the performance loss caused by instruction issue latency.
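The ordering described above can be traced with a small host-side simulation. This is a sketch of the scheduling idea only, not the catlass kernel code: `PreloadSchedule` is a hypothetical function, and the strings `load i` and `mmad i` are just labels for the data movement and Matmul computation of loop `i`.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Records the order of load and compute steps for `loops` iterations
// with a preload depth of `preloadStages`.
std::vector<std::string> PreloadSchedule(std::uint32_t loops,
                                         std::uint32_t preloadStages) {
    std::vector<std::string> events;
    for (std::uint32_t i = 0; i < loops; ++i) {
        // Every loop first moves in its own data.
        events.push_back("load " + std::to_string(i));
        // Compute lags behind by preloadStages loops.
        if (i >= preloadStages) {
            events.push_back("mmad " + std::to_string(i - preloadStages));
        }
    }
    // Drain: compute the loops whose data is loaded but not yet consumed.
    const std::uint32_t firstPending =
        loops > preloadStages ? loops - preloadStages : 0;
    for (std::uint32_t i = firstPending; i < loops; ++i) {
        events.push_back("mmad " + std::to_string(i));
    }
    return events;
}
```

With three loops and `preloadStages = 1`, the recorded order is `load 0`, `load 1`, `mmad 0`, `load 2`, `mmad 1`, `mmad 2`: the first loop only loads, each later loop loads before computing the previous one, and one extra computation runs after the last loop.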

The second parameter is `enableShuffleK`. This parameter is mainly used to avoid bandwidth loss caused by same-address access conflicts. The main principle is to stagger the data read addresses of each core. This parameter does not need to be enabled on 950.
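One common way to stagger reads is to have each core start its K loop at a different tile and wrap around. The source does not state the exact scheme, so the rotation below is an assumption made for illustration, and `ShuffledKOrder` is a hypothetical helper rather than a catlass function.

```cpp
#include <cstdint>
#include <vector>

// Returns the order in which a core would visit the K tiles when its
// starting tile is rotated by its core index (assumed scheme).
std::vector<std::uint32_t> ShuffledKOrder(std::uint32_t coreIdx,
                                          std::uint32_t kTiles) {
    std::vector<std::uint32_t> order;
    order.reserve(kTiles);
    for (std::uint32_t i = 0; i < kTiles; ++i) {
        // Core c starts at tile (c mod kTiles) and wraps around, so
        // neighboring cores read different K tiles at the same step.
        order.push_back((coreIdx + i) % kTiles);
    }
    return order;
}
```

Under this assumed scheme, with four K tiles, core 0 visits tiles 0, 1, 2, 3 while core 1 visits 1, 2, 3, 0. Every core still covers all K tiles, so the accumulated result is the same set of partial products in a different order.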

Compared with `MmadPingpong`, `MmadPreloadAsyncWithCallback` has more optimization points, but its logic is also more complex and it has higher Scalar overhead. Whether to use it depends on the scenario, and this should be weighed carefully for small shapes in particular.
