news 2026/6/25 6:57:12

950 Basic Matmul TLA Example

张小明

Front-end Development Engineer


950 Basic Matmul TLA Example Readme

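catlass is the operator template library of CANN. It provides template samples for high-performance matrix multiplication and related fused operators on NPUs. Project address: https://gitcode.com/cann/catlass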

Note: The community package does not currently support 950 capabilities. Stay tuned for a future supported version.

Code Organization

├── 43_ascend950_basic_matmul
│   ├── CMakeLists.txt          # CMake build file
│   ├── README.md
│   └── basic_matmul_tla.cpp    # Main file

Usage Example

  • After obtaining the code, build the executable for this operator (see Template Library Quick Start). This case is a 950 operator, so -DCATLASS_ARCH=3510 must be added at build time.
  • Run the operator.

# Build the specified case
bash scripts/build.sh 43_ascend950_basic_matmul -DCATLASS_ARCH=3510
cd output/bin
# Executable file name | matrix m axis | n axis | k axis | Device ID
# Device ID is optional and defaults to 0
./43_ascend950_basic_matmul 256 512 1024 0

The execution result is as follows, indicating that the precision comparison succeeds.

Compare success.

Usage Notes

By default, BasicMatmul uses the DispatchPolicy MmadPingpong, which supports the following template parameters:

| Template Parameter | Default Value | Parameter Description |
| --- | --- | --- |
| ArchTag | None | Specifies the architecture model |
| enableUnitFlag | false | Specifies whether to enable UnitFlag. It must be set to false when L0C multi-buffering is enabled |
| useHF32 | false | Specifies whether to enable HF32. Only the float type is supported |
| l0CStages | 1 | Specifies the number of L0C buffers. Set it to 2 to enable L0C double buffering |
| enableL1Resident | false | Specifies whether to enable L1 residency |
| l1AStages | 2 | Number of buffers for loading matrix A on L1 |
| l1BStages | 2 | Number of buffers for loading matrix B on L1 |
| l0AStages | 2 | Number of buffers for loading matrix A on L0 |
| l0BStages | 2 | Number of buffers for loading matrix B on L0 |
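The constraint between enableUnitFlag and l0CStages in the table can be checked at compile time. The following is a minimal sketch, not the catlass definition of MmadPingpong: the names `PingpongPolicySketch`, `Ascend950TagSketch`, `DefaultPolicy`, and `DoubleBufferedL0C` are hypothetical, and the template parameter order simply follows the table rows, which is an assumption.

```cpp
#include <cstdint>

// Hypothetical architecture tag, used only so this sketch compiles on its own.
struct Ascend950TagSketch {};

// Hypothetical stand-in that mirrors the parameter table; NOT the catlass MmadPingpong.
// The parameter order follows the table rows, which is an assumption.
template <class ArchTag,
          bool ENABLE_UNIT_FLAG = false,
          bool USE_HF32 = false,
          std::uint32_t L0C_STAGES = 1,
          bool ENABLE_L1_RESIDENT = false,
          std::uint32_t L1A_STAGES = 2,
          std::uint32_t L1B_STAGES = 2,
          std::uint32_t L0A_STAGES = 2,
          std::uint32_t L0B_STAGES = 2>
struct PingpongPolicySketch {
    // Rule from the table: UnitFlag must be off when L0C multi-buffering is on.
    static_assert(!(ENABLE_UNIT_FLAG && L0C_STAGES > 1),
                  "enableUnitFlag must be false when L0C multi-buffering is enabled");

    static constexpr bool enableUnitFlag = ENABLE_UNIT_FLAG;
    static constexpr bool useHF32 = USE_HF32;
    static constexpr std::uint32_t l0CStages = L0C_STAGES;
    static constexpr bool enableL1Resident = ENABLE_L1_RESIDENT;
    static constexpr std::uint32_t l1AStages = L1A_STAGES;
    static constexpr std::uint32_t l1BStages = L1B_STAGES;
    static constexpr std::uint32_t l0AStages = L0A_STAGES;
    static constexpr std::uint32_t l0BStages = L0B_STAGES;
};

// All defaults from the table.
using DefaultPolicy = PingpongPolicySketch<Ascend950TagSketch>;

// L0C double buffering: l0CStages = 2 with enableUnitFlag left at false.
using DoubleBufferedL0C = PingpongPolicySketch<Ascend950TagSketch, false, false, 2>;
```

In this sketch, instantiating the policy with enableUnitFlag set to true and l0CStages set to 2 fails to compile, which surfaces the misconfiguration before the kernel is built.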

Assume the matrix shape is M N K, the tile size on L1 is m1 n1 k1, the number of tiles in the M direction is mTiles = CeilDiv(M, m1), the number of tiles in the N direction is nTiles = CeilDiv(N, n1), and the total number of tasks is taskBlocks = mTiles * nTiles. enableL1Resident can be enabled in the following two cases:

  1. mTiles = 1, nTiles > CoreNum, and K < 2 * k1. In this case, l0CStages=2 can also be set (enableUnitFlag must be disabled). If there is not enough space and l0CStages=2 cannot be set, set n1 to half of the original value.

  2. nTiles = 1, mTiles > CoreNum, and K < 2 * k1. In this case, l0CStages=2 can also be set (enableUnitFlag must be disabled). If there is not enough space and l0CStages=2 cannot be set, set m1 to half of the original value.
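The two cases above can be expressed as a small host-side check. This is a minimal sketch under stated assumptions: `ProblemShape`, `L1TileShape`, and `CanEnableL1Resident` are hypothetical names that are not part of catlass, and the core count is passed in by the caller.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; none of these names come from catlass.
constexpr std::uint32_t CeilDiv(std::uint32_t a, std::uint32_t b) {
    return (a + b - 1) / b;
}

struct ProblemShape {
    std::uint32_t m, n, k;
};

struct L1TileShape {
    std::uint32_t m1, n1, k1;
};

// Returns true when the shape matches one of the two cases in which
// enableL1Resident can be enabled:
//   case 1: mTiles == 1, nTiles > coreNum, K < 2 * k1
//   case 2: nTiles == 1, mTiles > coreNum, K < 2 * k1
constexpr bool CanEnableL1Resident(ProblemShape s, L1TileShape t, std::uint32_t coreNum) {
    return (s.k < 2 * t.k1) &&
           ((CeilDiv(s.m, t.m1) == 1 && CeilDiv(s.n, t.n1) > coreNum) ||
            (CeilDiv(s.n, t.n1) == 1 && CeilDiv(s.m, t.m1) > coreNum));
}
```

As an illustration, with an assumed core count of 24 and an assumed L1 tile of 128 x 256 x 256, a 128 x 8192 x 256 problem gives mTiles = 1 and nTiles = 32 with K < 512, so the check passes. The 256 x 512 x 1024 shape from the usage example gives mTiles = 2 and nTiles = 2, so it does not.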

BasicMatmul also supports the DispatchPolicy MmadPreloadAsyncWithCallback, which takes the following template parameters:

| Template Parameter | Default Value | Parameter Description |
| --- | --- | --- |
| ArchTag | None | Specifies the architecture model |
| preloadStages | None | Specifies the number of preloads |
| l1AStages | 2 | Number of buffers for loading matrix A on L1 |
| l1BStages | 2 | Number of buffers for loading matrix B on L1 |
| l0AStages | 2 | Number of buffers for loading matrix A on L0 |
| l0BStages | 2 | Number of buffers for loading matrix B on L0 |
| l0CStages | 1 | Specifies the number of L0C buffers. Set it to 2 to enable L0C double buffering |
| enableUnitFlag | false | Specifies whether to enable UnitFlag. It must be set to false when L0C multi-buffering is enabled |
| enableShuffleK | false | Specifies whether to enable K-direction staggered reading |
| useHF32 | false | Specifies whether to enable HF32. Only the float type is supported |
| enableL1Resident | false | Specifies whether to enable L1 residency |

Compared with MmadPingpong, MmadPreloadAsyncWithCallback has two more template parameters. The first is preloadStages, which specifies the number of preloads and is usually set to 1. When it is set to 1, the first loop only loads data and performs no Matmul computation. The second loop first loads its own data and then completes the Matmul computation of the previous loop, and so on. After the final loop ends, one additional Matmul computation is performed. The benefit is that the data required for the current Matmul computation was already moved in during the previous loop, so instructions can be issued earlier, which reduces the performance loss caused by instruction issue latency.
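The loop order described above can be modeled with a few index helpers. This is a minimal sketch of the schedule only, not catlass code: `TotalIterations`, `LoadedTile`, `ComputedTile`, and `PrintSchedule` are hypothetical names, and each "tile" stands for the data moved in one loop.

```cpp
#include <cstdio>

// Hypothetical helpers that model the loop order only; they are not catlass code.

// Total loop iterations: one per K tile, plus the drain iterations after the last load.
constexpr int TotalIterations(int kTiles, int preloadStages) {
    return kTiles + preloadStages;
}

// Index of the K tile loaded in iteration `iter`, or -1 when nothing is left to load.
constexpr int LoadedTile(int iter, int kTiles) {
    return (iter >= 0 && iter < kTiles) ? iter : -1;
}

// Index of the K tile whose Matmul is issued in iteration `iter`,
// or -1 while the pipeline is still warming up.
constexpr int ComputedTile(int iter, int kTiles, int preloadStages) {
    return (iter >= preloadStages && iter - preloadStages < kTiles)
               ? iter - preloadStages
               : -1;
}

// Prints one line per iteration; the load is listed before the Matmul
// because each loop loads first and then computes the previous loop's tile.
inline void PrintSchedule(int kTiles, int preloadStages) {
    for (int iter = 0; iter < TotalIterations(kTiles, preloadStages); ++iter) {
        const int load = LoadedTile(iter, kTiles);
        const int mmad = ComputedTile(iter, kTiles, preloadStages);
        std::printf("iter %d:", iter);
        if (load >= 0) std::printf(" load %d", load);
        if (load >= 0 && mmad >= 0) std::printf(",");
        if (mmad >= 0) std::printf(" mmad %d", mmad);
        std::printf("\n");
    }
}
```

With three tiles and preloadStages set to 1, the model yields four iterations: the first only loads tile 0, the next two load a tile and compute the previous one, and the last only computes tile 2, which corresponds to the one additional Matmul computation after the final loop.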

The second parameter is enableShuffleK. It is mainly used to avoid the bandwidth loss caused by same-address access conflicts, and it works by staggering the data read addresses of the cores. This parameter does not need to be enabled on 950.
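One way to picture K-direction staggering is to rotate each core's starting K tile by its core index. The source only states that read addresses are staggered per core, so the rotation scheme below is an assumption, and `ShuffledKTile` is a hypothetical name rather than a catlass function.

```cpp
#include <cstdint>

// Hypothetical model of K-direction staggered reading; not catlass code.
// Returns the K tile index that core `coreIdx` reads at loop step `step`.
// Assumption: staggering is a rotation of the start tile by the core index.
// Precondition: kTiles > 0.
constexpr std::uint32_t ShuffledKTile(std::uint32_t step, std::uint32_t coreIdx,
                                      std::uint32_t kTiles, bool enableShuffleK) {
    return ((enableShuffleK ? coreIdx % kTiles : 0u) + step) % kTiles;
}
```

In this model, with four K tiles and staggering enabled, core 0 reads tiles 0, 1, 2, 3 while core 1 reads tiles 1, 2, 3, 0, so the two cores do not read the same K offset in the same step. With staggering disabled, every core starts at tile 0.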

Compared with MmadPingpong, MmadPreloadAsyncWithCallback has more optimization points, but its logic is more complex and its Scalar overhead is higher. Choose between them based on the scenario, and weigh this overhead with particular care for small shapes.
