The Challenges and Practices of Network Stability in Alibaba's Large-Scale Computing Clusters
Open Compute Project

 Published On Oct 21, 2024

Xuemei Shi (Technical Engineer) - Alibaba
Surendra Anubolu (Distinguished Engineer) - Broadcom Inc

Unlike traditional high-performance RDMA storage networks, AI/ML training relies on synchronization operators such as allreduce and all2all, so large-scale computing clusters demand extremely high network stability under the concept of DC as a computer. This presentation shares how Alibaba has achieved this higher level of network stability in large-scale computing cluster networks through unified monitoring, high-precision flow monitoring, and technologies such as A.M.D, all guided by the DC as a computer philosophy.

High-precision flow monitoring, meaning flow-based rate statistics at sub-millisecond granularity, is used to observe computing traffic patterns, identify micro-congestion points, and assist in congestion control optimization.

A.M.D stands for Alternate Marking DSCP. With this implementation, any packet loss event anywhere in the network can be detected and located on the order of seconds.
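The description does not say how A.M.D works internally. As a rough illustration of the general alternate-marking idea it is named after, the sketch below assumes that a marking bit carried in the DSCP field flips every measurement period, that every hop counts the packets it sees per period, and that a drop in the count between two adjacent hops localizes the loss to that segment. The function name, hop names, and counter layout are hypothetical and are not taken from the talk.

```python
from typing import Dict, List, Tuple


def locate_loss(
    hops: List[str], counters: Dict[str, List[int]]
) -> List[Tuple[int, str, str, int]]:
    """Compare per-period packet counts between adjacent hops.

    counters[hop][period] is the number of packets that hop counted while
    the (assumed) marking colour for that period was active. Returns one
    (period, upstream_hop, downstream_hop, lost_packets) tuple for every
    segment and period in which the downstream count is lower.
    """
    events = []
    # Only compare periods that every hop has finished reporting.
    periods = min(len(counters[hop]) for hop in hops)
    for period in range(periods):
        for upstream, downstream in zip(hops, hops[1:]):
            lost = counters[upstream][period] - counters[downstream][period]
            if lost > 0:
                events.append((period, upstream, downstream, lost))
    return events


# Hypothetical path and counters: 3 packets vanish between leaf1 and spine1
# in period 1, and 1 packet vanishes between spine1 and leaf2 in period 2.
hops = ["leaf1", "spine1", "leaf2"]
counters = {
    "leaf1": [1000, 1000, 1000],
    "spine1": [1000, 997, 1000],
    "leaf2": [1000, 997, 999],
}

for period, upstream, downstream, lost in locate_loss(hops, counters):
    print(f"period {period}: {lost} packet(s) lost between {upstream} and {downstream}")
```

Because the colour alternates, each hop can close out the counter for the previous period while the next one is still filling, which is what makes per-period comparison possible without synchronizing on individual packets.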
