Leveraging open technologies to monitor packet drops in AI cluster fabrics
Open Compute Project

Published on Oct 21, 2024

Aldrin Isaac (Director, Site Network Engineering), eBay

AI clusters operate most efficiently over lossless networks, because dropped packets can significantly lengthen job completion times. Although a network can be designed to minimize packet loss by choosing the right topology and optimizing network devices and protocols, an effective tool for monitoring and troubleshooting network performance is still required. Such a tool should capture packet drops, raise notifications, identify the drop reasons, and pinpoint where the drops caused congestion. That in turn allows the governing management application to tune the configuration of the relevant infrastructure components, including switches, NICs, and GPU servers.

We will share the results and best practices of a TAM (Telemetry and Monitoring) solution being prepared for deployment at eBay. It leverages OCP's SAI and open sFlow drop notification technologies as part of eBay's ongoing initiatives to adopt open networking hardware and community SONiC for its data centers.
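As a rough illustration of the capture, classify, and localize steps described above, the following minimal Python sketch counts drop notifications per device, port, and reason, and flags the combinations that reach a threshold. Everything in it is an assumption made for illustration: the `DropEvent` record, its field names, the reason strings, and the `summarize_drops` helper are hypothetical, and none of it reflects the sFlow wire format, the SAI API, or eBay's TAM solution.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DropEvent:
    """One dropped-packet notification (hypothetical record layout)."""
    device: str   # switch, NIC, or server that reported the drop
    port: str     # interface where the packet was discarded
    reason: str   # drop reason label (illustrative values only)


def summarize_drops(events, threshold=2):
    """Count drops per (device, port, reason) and flag hot spots.

    Returns (counts, alerts), where alerts lists the keys whose count
    reaches the threshold, most frequent first.
    """
    counts = Counter((e.device, e.port, e.reason) for e in events)
    alerts = [key for key, n in counts.most_common() if n >= threshold]
    return counts, alerts


if __name__ == "__main__":
    # Made-up sample data, not taken from any real deployment.
    sample = [
        DropEvent("leaf1", "Ethernet8", "ttl_exceeded"),
        DropEvent("leaf1", "Ethernet8", "ttl_exceeded"),
        DropEvent("spine2", "Ethernet0", "acl_deny"),
    ]
    counts, alerts = summarize_drops(sample)
    for device, port, reason in alerts:
        n = counts[(device, port, reason)]
        print(f"ALERT {device} {port}: {reason} x{n}")
```

Keying the counts on device, port, and reason together is one simple way to answer both "why" and "where" in a single pass; a real collector would also need time windows and rate-based thresholds.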
