AWS Blog in Collaboration with NVIDIA – Optimizing Inference for Seq2Seq and Encoder-Only Models Using NVIDIA GPUs and the Triton Inference Server

Blurb:

Deep learning Transformer models are architecturally complex and can have hundreds of millions (or even billions) of parameters, which makes real-time inference slow. Low-latency, real-time inference is a critical requirement for using these models in a production setting.

Our AWS blog shares the technical details of how we achieved low-latency inference using NVIDIA A10G GPUs, the Triton Inference Server, and the TensorRT model format. The blog also highlights the cost and latency savings of this infrastructure compared to native SageMaker CPU-based hosting. We used the developed infrastructure to host a BART encoder-decoder model (for spelling correction) and a Sentence Transformers encoder-only model (for vector search).
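
As a rough illustration of the serving path described above, the snippet below shows a minimal Triton HTTP client call against a hypothetical deployment of the spelling-correction model. The model name (`bart_spell_correct`), tensor names (`input_ids`, `attention_mask`, `logits`), and shapes are illustrative assumptions rather than the actual configuration from the blog; in practice they must match the `config.pbtxt` of the deployed TensorRT model.

```python
# Minimal sketch of querying a Triton Inference Server deployment over HTTP.
# Model name, tensor names, and shapes are hypothetical placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy token IDs standing in for a tokenized query (assumed INT32 inputs).
input_ids = np.zeros((1, 32), dtype=np.int32)
attention_mask = np.ones((1, 32), dtype=np.int32)

inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT32"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

outputs = [httpclient.InferRequestedOutput("logits")]

# Run inference against the (hypothetical) BART spelling-correction model.
response = client.infer(model_name="bart_spell_correct", inputs=inputs, outputs=outputs)
logits = response.as_numpy("logits")
print(logits.shape)
```

The same client pattern applies to the encoder-only Sentence Transformers model used for vector search; only the model name and tensor names would change.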

Blog Link:

How Amazon Music uses SageMaker with NVIDIA to optimize ML training and inference performance and cost

About Siddharth Sharma

Interested in NLP, Retrieval & Ranking Models, Content Understanding and Predictive Analytics.