Blurb:
Deep Learning Transformer models have complex architectures and can contain hundreds of millions (or even billions) of parameters, which makes real-time inference slow. Low-latency, real-time inference is a critical requirement for deploying Deep Learning models in a production setting.
Our AWS blog shares the technical details of how we achieved this milestone using NVIDIA A10G GPUs, the Triton Inference Server, and the TensorRT model format. The blog also highlights the cost and latency savings achieved with this infrastructure compared to native SageMaker CPU-based hosting. We have used the infrastructure to host a BART encoder-decoder model (for spelling correction) and a Sentence Transformer encoder-only model (for vector search).
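For readers unfamiliar with Triton, serving a TensorRT-compiled model comes down to placing the compiled plan in a model repository alongside a `config.pbtxt`. The sketch below shows what such a configuration might look like for an encoder model; the model name, tensor names, sequence length, and embedding size are illustrative assumptions, not values from the blog.

```protobuf
# Hypothetical config.pbtxt for a TensorRT-compiled sentence encoder.
# All names and dims below are assumed for illustration.
name: "sentence_encoder_trt"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ 128 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [ 128 ]
  }
]
output [
  {
    name: "sentence_embedding"
    data_type: TYPE_FP32
    dims: [ 384 ]
  }
]
# Run one model instance per GPU and batch concurrent requests,
# trading a small queueing delay for higher throughput.
instance_group [ { kind: KIND_GPU, count: 1 } ]
dynamic_batching { max_queue_delay_microseconds: 100 }
```

Dynamic batching is one of the main levers for GPU cost efficiency here: it lets Triton coalesce concurrent requests into a single batched TensorRT execution.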
Blog Link: