Blurb:
Deep Learning Transformer models have complex architectures and can contain hundreds of millions (or even billions) of parameters, which makes real-time inference slow. Low-latency, real-time inference is a critical requirement for deploying Deep Learning models in a production setting.
Our AWS blog shares the technical details of how we achieved this milestone using NVIDIA A10G GPUs, the Triton Inference Server, and the TensorRT model format. The blog also highlights the cost and latency savings achieved with this infrastructure compared to native SageMaker CPU-based hosting. We have used the infrastructure to host a BART encoder-decoder model (for spelling correction) and a Sentence Transformer encoder-only model (for vector search).
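For readers unfamiliar with Triton, serving a TensorRT-compiled model comes down to placing the compiled plan in a model repository alongside a `config.pbtxt`. The sketch below shows what such a configuration might look like for an encoder model; the model name, tensor names, sequence length, and embedding size are illustrative assumptions, not values from the blog.

```protobuf
# Hypothetical config.pbtxt for a TensorRT-compiled sentence encoder.
# All names and dims below are assumed for illustration.
name: "sentence_encoder_trt"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ 128 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [ 128 ]
  }
]
output [
  {
    name: "sentence_embedding"
    data_type: TYPE_FP32
    dims: [ 384 ]
  }
]
# Run one model instance per GPU and batch concurrent requests,
# trading a small queueing delay for higher throughput.
instance_group [ { kind: KIND_GPU, count: 1 } ]
dynamic_batching { max_queue_delay_microseconds: 100 }
```

Dynamic batching is one of the main levers for GPU cost efficiency here: it lets Triton coalesce concurrent requests into a single batched TensorRT execution.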
Blog Link: