Monthly Archives: November 2023
AWS Blog In Collaboration With Nvidia – Optimizing Inference For Seq2Seq And Encoder Only Models Using Nvidia GPU And Triton Model Server
Blurb: Deep Learning Transformer models are architecturally complex and can have hundreds of millions (or even billions) of parameters, which makes real-time inference slow. Real-time, low-latency inference of Deep Learning models is a critical requirement … Continue reading →
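The blurb's claim about parameter counts can be made concrete with a rough back-of-envelope estimate. The configuration below (layer count, `d_model`, `d_ff`, vocabulary size) is an assumption loosely modeled on base-sized Transformers, not the exact Flan-T5-Base specification, and the formula deliberately ignores biases, layer norms, and decoder cross-attention:

```python
def layer_params(d_model: int, d_ff: int) -> int:
    """Approximate weights in one Transformer layer:
    4 projection matrices for attention (Q, K, V, output)
    plus a 2-matrix feed-forward block.
    Biases, layer norms, and decoder cross-attention are ignored."""
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * d_ff
    return attention + feed_forward

def model_params(n_layers: int, d_model: int, d_ff: int, vocab: int) -> int:
    """All layers plus a shared token-embedding matrix."""
    return n_layers * layer_params(d_model, d_ff) + vocab * d_model

# Assumed base-sized encoder-decoder config: 12 encoder + 12 decoder
# layers, d_model=768, d_ff=3072, 32k-token vocabulary.
total = model_params(n_layers=24, d_model=768, d_ff=3072, vocab=32_000)
print(f"{total / 1e6:.0f}M parameters")  # on the order of 200M
```

Even this simplified estimate lands near two hundred million weights, which is why a GPU-backed serving stack such as Triton with TensorRT-optimized engines matters for low-latency deployment.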
Posted in Uncategorized | Tagged AWS, BART, GPU, Low Latency, Model Inferencing, Model Server, Nvidia, Sagemaker, SEQ2SEQ, TensorRT, Triton | Leave a comment