The Groq LPU™
Purpose-built for inference performance and precision, all in a simple, efficient design
The demand for LLMs is accelerating, and current processors can't keep pace with the speed and scale required. The GPU is the weakest link in the generative AI ecosystem.
We’ve developed the LPU™ Inference Engine, an end-to-end inference acceleration system, to deliver substantial performance, efficiency, and precision all in a simple design, created and engineered in North America.
This is the world’s first Language Processing Unit™ Inference Engine, purpose-built for inference performance and precision. How performant? Today, we are running Llama-2 70B at over 300 tokens per second per user.
The LPU resides in the data center alongside the CPUs and Graphics Processors that enable training, and customers can choose on-premises deployment or API access. Our vision is to set a new expectation for what the AI experience should be: inference that wows with low latency and real-time delivery, all in an energy-efficient package. Our promise to customers, partners, and prompters is always and forever to make it real.
What is an LPU™ inference engine and why is it good for LLMs?
An LPU™ Inference Engine, with LPU standing for Language Processing Unit™, is a new type of processing system invented by Groq to handle computationally intensive applications with a sequential component, such as LLMs.
LPU Inference Engines are designed to overcome the two bottlenecks for LLMs: the amount of compute and memory bandwidth. An LPU system has as much or more compute as a Graphics Processor (GPU) and reduces the amount of time per word calculated, allowing faster generation of text sequences. With no external memory bandwidth bottlenecks, an LPU Inference Engine delivers orders of magnitude better performance than a GPU.
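The memory-bandwidth bottleneck above can be made concrete with a back-of-the-envelope roofline calculation: in autoregressive decoding, every generated token must stream all of the model's weights through the processor, so a single user's token rate is capped by bandwidth divided by model size. The sketch below uses illustrative, hypothetical numbers (2 bytes per parameter, 2,000 GB/s of external memory bandwidth), not specs for any particular chip.

```python
# Illustrative roofline bound for single-stream LLM decoding.
# Assumption (hypothetical numbers): each token requires reading every
# weight once from external memory, so bandwidth / model size bounds speed.

def max_tokens_per_second(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on per-user decode speed, in tokens per second."""
    model_size_gb = params_billion * bytes_per_param  # GB of weights read per token
    return bandwidth_gb_s / model_size_gb

# A 70B-parameter model at FP16 (2 bytes/param) on a hypothetical
# 2,000 GB/s external-memory processor:
print(round(max_tokens_per_second(70, 2, 2000), 1))  # ~14.3 tokens/s per user
```

This is why removing the external-memory round trip, rather than adding raw compute, is what lifts per-user generation speed.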
What is the definition of an LPU™ inference engine?
An LPU™ inference engine has the following characteristics:
- Exceptional sequential performance
- Single core architecture
- Synchronous networking that is maintained even for large scale deployments
- Ability to auto-compile LLMs of more than 50B parameters
- Instant memory access
- High accuracy that is maintained even at lower precision levels
What is Groq's performance on an LPU™ inference engine?
Previously, we set records of 100 tokens per second per user, followed by 240 tokens per second per user. Groq recently published performance results of over 300 tokens per second per user on Llama 2 70B running on an LPU™ system.