ArtificialAnalysis.ai LLM Benchmark Doubles Axis To Fit New Groq LPU™ Inference Engine Performance Results

Written by:
Groq

Groq Represents a “Step Change” in Inference Speed Performance According to ArtificialAnalysis.ai

We’re opening the second month of the year with our second LLM benchmark, this time by ArtificialAnalysis.ai. Spoiler: The Groq LPU™ Inference Engine performed so well that the chart axes had to be extended to plot Groq on the Latency vs. Throughput chart. But before we dive into the results, let’s talk about the setup.

This benchmark analyzes Meta AI’s Llama 2 Chat (70B) across metrics including quality, latency, throughput (tokens per second), price, and others. Groq joined other API host providers including Microsoft Azure, Amazon Bedrock, Perplexity, Together.ai, Anyscale, Deepinfra, Fireworks, and Lepton.

Conducted independently, the ArtificialAnalysis.ai benchmarks compare hosting providers across key performance indicators including throughput versus price, latency versus throughput, throughput over time, total response time, and throughput variance. The benchmarks are ‘live,’ meaning they’re updated every three hours (eight times per day), and prompts are unique, around 100 tokens in length, and generate ~200 output tokens. This test design is meant to reflect real-world usage and measures changes to throughput (tokens per second) and latency (time to first token) over time. ArtificialAnalysis.ai also has other benchmarks with longer prompts to reflect retrieval augmented generation (RAG) use cases.
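For readers who want to reproduce the general shape of these measurements, here is a minimal sketch of how time to first token and generation throughput can be captured from a streaming completion API. This is not ArtificialAnalysis.ai’s actual harness; `stream_chat` is a hypothetical client that yields token chunks as they arrive.

```python
import time

def measure_stream(stream_chat, prompt):
    """Measure latency (time to first token chunk) and throughput (tokens per
    second after the first chunk) for one streamed completion.
    `stream_chat` is a hypothetical client that yields token chunks."""
    start = time.perf_counter()
    first_chunk_at = None
    chunks_after_first = 0

    for _chunk in stream_chat(prompt):
        now = time.perf_counter()
        if first_chunk_at is None:
            first_chunk_at = now          # latency ends at the first chunk
        else:
            chunks_after_first += 1       # count tokens generated after the first chunk
    end = time.perf_counter()

    if first_chunk_at is None:
        raise RuntimeError("no chunks received from the API")

    latency = first_chunk_at - start
    generation_time = end - first_chunk_at
    throughput = chunks_after_first / generation_time if generation_time > 0 else 0.0
    return latency, throughput
```

Repeating this with fresh ~100-token prompts on a fixed schedule, as the live benchmark does, is what lets latency and throughput be tracked over time rather than as a single snapshot.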

Results 
ArtificialAnalysis.ai benchmarks show Groq outperforming other providers in almost every category, most notably throughput and total response time to receive 100 output tokens, where Groq delivered 241 tokens per second and 0.8 seconds, respectively.
Below is a closer look at some of the results.
Latency vs Throughput
  • Latency: Time to first token chunk received, in seconds, after the API request is sent
  • Throughput: Tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API)
  • Lower and to the right is better, with green representing the most attractive quadrant

Throughput

  • Throughput: Tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API)
  • Higher is better
Total Response Time
  • Total response time: Time to receive a 100-token response, estimated from latency and throughput (a rough calculation is sketched below)
  • Lower is better
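Since the benchmark describes total response time only as an estimate based on latency and throughput, the sketch below assumes the simplest combination: time to first token plus the time to generate 100 tokens at the measured throughput. The function name and the example numbers are illustrative, not benchmark figures.

```python
def estimate_total_response_time(latency_s: float, throughput_tps: float,
                                 output_tokens: int = 100) -> float:
    """Assumed estimate: time to first token plus the time to generate the
    requested number of output tokens at the measured throughput."""
    return latency_s + output_tokens / throughput_tps

# Illustrative numbers only: a 0.4 s time to first token at 240 tokens/s
# gives roughly 0.4 + 100 / 240 ≈ 0.82 s for a 100-token response.
```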

The LPU Inference Engine is available on GroqCloud™, which offers multiple levels of API access. Learn more here.
