Speed of LLaMa CPU-based Inference Across Select System Configurations

This page compares the speed of CPU-only inference across various system and inference configurations when using llama.cpp. The purpose of this page is to shed more light on how configuration changes can affect inference speed.

Measured Metrics

Refer to the llama.cpp documentation for more information.

System Configuration

llama.cpp Configuration

System Configuration Variations

llama.cpp Configuration Variations

Working with the Graphs & Data

The graphs on this page are best viewed on a desktop computer.

The horizontal x-axis denotes the number of threads. The vertical y-axis denotes time, measured in milliseconds.
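As a rough sketch of how a single data point on these axes can be produced, the snippet below times llama.cpp's example main binary across several thread counts. The binary path, model path, prompt, and the use of total wall-clock time (rather than llama.cpp's own per-token timings) are all assumptions for illustration.

    # Minimal sketch: sweep thread counts and record wall-clock time per run.
    # "./main" and the model filename are assumed to exist locally; the
    # -m/-t/-c/-n/-p flags are those of llama.cpp's example main program.
    import subprocess
    import time

    MODEL = "ggml-model-q4_0.bin"  # assumed local model file

    for threads in (1, 2, 4, 8, 16):
        start = time.perf_counter()
        subprocess.run(
            ["./main", "-m", MODEL, "-t", str(threads),
             "-c", "512", "-n", "128", "-p", "Building a website can be done in"],
            check=True, capture_output=True,
        )
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{threads:>2} threads: {elapsed_ms:.0f} ms")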

For a less cluttered view of the graph, hide all the curves first, then toggle only the curves you want to examine. This is done by double-clicking one of the labels in the legend to isolate its curve, then clicking once on the legend label of each additional curve you want to view.

The curve label format is:

        RAMSPEED-DIMMCOUNT-FREQGOV-NINSTANCE-PARAM-CTX-MODEL

For example, the label 5200-2dimm-schedutil-3-7B-512-ggml-model-q4_0.bin pertains to a run done while the system had 2 DIMMs of RAM operating at 5200 MT/s, the CPU frequency governor was set to schedutil, and 3 separate instances of llama.cpp were running the ggml-model-q4_0.bin version of the 7B model with a 512-token context window.
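For scripted analysis, a label can be split back into its fields. The following is a minimal sketch; parse_label is a hypothetical helper, not something shipped with llama.cpp or the data archive.

    # Sketch of splitting a curve label into its fields. The MODEL field
    # itself contains dashes (e.g. ggml-model-q4_0.bin), so split off the
    # first six fields and keep the remainder as the model filename.
    def parse_label(label: str) -> dict:
        ram_speed, dimm_count, freq_gov, n_instance, param, ctx, model = label.split("-", 6)
        return {
            "ram_speed_mts": int(ram_speed),  # RAM speed in MT/s
            "dimm_count": dimm_count,         # e.g. "2dimm"
            "freq_governor": freq_gov,        # e.g. "schedutil"
            "n_instances": int(n_instance),   # concurrent llama.cpp instances
            "model_params": param,            # e.g. "7B"
            "context_size": int(ctx),         # context window in tokens
            "model_file": model,              # e.g. "ggml-model-q4_0.bin"
        }

    print(parse_label("5200-2dimm-schedutil-3-7B-512-ggml-model-q4_0.bin"))
    # {'ram_speed_mts': 5200, 'dimm_count': '2dimm', 'freq_governor': 'schedutil',
    #  'n_instances': 3, 'model_params': '7B', 'context_size': 512,
    #  'model_file': 'ggml-model-q4_0.bin'}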

The data used for these graphs is available for download as a zipped archive here. Use the password QVlr1kKzDjc= to access the data.
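As a sketch, the archive can also be extracted programmatically with Python's standard zipfile module, assuming it uses classic ZipCrypto encryption (which zipfile can decrypt, unlike AES) and has been saved locally under a hypothetical filename:

    # Extract the password-protected archive into a "data" directory.
    # "llama-cpu-benchmarks.zip" is a placeholder for the downloaded file;
    # note that zipfile's pwd argument must be bytes, not str.
    import zipfile

    with zipfile.ZipFile("llama-cpu-benchmarks.zip") as archive:
        archive.extractall(path="data", pwd=b"QVlr1kKzDjc=")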

These graphs are best viewed while consuming at least one tomato 🍅️.

Interactive Graphs