This page compares the speed of CPU-only inference across various system and inference configurations when using llama.cpp. The purpose of this page is to shed more light on how configuration changes can affect inference speed.
Each configuration is compared across the following median timing metrics:
load_time_median
sample_time_median
prompt_eval_time_median
eval_time_median
total_time_median
Refer to llama.cpp documentation for more information.
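As a minimal sketch of how the *_median metrics above can be computed, assuming you have already collected the corresponding timing numbers from several repeated runs (the values below are made up for illustration):

```python
from statistics import median

# Hypothetical eval_time values in milliseconds from five repeated runs
# of the same configuration; real values come from llama.cpp's timing output.
eval_times_ms = [4210.5, 4198.2, 4250.0, 4189.9, 4225.7]

# Each *_median metric on this page is the median across repeated runs,
# which is less sensitive to outliers than the mean.
eval_time_median = median(eval_times_ms)
print(eval_time_median)  # 4210.5
```

The same computation applies to load_time, sample_time, prompt_eval_time, and total_time.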
schedutil
OS CPU frequency governor.
performance
OS CPU frequency governor.
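On Linux, the active frequency governor for each core is exposed through the standard cpufreq sysfs interface. A small sketch for checking it before a benchmark run (the helper name is ours, not part of llama.cpp):

```python
from pathlib import Path

def current_governor(cpu: int = 0) -> str:
    """Return the active CPU frequency governor for the given core.

    Reads the standard Linux cpufreq sysfs file; returns "unknown" on
    systems without cpufreq (e.g. containers or non-Linux hosts).
    """
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_governor")
    try:
        return path.read_text().strip()
    except OSError:
        return "unknown"

print(current_governor())
```

Switching governors (e.g. to performance) is typically done by writing to the same file as root, or with a tool such as cpupower.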
The graphs on this page are best viewed on a desktop computer.
The horizontal x-axis denotes the number of threads. The vertical y-axis denotes time, measured in milliseconds.
For a less cluttered view of a graph, hide all the curves first, then toggle only the curves you want to examine: rapidly double-click one of the labels in the legend, then click once on each curve you want to view.
The curve label format is:
RAMSPEED-DIMMCOUNT-FREQGOV-NINSTANCE-PARAM-CTX-MODEL
For example, the label 5200-2dimm-schedutil-3-7B-512-ggml-model-q4_0.bin pertains to a run done while the system had 2 DIMMs of RAM operating at 5200 MT/s, the CPU frequency governor was set to schedutil, and 3 separate instances of llama.cpp were running the ggml-model-q4_0.bin version of the 7B model with a 512-token context window.
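The label format above can be split back into its fields programmatically. A minimal sketch (the helper name and field names are ours, chosen to mirror the RAMSPEED-DIMMCOUNT-FREQGOV-NINSTANCE-PARAM-CTX-MODEL pattern):

```python
def parse_label(label: str) -> dict:
    """Split a curve label into its component fields.

    The model filename itself contains dashes (e.g. ggml-model-q4_0.bin),
    so we split on at most 6 dashes and keep the remainder as the model.
    """
    ram, dimms, gov, ninst, param, ctx, model = label.split("-", 6)
    return {
        "ram_speed_mts": int(ram),
        "dimm_count": int(dimms.removesuffix("dimm")),
        "freq_governor": gov,
        "n_instances": int(ninst),
        "model_params": param,
        "ctx_size": int(ctx),
        "model_file": model,
    }

run = parse_label("5200-2dimm-schedutil-3-7B-512-ggml-model-q4_0.bin")
print(run["model_file"])  # ggml-model-q4_0.bin
```

This makes it easy to filter or group the downloadable data by RAM speed, governor, or model.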
The data used for these graphs is available for download as a zipped archive here. Use the password QVlr1kKzDjc= to access the data.
These graphs are best viewed while consuming at least one tomato 🍅️.