Speed of LLaMa CPU-based Inference Across Select System Configurations

This page compares the speed of CPU-only inference across various system and inference configurations when using llama.cpp. The purpose of this page is to shed more light on how configuration changes can affect inference speed.

Measured Metrics

Refer to the llama.cpp documentation for more information.

System Configuration

llama.cpp Configuration

System Configuration Variations

llama.cpp Configuration Variations

Working with the Graphs & Data

The graphs on this page are best viewed on a desktop computer.

The horizontal x-axis denotes the number of threads. The vertical y-axis denotes time, measured in milliseconds.
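As a rough sketch of how a single data point on these axes can be produced, the snippet below times llama.cpp's example main binary across several thread counts. The binary path, model path, prompt, and the use of total wall-clock time (rather than llama.cpp's own per-token timings) are all assumptions for illustration.

    # Minimal sketch: sweep thread counts and record wall-clock time per run.
    # "./main" and the model filename are assumed to exist locally; the
    # -m/-t/-c/-n/-p flags are those of llama.cpp's example main program.
    import subprocess
    import time

    MODEL = "ggml-model-q4_0.bin"  # assumed local model file

    for threads in (1, 2, 4, 8, 16):
        start = time.perf_counter()
        subprocess.run(
            ["./main", "-m", MODEL, "-t", str(threads),
             "-c", "512", "-n", "128", "-p", "Building a website can be done in"],
            check=True, capture_output=True,
        )
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{threads:>2} threads: {elapsed_ms:.0f} ms")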

For a less cluttered view of the graph, hide all the curves first, then toggle only the curves you want to examine. This is done by double-clicking one of the labels in the legend to isolate its curve, then clicking once on the legend label of each additional curve you want to view.

The curve label format is:

        RAMSPEED-DIMMCOUNT-FREQGOV-NINSTANCE-PARAM-CTX-MODEL

For example, the label 5200-2dimm-schedutil-3-7B-512-ggml-model-q4_0.bin pertains to a run done while the system had 2 DIMMs of RAM operating at 5200 MT/s, the CPU frequency governor was set to schedutil, and 3 separate instances of llama.cpp were running the ggml-model-q4_0.bin version of the 7B model with a 512-token context window.
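For scripted analysis, a label can be split back into its fields. The following is a minimal sketch; parse_label is a hypothetical helper, not something shipped with llama.cpp or the data archive.

    # Sketch of splitting a curve label into its fields. The MODEL field
    # itself contains dashes (e.g. ggml-model-q4_0.bin), so split off the
    # first six fields and keep the remainder as the model filename.
    def parse_label(label: str) -> dict:
        ram_speed, dimm_count, freq_gov, n_instance, param, ctx, model = label.split("-", 6)
        return {
            "ram_speed_mts": int(ram_speed),  # RAM speed in MT/s
            "dimm_count": dimm_count,         # e.g. "2dimm"
            "freq_governor": freq_gov,        # e.g. "schedutil"
            "n_instances": int(n_instance),   # concurrent llama.cpp instances
            "model_params": param,            # e.g. "7B"
            "context_size": int(ctx),         # context window in tokens
            "model_file": model,              # e.g. "ggml-model-q4_0.bin"
        }

    print(parse_label("5200-2dimm-schedutil-3-7B-512-ggml-model-q4_0.bin"))
    # {'ram_speed_mts': 5200, 'dimm_count': '2dimm', 'freq_governor': 'schedutil',
    #  'n_instances': 3, 'model_params': '7B', 'context_size': 512,
    #  'model_file': 'ggml-model-q4_0.bin'}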

The data used for these graphs is available for download as a zipped archive here. Use the password QVlr1kKzDjc= to access the data.
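As a sketch, the archive can also be extracted programmatically with Python's standard zipfile module, assuming it uses classic ZipCrypto encryption (which zipfile can decrypt, unlike AES) and has been saved locally under a hypothetical filename:

    # Extract the password-protected archive into a "data" directory.
    # "llama-cpu-benchmarks.zip" is a placeholder for the downloaded file;
    # note that zipfile's pwd argument must be bytes, not str.
    import zipfile

    with zipfile.ZipFile("llama-cpu-benchmarks.zip") as archive:
        archive.extractall(path="data", pwd=b"QVlr1kKzDjc=")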

These graphs are best viewed while consuming at least one tomato 🍅️.

Interactive Graphs