Feature request: tensor layer number parameter, separated from the layer name
- ggml tensors need a layer number parameter. I'd use layer 0 for global and 1+ for model layers (it could also be -1 and 0+, of course).
- When a ggml tensor is created, the most recently configured layer is used; the default is 0:
  `ggml_set_current_layer(il+1);`
  This way it only takes a single line of code in the eval loop to set the layer of each node.
- The model currently stores the name intermixed with the layer, like `"layers.58.attention.wo.weight"`. This should also be changed to `"attention.wo.weight"` with the layer number set to 59.
Benefits:
- debug output will be cleaner and more informative: the layer of each calculation becomes part of the graph print (without hardcoding it into the generic name)
- optimizations can be applied by layer number or weight name in a clean way
Good idea
Would love to take this.
Lmk if looks good @cmp-nct https://github.com/ggerganov/llama.cpp/pull/1692
> Good idea

Apologies - I was wrong.
The better approach to support this type of functionality is through the new `extra` member of `ggml_tensor`:
https://github.com/ggerganov/llama.cpp/blob/98ed16557432d7a5179c57eddcc3a08a7ae6d54d/ggml.h#L390
The reason is to keep the ggml interface general-purpose.
The "layer" concept does not always make sense - i.e. it is specific to particular architectures. Therefore it cannot be part of the interface.
Do you recall my "meta" recommendation a month or two ago? That's basically what `extra` appears to be. Though as a void pointer it would not be supported by IDE autocompletion, right?
Wouldn't a simple struct be a better solution than a void pointer? It's not as "flexible", but unused elements can just stay undefined/default.
Regarding the layer: I originally recommended using layer -1 or 0 for "no layer", so for tensors that are not layer-specific that's a way to specify it.
I personally use something like this. It's not 100% up to date with the latest ggml commits since I shifted my focus to falcon, but basically it should work (just `extra` instead of `meta`):
```c
typedef struct {
    uint8_t layer;                  // 0 = not set, 1+ = layer num
    char    short_name[32];         // parameter weight name without layer
    int8_t  flag_high_performance;  // if this tensor is high performance (relevant for hybrid core systems like Intel 12th/13th gen)
    int8_t  flag_use_blas_cuda;     // dst or src0 tensor, -1 = default, 0 = off, 1 = on (if contiguous)
    int8_t  caching_priority;       // -1 = no caching, 0 = default, 1+ = priority | caching is by layer number and layer name (0 = global layer)
} tensor_meta;

// default tensor meta:
static const tensor_meta GGML_DEFAULT_TENSOR_META = {
    /*.layer                 =*/ 0,
    /*.short_name            =*/ "",
    /*.flag_high_performance =*/ 1,
    /*.flag_use_blas_cuda    =*/ -1, // -1 = use default
    /*.caching_priority      =*/ -1, // no caching
};
```