No need to copy tokens
Ironically, with N=1, doing this moves more data than the memcpy would (a 64-bit pointer instead of a single 32-bit value).
If we can expose an API to create a tensor from existing data, this would also save the cycles spent allocating the tensor's data buffer.
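To make the trade-off concrete, here is a minimal sketch of the two paths: copying the tokens into freshly allocated storage versus creating the tensor directly over caller-owned data. Everything in it (`toy_tensor`, `toy_new_copy`, `toy_new_with_data`, `owns_data`) is a hypothetical stand-in for illustration, not the real `ggml_tensor` layout or any existing ggml function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a tensor; NOT the real ggml_tensor layout. */
typedef struct {
    int32_t *data;      /* element storage */
    size_t   n;         /* number of elements */
    int      owns_data; /* 1 if the tensor allocated data itself */
} toy_tensor;

/* Copy path: allocate storage, then memcpy n tokens into it. */
static toy_tensor toy_new_copy(const int32_t *tokens, size_t n) {
    assert(tokens != NULL && n > 0);
    toy_tensor t = { malloc(n * sizeof *tokens), n, 1 };
    assert(t.data != NULL);
    memcpy(t.data, tokens, n * sizeof *tokens);
    return t;
}

/* Hypothetical "create with data" path: no allocation and no copy.
 * The caller must keep tokens alive for as long as the tensor is used. */
static toy_tensor toy_new_with_data(int32_t *tokens, size_t n) {
    assert(tokens != NULL && n > 0);
    toy_tensor t = { tokens, n, 0 };
    return t;
}

/* Release storage only if the tensor owns it. */
static void toy_free(toy_tensor *t) {
    if (t->owns_data) {
        free(t->data);
    }
    t->data = NULL;
}

/* Payload bytes written by each path for n tokens. */
static size_t bytes_copied(size_t n) { return n * sizeof(int32_t); }
static size_t bytes_redirected(void) { return sizeof(int32_t *); }
```

With N=1, `bytes_copied(1)` is 4, while `bytes_redirected()` is the pointer size (8 on a typical 64-bit target), which is the irony noted above. The benefit of the second path is the skipped allocation, and the cost is that the tensor no longer owns its storage.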
So the overhead is larger than not only the data but also the memcpy? I think the intention is not perf, but rather that we should not do the copy here.
Redirecting ggml_tensor.data is not a good thing to do. The mmap changes set a precedent, but we should avoid it in general.