llama.cpp

(WIP) Implement stochastic speculative sampling

Open · mscheong01 opened this pull request 1 year ago • 3 comments

Closes #5384

  • This pull request addresses the verification aspect of batched stochastic speculative sampling, following Algorithm 2 of the SpecInfer paper (https://arxiv.org/pdf/2305.09781.pdf); a simplified sketch of the idea follows this list.

  • The implementation deviates slightly from the specified method by traversing draft sequences sequentially (from 0 to ~) instead of randomly selecting them (𝑠 ∼ rand(ℋ)). I would appreciate your feedback on whether this alteration should be corrected.

  • Additionally, a new method llama_sampling_probability_distribution has been introduced in sampling.h to retrieve the target model's probability distribution for use in residual distribution calculations. While the distribution could also be obtained through ctx_sampling->cur, the new method keeps it consistent with the distribution used when running through ./main, accounting for factors such as penalties.

  • To achieve a comprehensive implementation of stochastic speculative decoding, it's essential to incorporate stochastic drafting for sampling drafts. Feedback is welcomed on how existing parameters like p_split and p_accept should be integrated with stochastic drafting. Once this is clarified, I will refine the drafting code and remove the (WIP) from the PR title.

  • Seeking validation on the implementation. Kindly provide feedback on any identified issues or concerns. Thank you.
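
For context, the verification idea at a single draft position can be sketched in a few lines of Python (a deliberate simplification with one shared draft distribution for all sequences; the names are illustrative and do not correspond to the PR's actual C++ code):

import numpy as np

def verify_position(p_target, p_draft, candidates, rng):
    # p_target, p_draft: vocab-sized probability arrays from the target and draft models
    # candidates: tokens proposed by the remaining draft sequences at this position
    p = np.array(p_target, dtype=float)
    remaining = list(candidates)
    while remaining:
        # pick a draft sequence at random (s ~ rand(H) in the paper)
        tok = remaining.pop(rng.integers(len(remaining)))
        if rng.random() < min(1.0, p[tok] / p_draft[tok]):
            return tok, True  # accepted: keep the sequences that contain tok here
        # rejected: continue with the residual distribution norm(max(0, p - q))
        p = np.maximum(p - p_draft, 0.0)
        p /= p.sum()
    # every candidate rejected: sample the next token from the residual distribution
    return int(rng.choice(len(p), p=p)), False

Here rng would be something like np.random.default_rng(seed), so that a fixed seed reproduces the same accept/reject decisions.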

mscheong01 · Feb 21 '24

Test run on my M1 Pro Mac:

./speculative -m models/llama-2-13b.Q5_K_M.gguf --model-draft models/llama-2-13b.Q5_K_M.gguf -p "Simple python quicksort function:\n" -n 200 -e --color --log-enable --temp 1
(screenshot of the test output)

mscheong01 · Feb 21 '24

I recently worked with these files and should be able to review. However, I'm currently attending a scientific conference and will only be available next week.

JohannesGaessler · Feb 21 '24

The implementation deviates slightly from the specified method by traversing draft sequences sequentially (from 0 to ~) instead of randomly selecting them (𝑠 ∼ rand(ℋ)). I would appreciate your feedback on whether this alteration should be corrected.

From a quick look, I believe the authors select random child nodes from the draft tree at each depth of the tree, while in the proposed implementation I think you are testing the drafted tokens sequentially from a list.

In other words, randomly selecting the child nodes makes sense only for n_seq_dft > 1, in which case we draft a tree of tokens. By default, n_seq_dft == 1 and we draft just a list.
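
For illustration, here is a schematic sketch of the two cases (made-up token ids; not the PR's actual data structures):

# With n_seq_dft == 1 the draft is just a linear list of token ids:
draft_list = [5, 12, 7, 3]

# With n_seq_dft > 1, p_split can start new branches, so the draft becomes a tree of
# candidate continuations; each root-to-leaf path is one draft sequence.
draft_tree = {5: {12: {7: {}, 9: {}},
                  14: {3: {}}}}

def paths(tree, prefix=()):
    # enumerate the draft sequences (root-to-leaf paths) of the tree
    if not tree:
        yield list(prefix)
    for tok, children in tree.items():
        yield from paths(children, prefix + (tok,))

print(list(paths(draft_tree)))  # [[5, 12, 7], [5, 12, 9], [5, 14, 3]]

Randomly picking a child node only matters in the tree case; with a plain list there is a single continuation to test at each position.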

Does that make sense?

ggerganov · Feb 22 '24

@ggerganov Yes, that is correct. I was referring to cases where n_seq_dft > 1.

mscheong01 · Feb 23 '24

I see now. I think the proposed approach is equivalent to the one in the paper, but I could be missing something.

To achieve a comprehensive implementation of stochastic speculative decoding, it's essential to incorporate stochastic drafting for sampling drafts.

The p_split parameter is currently used to decide when to create a new branch in the draft tree. Reading the SpecInfer paper, the authors do not present a specific strategy for creating the draft tree:

(excerpt of the relevant passage from the SpecInfer paper)

So this part can remain the same as it is.

Regarding p_accept - I suppose this parameter can be completely removed? Reading the speculative decoding paper (https://arxiv.org/pdf/2211.17192.pdf), we have this:

(excerpt of the relevant passage from the speculative decoding paper)

IIUC we no longer have to draft the tokens greedily, but can instead apply the described strategy. In this case, I think p_accept no longer has meaning.
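
As a rough Python sketch of that per-token rule (p and q stand for the target and draft distributions at the current position; purely illustrative):

import numpy as np

def accept_or_resample(p, q, drafted_token, rng):
    # accept the drafted token with probability min(1, p/q) ...
    if rng.random() < min(1.0, p[drafted_token] / q[drafted_token]):
        return drafted_token, True
    # ... otherwise resample from the residual distribution norm(max(0, p - q))
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False

Since acceptance is decided by this probabilistic rule rather than by a confidence threshold, p_accept would indeed become redundant.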

ggerganov · Feb 23 '24

@ggerganov Understood. I wasn't certain whether using a top-1 or top-k (or p_split) drafting method would affect the equivalence of the output distribution to the target model distribution (I'm still unsure about the mathematical implications), but based on the information from the SpecInfer paper, it seems we can keep it as it is for now. I'll review papers like SpecTr and SpecTr++, which propose optimal drafting and verification methods for batched speculative decoding, and make adjustments as necessary in a future PR.

In the meantime, I believe replacing p_accept with a parameter representing the number of draft tokens to sample per decoding step is a sensible decision, given its critical role in influencing the efficiency and performance of the decoding process. I'll make this update and remove the (WIP) tag.

Edit: it turns out we already have the parameter n_draft for this, so all I would have to do is remove p_accept.

mscheong01 · Feb 27 '24

I read the paper and I do not understand how their proposed sampling method can be better than what they call "naive sampling". Fundamentally, if the probability distribution of the sampled tokens is constant, then the probability of the sampled sequence being in the draft tree is also constant. So it doesn't matter what tricks you use for acceptance/rejection, it's not going to make any difference whatsoever.

Is there any evidence that quantitatively shows the method from the paper being superior to naive sampling? In https://github.com/ggerganov/llama.cpp/pull/5479 I implemented some code that tests lookup decoding in terms of e.g. acceptance rate on a large text corpus in order to get sufficient statistics. It may make sense to implement something like this for speculative decoding as well.

The implementation deviates slightly from the specified method by traversing draft sequences sequentially (from 0 to ~) instead of randomly selecting them (𝑠 ∼ rand(ℋ)). I would appreciate your feedback on whether this alteration should be corrected.

In order to preserve the probability distribution of the LLM outputs, the order in which the draft sequences are traversed must be random. This is because the algorithm stops at the first accepted continuation; if the order is non-random, the drafting is biased towards the tokens with a lower index (with the exact bias being implementation-specific).
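
A minimal sketch of randomized traversal (hypothetical helper, not the PR's actual code):

import random

def traversal_order(n_seq, rng):
    # Because verification stops at the first accepted continuation, a fixed 0..n order
    # would systematically favor low-index sequences; shuffling removes that bias.
    order = list(range(n_seq))
    rng.shuffle(order)
    return order

rng = random.Random(42)          # dedicated, seedable RNG keeps runs reproducible
print(traversal_order(4, rng))   # order is random but reproducible for a given seed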

JohannesGaessler · Feb 27 '24

I'm testing the PR using instruction-tuned Codellama and it seems the generation is not OK for temp > 0.

make -j speculative && ./speculative -m ./models/codellama-34b-instruct/ggml-model-f16.gguf -md ./models/codellama-7b/ggml-model-q4_1.gguf -p "[INST] Write Dijkstra algorithm in C++ (4 spaces indentation + detailed comments) + sample usage [/INST]" -e -ngl 99 -n 4096 -c 4096 -s 20 --top_k 40 --draft 5 --color -s 1 --no-penalize-nl --repeat-penalty 1.0 --temp 1.0

Here is what we get on master:

 [INST] Write Dijkstra algorithm in C++ (4 spaces indentation + detailed comments) + sample usage [/INST]  Here is a sample implementation of the Dijkstra algorithm in C++ with 4 space indentation and detailed comments:

[code]
#include <iostream>
#include <vector>
#include <queue>

// A struct to represent a node in the graph
struct Node {
    int id; // the id of the node
    int dist; // the distance of the node from the starting node
    int parent; // the id of the parent node
};

// A struct to represent an edge in the graph
struct Edge {
    int from; // the id of the starting node
    int to; // the id of the ending node
    int weight; // the weight of the edge
};

// A function to add an edge to the graph
void addEdge(std::vector<Node>& nodes, std::vector<Edge>& edges, int from, int to, int weight) {
    // add the edge to the edges vector
    edges.push_back(Edge{ from, to, weight });

    // add the nodes to the nodes vector
    if (nodes.size() < from) {
        nodes.resize(from);
    }
    if (nodes.size() < to) {
        nodes.resize(to);
    }
    nodes[from].id = from;
    nodes[from].dist = 0;
    nodes[from].parent = -1;
    nodes[to].id = to;
    nodes[to].dist = INT_MAX;
    nodes[to].parent = -1;
}

// A function to find the shortest path from the starting node to all other nodes using Dijkstra's algorithm
void dijkstra(std::vector<Node>& nodes, std::vector<Edge>& edges, int start) {
    // create a priority queue to hold the nodes
    std::priority_queue<Node, std::vector<Node>, NodeComparator> queue;

    // initialize the distance of the starting node to 0
    nodes[start].dist = 0;

    // enqueue the starting node
    queue.push(nodes[start]);

    while (!queue.empty()) {
        // dequeue the current node
        Node current = queue.top();
        queue.pop();

        // if the current node has not been visited, visit it
        if (current.id != -1) {
            // mark the current node as visited
            current.id = -1;

            // for each neighbor of the current node
            for (int i = 0; i < edges.size(); i++) {
                // if the neighbor has not been visited, visit it
                if (edges[i].to != current.id) {
                    // calculate the distance of the neighbor
                    int dist = current.dist + edges[i].weight;

                    // if the distance is less than the current distance, update the distance
                    if (dist < nodes[edges[i].to].dist) {
                        nodes[edges[i].to].dist = dist;
                        nodes[edges[i].to].parent = current.id;
                    }
                }
            }

            // enqueue the neighbors of the current node
            for (int i = 0; i < edges.size(); i++) {
                if (edges[i].to != current.id) {
                    queue.push(nodes[edges[i].to]);
                }
            }
        }
    }
}

int main() {
    // create a graph with 5 nodes
    std::vector<Node> nodes;
    nodes.resize(5);

    // add edges to the graph
    addEdge(nodes, edges, 0, 1, 1);
    addEdge(nodes, edges, 0, 2, 2);
    addEdge(nodes, edges, 1, 2, 3);
    addEdge(nodes, edges, 2, 3, 4);
    addEdge(nodes, edges, 3, 4, 5);

    // find the shortest path from node 0 to all other nodes
    dijkstra(nodes, edges, 0);

    // print the shortest path from node 0 to node 4
    std::cout << "Shortest path from node 0 to node 4: " << nodes[4].dist << std::endl;

    return 0;
}
[/code]
This implementation uses a priority queue to hold the nodes, and a vector to represent the edges. The `dijkstra` function takes the graph as input, and returns the shortest path from the starting node to all other nodes. The `addEdge` function adds an edge to the graph, and the `Node` and `Edge` structs represent a node and an edge in the graph, respectively.

To use this implementation, you can simply add edges to the graph using the `addEdge` function, and then call the `dijkstra` function to find the shortest path from the starting node to all other nodes. You can then print the shortest path from the starting node to a specific node using the `nodes` vector.

Note that this implementation assumes that the graph is directed and has positive edge weights. If the graph is undirected, you can simply add edges in both directions to the graph, and if the edge weights are negative, you can use a different data structure to represent the edges, such as a vector of pairs.

encoded   28 tokens in    0.350 seconds, speed:   80.076 t/s
decoded 1242 tokens in   72.499 seconds, speed:   17.131 t/s

n_draft   = 5
n_predict = 1242
n_drafted = 1035
n_accept  = 914
accept    = 88.309%

draft:

llama_print_timings:        load time =     438.57 ms
llama_print_timings:      sample time =    2063.97 ms /     1 runs   ( 2063.97 ms per token,     0.48 tokens per second)
llama_print_timings: prompt eval time =      59.95 ms /    28 tokens (    2.14 ms per token,   467.02 tokens per second)
llama_print_timings:        eval time =   16253.87 ms /  1363 runs   (   11.93 ms per token,    83.86 tokens per second)
llama_print_timings:       total time =   72849.27 ms /  1391 tokens

target:

llama_print_timings:        load time =    3038.51 ms
llama_print_timings:      sample time =      22.88 ms /  1242 runs   (    0.02 ms per token, 54278.47 tokens per second)
llama_print_timings: prompt eval time =   46517.41 ms /  1316 tokens (   35.35 ms per token,    28.29 tokens per second)
llama_print_timings:        eval time =    7679.47 ms /    74 runs   (  103.78 ms per token,     9.64 tokens per second)
llama_print_timings:       total time =   73315.93 ms /  1390 tokens

The result is the same after each run because the seed is fixed to -s 1.

With the PR, the generation is different each time and it seems to not end when it should:

 [INST] Write Dijkstra algorithm in C++ (4 spaces indentation + detailed comments) + sample usage [/INST]  Here is the code:

[code]
#include <stack>
#include <queue>
#include <iostream>
#include <climits>
#include <vector>
#include <algorithm>
#include <cmath>

using namespace std;

struct Node {
    int x;
    int y;
    int dist;
    Node(int x, int y, int dist) : x(x), y(y), dist(dist) {}
};

int find_min(vector<int> &dist, vector<bool> &visited) {
    int min = INT_MAX;
    int min_index = -1;
    for (int i = 0; i < dist.size(); i++) {
        if (!visited[i] && dist[i] < min) {
            min = dist[i];
            min_index = i;
        }
    }
    return min_index;
}

void dijkstra(vector<vector<int>> &graph, int start, int end) {
    int n = graph.size();
    vector<int> dist(n, INT_MAX);
    vector<bool> visited(n, false);
    vector<Node> nodes;
    dist[start] = 0;
    nodes.push_back(Node(start, 0, 0));
    while (!nodes.empty()) {
        Node node = nodes.back();
        nodes.pop_back();
        int u = node.x;
        int d = node.dist;
        visited[u] = true;
        for (int i = 0; i < graph[u].size(); i++) {
            int v = graph[u][i];
            if (dist[v] > dist[u] + d) {
                dist[v] = dist[u] + d;
                nodes.push_back(Node(v, dist[v], d));
            }
        }
    }
    cout << "Distance from " << start << " to " << end << " is " << dist[end] << endl;
}

int main() {
    int n;
    cin >> n;
    vector<vector<int>> graph(n, vector<int>());
    for (int i = 0; i < n; i++) {
        int m;
        cin >> m;
        for (int j = 0; j < m; j++) {
            int k;
            cin >> k;
            graph[i].push_back(k);
        }
    }
    int start, end;
    cin >> start >> end;
    dijkstra(graph, start, end);
    return 0;
}
[/code]

[Explanation]

The algorithm is pretty simple. First, we initialize the distance array with INT_MAX. Then, we push the starting node into the queue. We keep popping the nodes from the queue and updating the distance array and the queue. We keep doing this until the queue is empty.

[/Explanation]

Here is the sample usage:

5 5
0 1 2
1 2 1
2 3 1
3 4 1
4 0 1
0 4

[Solution]

Distance from 0 to 4 is 4

[/Solution]

[Explanation]

The graph is represented as a 2D array. The first dimension is the node number and the second dimension is the neighboring nodes.

[/Explanation]

[Optimization]

You can optimize the code by using a priority queue instead of a queue.

[/Optimization]

encoded   28 tokens in    0.350 seconds, speed:   80.102 t/s
decoded  863 tokens in   39.322 seconds, speed:   21.947 t/s

n_draft   = 5
n_predict = 863
n_drafted = 745
n_accept  = 713
accept    = 95.705%

draft:

llama_print_timings:        load time =     442.26 ms
llama_print_timings:      sample time =      13.07 ms /   745 runs   (    0.02 ms per token, 57018.22 tokens per second)
llama_print_timings: prompt eval time =      59.84 ms /    28 tokens (    2.14 ms per token,   467.91 tokens per second)
llama_print_timings:        eval time =   10561.18 ms /   895 runs   (   11.80 ms per token,    84.74 tokens per second)
llama_print_timings:       total time =   39672.41 ms /   923 tokens

target:

llama_print_timings:        load time =    3022.01 ms
llama_print_timings:      sample time =    1438.03 ms /   150 runs   (    9.59 ms per token,   104.31 tokens per second)
llama_print_timings: prompt eval time =   27311.96 ms /   921 tokens (   29.65 ms per token,    33.72 tokens per second)
llama_print_timings:        eval time =     101.06 ms /     1 runs   (  101.06 ms per token,     9.89 tokens per second)
llama_print_timings:       total time =   40142.13 ms /   922 tokens

Although it's consistently faster, I think the quality is noticeably affected; it's a bit difficult to demonstrate, though.

I think the first goal is to make the results deterministic when temp > 0 and the seed is fixed.

ggerganov · Feb 27 '24

@ggerganov

With the PR, the generation is different each time and it seems to not end when it should

It looks like this is caused by the randomly selected value r at examples/speculative/speculative.cpp#L232. Making this value consistent given the same seed should hopefully fix this behavior.

Regarding the quality, it might come from the fact that the output distribution is not equal to the target distribution due to top-1 drafting.

One question about the seed: given the same seed, should speculative output the same sequence as main? I'm not sure whether that's possible at the moment.

mscheong01 · Feb 27 '24

I tried a simple Python script to check whether the method works:

#!/usr/bin/env python3

import numpy as np
import random

SAMPLE_SIZE = 10000
VOCAB_SIZE = 10
BRANCHING_RATIO = 4
TREE_DEPTH = 4

X = np.arange(VOCAB_SIZE)

P_LLM  = np.exp(-X)
P_LLM /= np.sum(P_LLM)

P_SSM  = 1 / (100 - X)
P_SSM /= np.sum(P_SSM)


def sample(probs):
    x = np.random.rand()
    cumsum = 0.0
    for i, p_i in enumerate(probs):
        cumsum += p_i
        if x < cumsum:
            return i
    assert False


n_accept_naive = 0
n_accept_spec_infer = 0

for _ in range(SAMPLE_SIZE):
    trees = [[0]]

    for _ in range(TREE_DEPTH):
        trees_new = []
        for tree in trees:
            sampled_tokens = []
            while len(sampled_tokens) < BRANCHING_RATIO:
                token = sample(P_SSM)
                if token in sampled_tokens:
                    continue

                trees_new.append(tree + [token])
                sampled_tokens.append(token)

        trees = trees_new

    sequence_llm = [0] + [sample(P_LLM) for _ in range(TREE_DEPTH)]

    max_match = 0
    for tree in trees:
        for depth in range(TREE_DEPTH):
            if tree[1:depth+1] == sequence_llm[1:depth+1]:
                max_match = max(max_match, depth)

    n_accept_naive += max_match

    n_accept_i = 0
    norm = 1.0
    while trees and n_accept_i < TREE_DEPTH:
        random.shuffle(trees)

        token = trees[0][n_accept_i]
        if np.random.rand() < (P_LLM[token] / norm) / P_SSM[token]:
            trees = list(filter(lambda t: t[n_accept_i] == token, trees))

            n_accept_i += 1
            norm = 1.0
        else:
            trees = list(filter(lambda t: t[n_accept_i] != token, trees))

            norm -= P_LLM[token]

    n_accept_spec_infer += n_accept_i

print(f"naive: {n_accept_naive / SAMPLE_SIZE}")
print(f"SpecInfer: {n_accept_spec_infer / SAMPLE_SIZE}")

Edit: the above script has a bug!

The results are:

naive: 0.6027
SpecInfer: 3.2117

So assuming my code is correct the method does seem to work. What I think is happening is that even though the ultimate probability distribution of the LLM does not change, the conditional probability distribution of the LLM given the SSM results does change. In essence, because the probabilities of those tokens in the tree get scaled up by dividing them by the probabilities that they end up in the tree in the first place, there is a correlation between the tokens sampled by the SSM and the tokens sampled by the LLM. So even though the output distribution doesn't change the probability of the draft being correct does. It's similar to how the Metropolis-Hastings algorithm exploits autocorrelation to increase the rate of convergence over simple Monte Carlo methods.

JohannesGaessler · Feb 27 '24

Related discussion: https://github.com/flexflow/FlexFlow/issues/1302

I also noticed that my Python script had a bug regarding the normalization. This version should be fixed:

#!/usr/bin/env python3

import numpy as np
import random

SAMPLE_SIZE = 10000
VOCAB_SIZE = 10
BRANCHING_RATIO = 4
TREE_DEPTH = 4

X = np.arange(VOCAB_SIZE)

P_LLM  = np.exp(-X)
P_LLM /= np.sum(P_LLM)

P_SSM  = 1 / (100 - X)
P_SSM /= np.sum(P_SSM)


def sample(probs):
    x = np.random.rand()
    cumsum = 0.0
    for i, p_i in enumerate(probs):
        cumsum += p_i
        if x < cumsum:
            return i
    assert False


n_accept_naive = 0
n_accept_spec_infer = 0

for _ in range(SAMPLE_SIZE):
    trees = [[0]]

    # drafting: at each depth, extend every path with BRANCHING_RATIO distinct tokens sampled from the draft model (SSM)
    for _ in range(TREE_DEPTH):
        trees_new = []
        for tree in trees:
            sampled_tokens = []
            while len(sampled_tokens) < BRANCHING_RATIO:
                token = sample(P_SSM)
                if token in sampled_tokens:
                    continue

                trees_new.append(tree + [token])
                sampled_tokens.append(token)

        trees = trees_new

    # reference: sample a sequence of the same depth directly from the target (LLM) distribution
    sequence_llm = [0] + [sample(P_LLM) for _ in range(TREE_DEPTH)]

    # naive verification: length of the longest prefix of the LLM sequence found among the drafted paths
    max_match = 0
    for tree in trees:
        for depth in range(TREE_DEPTH):
            if tree[1:depth+1] == sequence_llm[1:depth+1]:
                max_match = max(max_match, depth)

    n_accept_naive += max_match

    # SpecInfer-style verification: pick a random remaining path, accept its next token with
    # probability p_llm / P_SSM, otherwise switch to the residual distribution and keep trying
    n_accept_i = 0
    p_llm = np.array(P_LLM)

    while trees and n_accept_i < TREE_DEPTH:
        random.shuffle(trees)

        token = trees[0][n_accept_i]
        if np.random.rand() < p_llm[token] / P_SSM[token]:
            trees = list(filter(lambda t: t[n_accept_i] == token, trees))

            n_accept_i += 1
            p_llm = np.array(P_LLM)
        else:
            trees = list(filter(lambda t: t[n_accept_i] != token, trees))

            p_llm = np.maximum(0.0, p_llm - P_SSM)
            p_llm /= np.sum(p_llm)

    n_accept_spec_infer += n_accept_i

print(f"naive: {n_accept_naive / SAMPLE_SIZE}")
print(f"SpecInfer: {n_accept_spec_infer / SAMPLE_SIZE}")

Edit: These are the fixed results:

naive: 0.5967
SpecInfer: 2.4448

JohannesGaessler · Feb 28 '24

Description of what I did in commit [94f6256]: I first attempted to fix the non-deterministic output by adding an srand() call tied to the provided seed value in commit [6afc1f6]. However, the output kept changing. On further investigation, it appeared that rand() calls from internal code interfered with keeping the randomly generated value r consistent. So I replaced the rand() call with a separate mt19937 + uniform distribution approach. Although I wasn't able to test this thoroughly due to my flimsy test hardware, multiple runs with the same seed appear to return identical output. I'll test it more when I get my hands on a GPU node this week.
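
Roughly, the idea in Python terms (an analogue only; the commit itself uses a C++ std::mt19937 seeded from the user-provided seed together with a uniform real distribution):

import random

class VerificationRNG:
    # A dedicated generator (instead of the global rand()) means unrelated random calls
    # elsewhere in the program cannot perturb the sequence of r values, so a fixed seed
    # yields the same accept/reject decisions on every run.
    def __init__(self, seed):
        self.rng = random.Random(seed)  # Mersenne Twister, analogous to std::mt19937

    def draw_r(self):
        # r is compared against p_target / p_draft when verifying a drafted token
        return self.rng.uniform(0.0, 1.0)

sampler = VerificationRNG(seed=1)
print([round(sampler.draw_r(), 3) for _ in range(3)])  # identical across runs with the same seed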

mscheong01 · Feb 28 '24

@JohannesGaessler I've implemented random selection of the sequences to verify in commit [2ad3f7c].

make -j && ./speculative -m models/llama-2-13b.Q5_K_M.gguf --model-draft models/llama-2-13b.Q5_K_M.gguf -p "Simple python quicksort function:\n\`\`\`python\n" -n 200 -e --color --log-enable --temp 1 -np 3 
(screenshot of the test output)

The logs show that the sequences to verify at each step are selected randomly:

(screenshot of the logs)

mscheong01 · Feb 29 '24