
Clarification on GPU thread-group sizing, GPU utilization, and runtime estimation in MMseqs2-GPU


I am using MMseqs2-GPU for large-scale MSA generation (millions of queries) and would like to better understand GPU behavior so I can estimate runtime and GPU-hours accurately.

  1. CPU --threads vs GPU threads
  • Can you confirm that --threads controls only CPU threads and does not affect GPU threads?
  2. GPU cores / SMs / thread-groups
  • The GPU paper states that MMseqs2-GPU uses thread-groups of size 4, 8, 16, or 32, each processing one alignment/tile. How is the thread-group size chosen internally? Does it depend on query length, target length, tile size, GPU architecture, or a fixed heuristic?
  • Does MMseqs2-GPU always attempt to fully occupy all SMs, or can some SMs remain idle depending on sequence lengths or batch size?
  3. Runtime / TCUPS estimation in practice
  • The GPU paper reports TCUPS using synthetic databases where query and target lengths match. For real databases (e.g., UniRef30 2023, envDB 202108), is there a recommended method to estimate runtime from TCUPS?
  4. VRAM usage and host-memory streaming
  • When MMseqs2-GPU requires more VRAM than is available, does it spill to host memory?
  5. Best practices for large-batch GPU searches
  • Are there recommended batch sizes to maximize GPU utilization and minimize MSA computation time?
  • Aside from --gpu and --threads, are there GPU-specific user-settable parameters?

Thank you very much! Having clarity on these points would be extremely helpful for accurate GPU resource planning.

slee-ai, Nov 26 '25 02:11

I can give you some insights regarding the GPU implementation.

The selected matrix tile size and its corresponding group size / number of items per thread depend on the query length and the GPU architecture. We performed a grid search to find the best-performing group size / number of items per tile size for different GPU architectures. Our tuning configs are located here: https://github.com/soedinglab/MMseqs2/tree/master/lib/libmarv/tuningconfigs and are applied here: https://github.com/soedinglab/MMseqs2/blob/master/lib/libmarv/src/gapless_kernel_config.cuh
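
For illustration only, a minimal Python sketch of such a lookup keyed by architecture and query length might look like the following. This is not the actual MMseqs2 code (the real per-architecture tables live in the tuningconfigs files linked above), and all numbers are invented:

```python
# Hypothetical sketch of tuned-kernel selection, NOT the actual MMseqs2 code.
# The real per-architecture tables live in lib/libmarv/tuningconfigs and are
# consumed by gapless_kernel_config.cuh; all numbers below are invented.

# Entries: (max_query_length, group_size, items_per_thread), sorted by length,
# with tile size = group_size * items_per_thread.
TUNING_TABLE = {
    "sm_80": [(64, 4, 16), (128, 8, 16), (256, 16, 16), (1024, 32, 32)],
    "sm_90": [(80, 4, 20), (160, 8, 20), (320, 16, 20), (1024, 32, 32)],
}

def select_kernel_config(arch: str, query_length: int):
    """Pick the smallest tile whose capacity covers the query length."""
    for max_len, group_size, items in TUNING_TABLE[arch]:
        if query_length <= max_len:
            return group_size, items
    # Fall back to the largest tile; longer queries need multiple passes.
    return TUNING_TABLE[arch][-1][1:]

print(select_kernel_config("sm_80", 250))  # -> (16, 16), i.e. tile size 256
```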

We launch as many thread blocks as required. For any common database, you can assume that all SMs will be occupied.
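
As a rough illustration with hypothetical numbers (one thread group per alignment; actual block residency depends on the kernel), even a moderate database launches orders of magnitude more blocks than a GPU has SMs:

```python
import math

# Rough occupancy illustration with hypothetical numbers: one thread group
# per alignment and many groups per block means common databases launch far
# more blocks than a GPU has SMs.
num_targets = 10_000_000        # target sequences in one pass
groups_per_block = 512 // 16    # e.g. 512 threads per block, group size 16
num_blocks = math.ceil(num_targets / groups_per_block)
num_sms = 108                   # e.g. an A100
print(num_blocks, num_blocks >= num_sms)  # -> 312500 True
```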

There is no ready equation, but here are some thoughts. Performance depends on both the query length and the target lengths in the database. If a tile size T achieves a performance of X TCUPS, then using this tile size for a query of length L <= T yields at most (L/T) * X TCUPS. Furthermore, performance decreases if targets have different lengths, since this can cause load imbalance between warps/thread blocks. We try to minimize this effect by sorting the database sequences by length during database construction.
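
To make that scaling argument concrete, here is a back-of-envelope sketch in Python; the tile size and peak TCUPS values are placeholders, not measured figures:

```python
# Back-of-envelope estimate from the scaling argument above. NOT an official
# MMseqs2 estimator; the tile size and peak TCUPS are placeholder numbers.

def effective_tcups(query_len: int, tile_size: int, peak_tcups: float) -> float:
    """A query of length L <= T reaches at most (L / T) * X TCUPS."""
    return (min(query_len, tile_size) / tile_size) * peak_tcups

def estimate_seconds(query_len: int, db_residues: float,
                     tile_size: int, peak_tcups: float) -> float:
    """Cell updates = query_len * db_residues, divided by the effective rate."""
    cups = effective_tcups(query_len, tile_size, peak_tcups) * 1e12
    return (query_len * db_residues) / cups

# Example: 250-residue query, 20e9 database residues, tile size 256, 12 TCUPS.
print(estimate_seconds(250, 20e9, 256, 12.0))  # -> ~0.43 s per query
```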

If the database does not fit into GPU memory, the database is processed in chunks. Processing of chunk C overlaps with the CPU->GPU transfer of chunk C+1. Depending on query length, the performance may then be limited by the transfer speed. For example, say we can achieve a transfer rate of 20×10^9 amino acids per second and the query length is 250. Processing one second's worth of transferred residues at 12 TCUPS takes (20×10^9 × 250) ÷ (12×10^12) ≈ 0.42 seconds, which is less than the 1 second of transfer time, so the search is transfer-bound.
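
The same numbers can be plugged into a small sketch that checks whether a chunked search is compute-bound or transfer-bound (illustrative values only, not measured MMseqs2 figures):

```python
# Sketch of the compute-vs-transfer trade-off for chunked processing; the
# 20e9 aa/s rate and 12 TCUPS are the example numbers from above, not
# measured MMseqs2 figures.

def chunk_times(chunk_residues: float, query_len: int,
                tcups: float, transfer_rate: float):
    compute = (chunk_residues * query_len) / (tcups * 1e12)
    transfer = chunk_residues / transfer_rate
    return compute, transfer

compute, transfer = chunk_times(20e9, 250, 12.0, 20e9)
print(f"compute {compute:.2f}s vs transfer {transfer:.2f}s")
# -> compute 0.42s vs transfer 1.00s: transfer-bound, so throughput is
#    effectively capped at transfer_rate * query_len cell updates per second.
```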

fkallen, Nov 27 '25 08:11