RL Asynchronous or in-loop tool-calling

Is your feature request related to a problem? Please describe.
Add native support in NeMo RL for asynchronous or in-loop tool calling, where the inference engine can autonomously invoke external tools during generation and seamlessly resume once tool responses arrive.

This enables more efficient training with tool-augmented models (e.g., using search, code execution, or retrieval tools) by allowing:

Non-blocking generation — other generations can continue while a tool response is pending.
Immediate resume — generation automatically continues once a tool result is received.
Improved throughput — critical for training with slower tools (responses may take 1–2 minutes).
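To make the throughput point concrete, here is a minimal Python sketch that uses only the standard library's asyncio. It compares awaiting simulated tool calls one rollout at a time with scheduling all rollouts concurrently. The names `call_tool`, `rollout`, `run_blocking`, `run_non_blocking` and `timed` are hypothetical, `asyncio.sleep` stands in for a slow tool, and no NeMo RL or vLLM API is used.

```python
import asyncio
import time

TOOL_LATENCY_S = 0.1  # hypothetical stand-in for a slow tool (real ones may take minutes)


async def call_tool(sample_id: int) -> str:
    """Hypothetical slow tool: it awaits, so it does not block the event loop."""
    await asyncio.sleep(TOOL_LATENCY_S)
    return f"result-{sample_id}"


async def rollout(sample_id: int) -> str:
    # Generation would happen here; the sample then suspends on its tool call.
    return await call_tool(sample_id)


async def run_blocking(n: int) -> list[str]:
    # Each rollout starts only after the previous rollout's tool call returns.
    return [await rollout(i) for i in range(n)]


async def run_non_blocking(n: int) -> list[str]:
    # All rollouts are scheduled together, so their pending tool calls overlap.
    return await asyncio.gather(*(rollout(i) for i in range(n)))


def timed(coro_fn, n: int) -> tuple[list[str], float]:
    """Run one strategy to completion and return its results and wall-clock time."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(n))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    _, seq_time = timed(run_blocking, 8)
    _, par_time = timed(run_non_blocking, 8)
    print(f"blocking: {seq_time:.2f}s, non-blocking: {par_time:.2f}s")
```

In this idealized model (generation cost ignored), the blocking run takes about n times the tool latency, while the non-blocking run takes about one tool latency, so the savings grow directly with how slow the tool is.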

Describe the solution you'd like
Integrate or expose an async tool-calling interface in NeMo RL's inference/training stack that allows tool calls to suspend, await responses, and resume generation automatically without blocking other rollouts or samples.
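The requested suspend/await/resume behavior can be sketched as a per-sample multi-turn loop. This is a minimal illustration under stated assumptions, not NeMo RL's implementation: `fake_generate` is a hypothetical stand-in for an async inference engine, `run_tool` is a hypothetical slow tool, and the `<tool_call>`/`<tool_result>` tags with a JSON payload are an assumed format. Because each rollout awaits its own tool call, the event loop keeps advancing the other rollouts while that one is suspended.

```python
import asyncio
import json
import re

# Hypothetical tool-call format; real models and parsers use their own conventions.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


async def fake_generate(context: str) -> str:
    """Hypothetical async engine call: emits a tool call first, then a final answer."""
    await asyncio.sleep(0.01)
    if "<tool_result>" not in context:
        return '<tool_call>{"name": "search", "query": "NeMo RL"}</tool_call>'
    return "Final answer based on the tool result."


async def run_tool(name: str, args: dict) -> str:
    """Hypothetical slow external tool."""
    await asyncio.sleep(0.05)
    return f"{name} results for {args.get('query')!r}"


async def rollout(prompt: str, max_turns: int = 4) -> str:
    context = prompt
    for _ in range(max_turns):
        output = await fake_generate(context)
        context += output
        match = TOOL_CALL_RE.search(output)
        if match is None:
            return context  # no tool call: the episode is finished
        call = json.loads(match.group(1))
        name = call.pop("name")
        # Suspend only this rollout until the tool responds, then resume generating.
        result = await run_tool(name, call)
        context += f"<tool_result>{result}</tool_result>"
    return context


async def main(prompts: list[str]) -> list[str]:
    # Run all rollouts concurrently; gather preserves the input order.
    return await asyncio.gather(*(rollout(p) for p in prompts))


if __name__ == "__main__":
    for transcript in asyncio.run(main(["Q1: ", "Q2: "])):
        print(transcript)
```

Here generation "stops" at the end of each tool call, the tool result is appended to the context, and the next turn resumes from the extended context. A real integration would more likely end each turn with a stop sequence on the inference engine.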

Describe alternatives you've considered
vLLM supports function calling (tool calling): you can pass tool (function) definitions with a request, and the model can generate a function call during inference. However, vLLM itself does not automatically suspend generation while a tool result is pending and resume once it arrives. Its streaming API does make it possible to implement this externally.
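As a sketch of that external approach, the function below consumes a stream of text deltas, stops reading as soon as a complete tool call has been emitted, and returns the text up to that point together with the parsed call. The caller can then run the tool and resume generation with the result appended. `fake_stream`, `read_until_tool_call` and the `<tool_call>` JSON format are hypothetical; no vLLM API is called here, and a real integration would also need to abort the in-flight request on the engine side.

```python
import asyncio
import json
from collections.abc import AsyncIterator
from typing import Optional

START_TAG = "<tool_call>"  # hypothetical tool-call delimiters
END_TAG = "</tool_call>"


async def fake_stream(chunks: list[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming generation API that yields text deltas."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop, like real I/O
        yield chunk


async def read_until_tool_call(
    stream: AsyncIterator[str],
) -> tuple[str, Optional[dict]]:
    """Accumulate streamed text and stop as soon as a complete tool call appears."""
    text = ""
    async for delta in stream:
        text += delta
        start = text.find(START_TAG)
        end = text.find(END_TAG, start + len(START_TAG)) if start != -1 else -1
        if end != -1:
            payload = json.loads(text[start + len(START_TAG):end])
            # Drop anything generated after the call; generation resumes later
            # from this prefix plus the tool result.
            return text[: end + len(END_TAG)], payload
    return text, None  # stream ended without a tool call
```

Note that tags can be split across deltas (for example `<tool` and `_call>`), which is why the sketch searches the accumulated text rather than each delta on its own.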

Additional context
This request originated from our engagement with the post-training team at the Allen Institute for AI (Ai2), as part of NVIDIA's partnership with Ai2 (partnership announcement).

Oct 17 '25 19:10 sugsharma

@parthchadha It feels like we have some of the pieces, but maybe not everything in a single place. What is missing?

Oct 17 '25 20:10 snowmanwwg

@parthchadha It would be great to hear your thoughts so we can find a way to enable fully asynchronous (in-loop) tool calling in NeMo RL as described above.

Oct 30 '25 18:10 sugsharma

@sugsharma NeMo RL already uses async vLLM engines, and tool calls are pipelined so that other generations are not blocked. https://github.com/NVIDIA-NeMo/RL/blob/main/examples/configs/grpo_math_1B.yaml#L211 controls the async engine behavior; see https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/experience/rollouts.py#L773 for multi-turn async rollouts.

Oct 30 '25 19:10 parthchadha

Hi @parthchadha @snowmanwwg, thanks for the pointers. Do we have any sample notebooks or tutorials that use this code to show how to implement asynchronous (in-loop) tool calling with NeMo RL?

Nov 06 '25 16:11 sugsharma

@sugsharma We don't. Do you want to help us with that? :)

Dec 03 '25 07:12 snowmanwwg