FedAgent

Federated Agent Reinforcement Learning

FedAgent Framework Overview

Canyu Chen1*, Kangyu Zhu2*, Zhaorun Chen3, Zhanhui Zhou4, Shizhe Diao5

Yiping Lu1, Tian Li3, Manling Li1†, Dawn Song4†

*Equal contribution, †Equal advising

1Northwestern University, 2Brown University, 3The University of Chicago, 4University of California, Berkeley, 5NVIDIA Research

Abstract

LLM-powered AI agents are trained on real-world user interactions, such as customer purchase queries in e-commerce (WebShop) or household routines in embodied environments (ALFWorld); such data is inherently private and cannot be centrally pooled. Can we train AI agents while protecting users' data privacy? We introduce FedAgent (Federated Agent Reinforcement Learning), a decentralized training paradigm in which each client runs local RL, sends only model parameters to a server for aggregation, and keeps all data local. Unlike conventional federated RL with low-dimensional state/action spaces, FedAgent tackles the unique challenges of natural-language state and action spaces, diverse tasks, and complex environment interactions. We construct FedAgentGym, the first decentralized agent learning environment, with four LLM agents, two application scenarios, three decentralized settings, and a new two-level agent heterogeneity framework: Task Heterogeneity (Preference, Coverage, and Hardness, i.e., what types of tasks each client holds, how many, and how hard they are) and Environment Heterogeneity (clients interact with fundamentally different worlds). Empirically, FedAgent matches centralized training and is robust to all three axes of task heterogeneity, but environment mismatch alone collapses the shared policy to near-zero success. We explain this theoretically via a convergence bound that decomposes error into stochastic noise, task-level drift, and environment-level drift, and via a σ²-Dominance Theory showing that the enormous gradient noise induced by exponential action spaces absorbs task divergence but cannot absorb the systematic conflict of environment divergence.

Algorithm

Algorithm 1: FedAgent with Client and Server Training
Require: Total clients $K$, rounds $T$, clients-per-round $M$, local steps $\tau$, learning rate $\eta$
Ensure: Final LLM-based global policy parameters $\theta_{\mathrm{final}}$
1: Initialize global policy parameters $\theta_{0}$ (an LLM)
2: for $t = 0$ to $T{-}1$ do
3:   Server: sample client subset $S_t \subset [K]$ with $|S_t|=M$ (uniform without replacement)
4:   Server: broadcast $\theta_t$ to all $k \in S_t$
5:   for each $k \in S_t$ in parallel do
6:     Set local iterate $\theta_{k,t,0} \gets \theta_t$
7:     for $i = 0$ to $\tau{-}1$ do
8:       Collect mini-batch of trajectories $B_{k,t,i}$ using policy $\pi_{\theta_{k,t,i}}$ in env $\mathcal{M}_k$
9:       Estimate policy gradient: $g_{k,t,i} \gets \nabla_{\theta} \hat{J}_k(\theta_{k,t,i};\, B_{k,t,i})$ (e.g., GRPO)
10:      Local update: $\theta_{k,t,i+1} \gets \theta_{k,t,i} + \eta\, g_{k,t,i}$
11:     end for
12:     Client returns local model $\theta_{k,t,\tau}$ (equivalently $\Delta\theta_{k,t}=\theta_{k,t,\tau}-\theta_t$)
13:   end for
14:   Server: Aggregation via model averaging:
$$\theta_{t+1} \gets \frac{1}{M}\sum_{k \in S_t} \theta_{k,t,\tau}$$
15: end for
16: return $\theta_{\mathrm{final}} \gets \theta_T$
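The round structure of Algorithm 1 can be sketched in plain Python/NumPy, treating the policy as a flat parameter vector and each client as a gradient oracle (a hypothetical stand-in for GRPO on mini-batches of trajectories from that client's environment); a minimal sketch, not the actual LLM training stack:

```python
import numpy as np

def fedagent_round(theta, clients, M, tau, eta, rng):
    """One communication round of FedAgent (sketch).

    `clients` is a list of callables; `clients[k](theta)` is assumed to
    return a local policy-gradient estimate for client k's environment
    (standing in for GRPO on sampled trajectories).
    """
    # Server: sample M clients uniformly without replacement.
    selected = rng.choice(len(clients), size=M, replace=False)
    local_models = []
    for k in selected:
        theta_k = theta.copy()          # broadcast theta_t to client k
        for _ in range(tau):            # tau local RL steps
            g = clients[k](theta_k)     # local gradient estimate g_{k,t,i}
            theta_k += eta * g          # gradient ascent on J_k
        local_models.append(theta_k)    # client returns theta_{k,t,tau}
    # Server: aggregate by uniform model averaging.
    return np.mean(local_models, axis=0)
```

Running this for $T$ rounds and returning the final average reproduces the outer loop of the algorithm.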

Client Partitioning Strategies

We define three novel client partitioning methods to systematically study heterogeneity challenges in decentralized agent learning.

Algorithm 2: PreferencePartition
Require: Category pools $\{\mathcal{I}_c\}$ with sizes $n_c$; clients $K$; per-client size $L$; jitter $\omega$
1: Compute global mix $p_c \gets n_c / \textstyle\sum n_j$ and logit anchors $\ell_c$
2: for $k = 1$ to $K$ do
3:   Sample $z_c \sim \mathcal{N}(\ell_c, \omega^2)$, softmax to get $q_c$ (larger $\omega$ $\Rightarrow$ higher variance)
4:   Draw category counts via $\mathrm{Multinomial}(L;\, q)$
5:   Sample items without replacement per category
6: end for
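The logit-jitter step of Algorithm 2 can be sketched as follows; the pool layout (`pools` mapping category to item ids) and the scope of without-replacement sampling (per client, within each category) are illustrative assumptions:

```python
import numpy as np

def preference_partition(pools, K, L, omega, rng):
    """Sketch of PreferencePartition: per-client category mixes from
    jittered logits. `pools` maps category -> list of item ids."""
    cats = list(pools)
    n = np.array([len(pools[c]) for c in cats], dtype=float)
    anchors = np.log(n / n.sum())            # logit anchors ell_c from global mix p_c
    clients = []
    for _ in range(K):
        z = rng.normal(anchors, omega)       # jitter: z_c ~ N(ell_c, omega^2)
        q = np.exp(z - z.max())
        q /= q.sum()                         # softmax -> client-specific mix q_c
        counts = rng.multinomial(L, q)       # category counts for this client
        items = []
        for c, m in zip(cats, counts):
            m = min(m, len(pools[c]))        # clip if a category pool is small
            idx = rng.choice(len(pools[c]), size=m, replace=False)
            items.extend(pools[c][i] for i in idx)
        clients.append(items)
    return clients
```

A larger `omega` spreads the jittered logits further from the anchors, producing more skewed per-client category mixes, matching the role of $\omega$ in the algorithm.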
Algorithm 3: CoveragePartition
Require: Items $N$; clients $K$; bounds $(L_{\min}, L_{\mathrm{avg}}, L_{\max})$; dispersion $\xi$; overlap $r$
1: Compute total assignments $T = \lfloor rN \rfloor$
2: Sample relative client sizes from $\mathrm{Beta}(\alpha, \beta)$ with concentration set by $\xi$ (larger $\xi$ $\Rightarrow$ lower variance)
3: Rescale and round sizes to sum $T$ within bounds
4: for each item do
5:   Assign the item to a client with $\Pr(k) \propto \mathrm{rem}_k$, client $k$'s remaining capacity
6: end for
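The size-sampling step of Algorithm 3 can be sketched as below. Using a symmetric $\mathrm{Beta}(\xi, \xi)$ and dropping $L_{\mathrm{avg}}$ are simplifications for illustration, and the fix-up loop assumes $T$ is achievable within the bounds:

```python
import numpy as np

def coverage_sizes(N, K, L_min, L_max, xi, r, rng):
    """Sketch of CoveragePartition's size sampling: draw K client sizes
    from a Beta distribution whose concentration grows with xi (larger
    xi -> lower variance), then rescale/round so they sum to
    T = floor(r * N) while staying within [L_min, L_max]."""
    T = int(r * N)                              # total assignments
    u = rng.beta(xi, xi, size=K)                # symmetric Beta draws in (0, 1)
    sizes = L_min + u * (L_max - L_min)         # map draws to [L_min, L_max]
    sizes = np.round(sizes * T / sizes.sum()).astype(int)
    sizes = np.clip(sizes, L_min, L_max)
    # Greedy fix-up so sizes sum exactly to T (assumes T is feasible).
    diff = T - sizes.sum()
    step = 1 if diff > 0 else -1
    i = 0
    while diff != 0:
        k = i % K
        if L_min <= sizes[k] + step <= L_max:
            sizes[k] += step
            diff -= step
        i += 1
    return sizes
```

Items would then be placed into clients in proportion to each client's remaining capacity, per line 5 of the algorithm.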
Algorithm 4: HardnessPartition
Require: Items $N$; sets $\mathcal{S}$ (previously successful), $\mathcal{U}$ (previously unsuccessful); clients $K$; size $L$; dispersion $\xi'$
1: Partition $\mathcal{S}$ into easy subsets $\{Y_k\}$ via CoveragePartition (larger $\xi'$ $\Rightarrow$ lower variance)
2: for $k = 1$ to $K$ do
3:   Fill the remaining $L - |Y_k|$ slots with items $F_k$ drawn from $\mathcal{U}$
4:   $X_k \gets Y_k \cup F_k$
5: end for
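Algorithm 4 as a whole can be sketched as follows; the Beta-controlled split of the success set is a simplified stand-in for a full CoveragePartition call, and the fail pool is assumed large enough to fill every client:

```python
import numpy as np

def hardness_partition(success, fail, K, L, xi_prime, rng):
    """Sketch of HardnessPartition: split easy (previously successful)
    items across clients, then top each client up to size L with hard
    (previously failed) items."""
    # Beta-controlled share of easy items per client; larger xi_prime
    # concentrates the draws -> lower variance across clients.
    u = rng.beta(xi_prime, xi_prime, size=K)
    frac = u / u.sum()
    easy_counts = np.minimum((frac * len(success)).astype(int), L)
    pool = list(success)
    rng.shuffle(pool)
    clients, pos = [], 0
    for k in range(K):
        m = min(int(easy_counts[k]), len(pool) - pos)
        easy = pool[pos:pos + m]           # Y_k: easy items for client k
        pos += m
        # F_k: fill the remaining L - |Y_k| slots with hard items.
        hard = [fail[i] for i in rng.choice(len(fail), size=L - m, replace=False)]
        clients.append(easy + hard)        # X_k = Y_k  union  F_k
    return clients
```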

Main Results

Performance comparison of Local Training, Centralized Training, and FedAgent across different LLM agents on ALFWorld and WebShop benchmarks.


Qwen2.5-1.5B-Instruct

Method               ALFWorld (Success Rate %)                          WebShop
                     Pick   Look   Clean  Heat   Cool   Pick2  All      Score  Succ.
Local (Client 21)    42.9   25.0   38.5   37.5   14.3   14.3   29.7     69.9   57.0
Local (Client 42)    50.0   37.5   76.9   25.0   42.9   14.3   45.3     75.1   53.1
Local (Client 84)    50.0   37.5   46.2   25.0   28.6    0.0   34.4     72.7   47.7
Centralized          64.3   37.5   69.2   50.0   42.9   28.6   51.6     79.9   57.8
FedAgent             80.0   75.0   53.8   37.5   83.3   50.0   64.1     83.2   61.7

Qwen2.5-3B-Instruct

Method               ALFWorld (Success Rate %)                          WebShop
                     Pick   Look   Clean  Heat   Cool   Pick2  All      Score  Succ.
Local (Client 21)    41.5   12.5   34.9   51.0   18.9   21.2   31.3     59.8   55.0
Local (Client 42)    46.5   37.5   24.4   15.0   33.7   33.3   28.2     61.3   59.3
Local (Client 84)    22.8   27.5   39.1   46.3   48.3   36.5   29.9     77.6   58.6
Centralized          94.1   80.0   64.3   42.9   50.0   22.2   62.5     86.0   63.9
FedAgent             95.5   62.5   49.7   47.5   85.3   45.1   65.2     85.5   63.1

Qwen2.5-7B-Instruct

Method               ALFWorld (Success Rate %)                          WebShop
                     Pick   Look   Clean  Heat   Cool   Pick2  All      Score  Succ.
Local (Client 21)    35.5   25.0   61.0   25.9   35.8   45.2   38.4     70.9   49.2
Local (Client 42)    29.0   45.0   18.8   25.6   15.9   38.0   42.1     85.2   33.6
Local (Client 84)    34.7   47.5   44.4   51.3   40.1   21.8   35.7     60.6   39.3
Centralized          93.7   82.5   71.5   47.9   63.2   31.9   73.3     78.8   64.7
FedAgent             94.5   85.0   56.0   62.5   86.7   42.8   75.5     89.0   68.9

Llama-3.2-3B-Instruct

Method               ALFWorld (Success Rate %)                          WebShop
                     Pick   Look   Clean  Heat   Cool   Pick2  All      Score  Succ.
Local (Client 21)    39.8   50.0   17.9   40.0   20.7   34.0   38.1     65.3   50.5
Local (Client 42)    18.2   55.0   41.9   34.3   41.0   25.0   35.0     67.0   51.0
Local (Client 84)    29.9   32.5   39.0   18.9   18.8   37.6   29.7     70.2   55.7
Centralized          72.4   62.5   59.3   45.2   53.7   27.9   54.9     76.3   56.2
FedAgent             83.7   57.5   60.6   55.9   65.3   24.9   61.2     74.4   57.8

Experiments

FedAgent Matches Centralized Training

FedAgent with 100 clients and 2 selected per round achieves comparable success rates to centralized training on all pooled data, despite never sharing local data.

FedAgent vs Centralized Training

Training Dynamics in Different Decentralized Settings

FedAgent is highly sensitive to the number of clients selected per round and to the number of epochs per client per round, while being relatively insensitive to the number of samples per client.

Training Dynamics of FedAgent in Different Decentralized Settings

Two-Level Heterogeneity Analysis

Task mismatch alone barely affects training, but environment mismatch collapses the shared policy to near-zero, revealing the critical role of environment-level heterogeneity.

(a) WebShop: Task vs. Environment Heterogeneity

(b) ALFWorld: Task vs. Environment Heterogeneity

Theoretical Analysis

Why is FedAgent robust to task heterogeneity but fragile to environment heterogeneity?

Convergence Bound

We derive a convergence bound for FedAgent that decomposes the optimization error into three interpretable terms:

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla J(\theta_t)\|^2 \;\leq\; \underbrace{\mathcal{O}\!\left(\frac{\sigma^2}{MT\tau}\right)}_{\color{#2e7d32}{\text{(I) Stochastic noise}}} \;+\; \underbrace{\mathcal{O}\!\left(\tau^2 L^2 \delta_{\text{data}}^2\right)}_{\color{#1565c0}{\text{(II) Task-level drift}}} \;+\; \underbrace{\mathcal{O}\!\left(\tau^2 L^2 \delta_{\text{env}}^2\right)}_{\color{#c62828}{\text{(III) Env-level drift}}}$$
$\sigma^2$ (gradient noise): how noisy is each gradient?
$\delta_{\text{data}}^2$ (task divergence): how different are clients' tasks?
$\delta_{\text{env}}^2$ (environment divergence): how different are clients' worlds?

$\sigma^2$-Dominance Theory

In LLM agent settings, the exponential action space makes gradient noise $\sigma^2$ enormous. This has a surprising consequence:

$\sigma^2$ (gradient noise): enormous, growing toward $\infty$ with the exponential action space.
$\delta_{\text{data}}^2$ (task divergence): bounded, and absorbed by the noise.
$\delta_{\text{env}}^2$ (environment divergence): unbounded; a systematic conflict that the noise cannot absorb.

Key insight: FedAgent is intrinsically robust to task heterogeneity but fragile to environment heterogeneity.
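A toy calculation makes the dominance argument concrete. The function below evaluates the three bound terms with all hidden constants set to 1 (purely illustrative parameter values, not the paper's actual constants):

```python
def bound_terms(sigma2, delta_data2, delta_env2, M, T, tau, L):
    """Illustrative magnitudes of the three convergence-bound terms,
    with all hidden constants set to 1 (a toy calculation)."""
    noise = sigma2 / (M * T * tau)        # (I) stochastic noise
    task = tau**2 * L**2 * delta_data2    # (II) task-level drift
    env = tau**2 * L**2 * delta_env2      # (III) environment-level drift
    return noise, task, env
```

With an enormous $\sigma^2$, term (I) dwarfs the bounded task-drift term (II), so adding task heterogeneity barely changes the bound; an unbounded $\delta_{\text{env}}^2$, by contrast, can make term (III) exceed everything else.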

Client Distributions under Partitioning Strategies

We visualize the client distributions under our three partitioning strategies across both WebShop and ALFWorld benchmarks.

Preference Preference Heterogeneity

Each client draws items from a client-specific categorical distribution over task types, controlled by jitter parameter ω. A larger ω produces more skewed per-client category mixes, modeling scenarios where different users strongly prefer distinct types of tasks.

WebShop Preference Heterogeneity

WebShop (ω = 0.1 vs. ω = 0.9)

ALFWorld Preference Heterogeneity

ALFWorld (ω = 0.1 vs. ω = 0.9)

Coverage Coverage Heterogeneity

Each client receives a different number of task instructions, controlled by dispersion parameter ξ. A smaller ξ leads to higher variance in client dataset sizes, modeling scenarios where some users have extensive task collections while others have very few.

WebShop Coverage Heterogeneity

WebShop (ξ = 1 vs. ξ = 256)

ALFWorld Coverage Heterogeneity

ALFWorld (ξ = 1 vs. ξ = 256)

Hardness Hardness Heterogeneity

Each client receives a different mix of easy (previously successful) and hard (previously unsuccessful) tasks, controlled by parameter ξ'. A smaller ξ' leads to higher variance in difficulty levels across clients, modeling scenarios where some users face predominantly easy tasks while others encounter mostly hard tasks.

WebShop Hardness Heterogeneity

WebShop (ξ' = 1 vs. ξ' = 256)

ALFWorld Hardness Heterogeneity

ALFWorld (ξ' = 1 vs. ξ' = 256)

Citation

If you find our work useful in your research, please cite:

@article{chen2025fedagent,
  title={Federated Agent Reinforcement Learning},
  author={Canyu Chen and Kangyu Zhu and Zhaorun Chen and Zhanhui Zhou and Shizhe Diao and Yiping Lu and Tian Li and Manling Li and Dawn Song},
  year={2026},
  journal={arXiv preprint},
}