FedAgent

Federated Agent Reinforcement Learning

FedAgent Framework Overview

Canyu Chen1*, Kangyu Zhu2*, Zhaorun Chen3, Zhanhui Zhou4, Shizhe Diao5

Yiping Lu1, Tian Li3, Manling Li1†, Dawn Song4†

*Equal contribution, †Equal advising

1Northwestern University, 2Brown University, 3The University of Chicago, 4University of California, Berkeley, 5NVIDIA Research

Abstract

LLM-powered AI agents are trained on real-world user interactions, such as customer purchase queries in e-commerce (WebShop) or household routines in embodied environments (ALFWorld); such data is inherently private and cannot be centrally pooled. Can we train AI agents while protecting users' data privacy? We introduce FedAgent (Federated Agent Reinforcement Learning), a decentralized training paradigm in which each client runs local RL, sends only model parameters to a server for aggregation, and keeps all data local. Unlike conventional federated RL with low-dimensional state/action spaces, FedAgent tackles the unique challenges of natural-language state and action spaces, diverse tasks, and complex environment interactions. We construct FedAgentGym, the first decentralized agent learning environment, with four LLM agents, two application scenarios, three decentralized settings, and a new two-level agent heterogeneity framework: Task Heterogeneity (Preference, Coverage, and Hardness, i.e., what types of tasks each client holds, how many, and how hard they are) and Environment Heterogeneity (clients interact with fundamentally different worlds). Empirically, FedAgent matches centralized training and is robust to all three axes of task heterogeneity, but environment mismatch alone collapses the shared policy to near-zero success. We explain this theoretically via a convergence bound that decomposes error into stochastic noise, task-level drift, and environment-level drift, and via a σ²-Dominance Theory showing that the enormous gradient noise induced by exponential action spaces absorbs task divergence but cannot absorb the systematic conflict of environment divergence.

Algorithm

Algorithm 1: FedAgent with Client and Server Training
Require: Total clients $K$, rounds $T$, clients-per-round $M$, local steps $\tau$, learning rate $\eta$
Ensure: Final LLM-based global policy parameters $\theta_{\mathrm{final}}$
1: Initialize global policy parameters $\theta_{0}$ (an LLM)
2: for $t = 0$ to $T{-}1$ do
3:   Server: sample client subset $S_t \subset [K]$ with $|S_t|=M$ (uniform without replacement)
4:   Server: broadcast $\theta_t$ to all $k \in S_t$
5:   for each $k \in S_t$ in parallel do
6:     Set local iterate $\theta_{k,t,0} \gets \theta_t$
7:     for $i = 0$ to $\tau{-}1$ do
8:       Collect mini-batch of trajectories $B_{k,t,i}$ using policy $\pi_{\theta_{k,t,i}}$ in env $\mathcal{M}_k$
9:       Estimate policy gradient: $g_{k,t,i} \gets \nabla_{\theta} \hat{J}_k(\theta_{k,t,i};\, B_{k,t,i})$ (e.g., GRPO)
10:      Local update: $\theta_{k,t,i+1} \gets \theta_{k,t,i} + \eta\, g_{k,t,i}$
11:     end for
12:     Client returns local model $\theta_{k,t,\tau}$ (equivalently $\Delta\theta_{k,t}=\theta_{k,t,\tau}-\theta_t$)
13:   end for
14:   Server: Aggregation via model averaging:
$$\theta_{t+1} \gets \frac{1}{M}\sum_{k \in S_t} \theta_{k,t,\tau}$$
15: end for
16: return $\theta_{\mathrm{final}} \gets \theta_T$
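The round structure of Algorithm 1 can be sketched in plain Python/NumPy, treating the policy as a flat parameter vector and each client as a gradient oracle (a hypothetical stand-in for GRPO on mini-batches of trajectories from that client's environment); a minimal sketch, not the actual LLM training stack:

```python
import numpy as np

def fedagent_round(theta, clients, M, tau, eta, rng):
    """One communication round of FedAgent (sketch).

    `clients` is a list of callables; `clients[k](theta)` is assumed to
    return a local policy-gradient estimate for client k's environment
    (standing in for GRPO on sampled trajectories).
    """
    # Server: sample M clients uniformly without replacement.
    selected = rng.choice(len(clients), size=M, replace=False)
    local_models = []
    for k in selected:
        theta_k = theta.copy()          # broadcast theta_t to client k
        for _ in range(tau):            # tau local RL steps
            g = clients[k](theta_k)     # local gradient estimate g_{k,t,i}
            theta_k += eta * g          # gradient ascent on J_k
        local_models.append(theta_k)    # client returns theta_{k,t,tau}
    # Server: aggregate by uniform model averaging.
    return np.mean(local_models, axis=0)
```

Running this for $T$ rounds and returning the final average reproduces the outer loop of the algorithm.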

Client Partitioning Strategies

We define three novel client partitioning methods to systematically study heterogeneity challenges in decentralized agent learning.

Algorithm 2: PreferencePartition
Require: Category pools $\{\mathcal{I}_c\}$ with sizes $n_c$; clients $K$; per-client size $L$; jitter $\omega$
1: Compute global mix $p_c \gets n_c / \textstyle\sum n_j$ and logit anchors $\ell_c$
2: for $k = 1$ to $K$ do
3:   Sample $z_c \sim \mathcal{N}(\ell_c, \omega^2)$, softmax to get $q_c$ (larger $\omega$ $\Rightarrow$ higher variance)
4:   Draw category counts via $\mathrm{Multinomial}(L;\, q)$
5:   Sample items without replacement per category
6: end for
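The logit-jitter step of Algorithm 2 can be sketched as follows; the pool layout (`pools` mapping category to item ids) and the scope of without-replacement sampling (per client, within each category) are illustrative assumptions:

```python
import numpy as np

def preference_partition(pools, K, L, omega, rng):
    """Sketch of PreferencePartition: per-client category mixes from
    jittered logits. `pools` maps category -> list of item ids."""
    cats = list(pools)
    n = np.array([len(pools[c]) for c in cats], dtype=float)
    anchors = np.log(n / n.sum())            # logit anchors ell_c from global mix p_c
    clients = []
    for _ in range(K):
        z = rng.normal(anchors, omega)       # jitter: z_c ~ N(ell_c, omega^2)
        q = np.exp(z - z.max())
        q /= q.sum()                         # softmax -> client-specific mix q_c
        counts = rng.multinomial(L, q)       # category counts for this client
        items = []
        for c, m in zip(cats, counts):
            m = min(m, len(pools[c]))        # clip if a category pool is small
            idx = rng.choice(len(pools[c]), size=m, replace=False)
            items.extend(pools[c][i] for i in idx)
        clients.append(items)
    return clients
```

A larger `omega` spreads the jittered logits further from the anchors, producing more skewed per-client category mixes, matching the role of $\omega$ in the algorithm.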
Algorithm 3: CoveragePartition
Require: Items $N$; clients $K$; bounds $(L_{\min}, L_{\mathrm{avg}}, L_{\max})$; dispersion $\xi$; overlap $r$
1: Compute total assignments $T = \lfloor rN \rfloor$
2: Sample relative client sizes from $\mathrm{Beta}(\alpha, \beta)$ with concentration set by $\xi$ (larger $\xi$ $\Rightarrow$ lower variance)
3: Rescale and round sizes to sum $T$ within bounds
4: for each item do
5:   Assign the item to a client with $\Pr(k) \propto \mathrm{rem}_k$, client $k$'s remaining capacity
6: end for
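The size-sampling step of Algorithm 3 can be sketched as below. Using a symmetric $\mathrm{Beta}(\xi, \xi)$ and dropping $L_{\mathrm{avg}}$ are simplifications for illustration, and the fix-up loop assumes $T$ is achievable within the bounds:

```python
import numpy as np

def coverage_sizes(N, K, L_min, L_max, xi, r, rng):
    """Sketch of CoveragePartition's size sampling: draw K client sizes
    from a Beta distribution whose concentration grows with xi (larger
    xi -> lower variance), then rescale/round so they sum to
    T = floor(r * N) while staying within [L_min, L_max]."""
    T = int(r * N)                              # total assignments
    u = rng.beta(xi, xi, size=K)                # symmetric Beta draws in (0, 1)
    sizes = L_min + u * (L_max - L_min)         # map draws to [L_min, L_max]
    sizes = np.round(sizes * T / sizes.sum()).astype(int)
    sizes = np.clip(sizes, L_min, L_max)
    # Greedy fix-up so sizes sum exactly to T (assumes T is feasible).
    diff = T - sizes.sum()
    step = 1 if diff > 0 else -1
    i = 0
    while diff != 0:
        k = i % K
        if L_min <= sizes[k] + step <= L_max:
            sizes[k] += step
            diff -= step
        i += 1
    return sizes
```

Items would then be placed into clients in proportion to each client's remaining capacity, per line 5 of the algorithm.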
Algorithm 4: HardnessPartition
Require: Items $N$; sets $\mathcal{S}$ (previously successful), $\mathcal{U}$ (previously unsuccessful); clients $K$; size $L$; dispersion $\xi'$
1: Partition $\mathcal{S}$ into easy subsets $\{Y_k\}$ via CoveragePartition (larger $\xi'$ $\Rightarrow$ lower variance)
2: for $k = 1$ to $K$ do
3:   Fill the remaining $L - |Y_k|$ slots with items $F_k$ drawn from $\mathcal{U}$
4:   $X_k \gets Y_k \cup F_k$
5: end for
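Algorithm 4 as a whole can be sketched as follows; the Beta-controlled split of the success set is a simplified stand-in for a full CoveragePartition call, and the fail pool is assumed large enough to fill every client:

```python
import numpy as np

def hardness_partition(success, fail, K, L, xi_prime, rng):
    """Sketch of HardnessPartition: split easy (previously successful)
    items across clients, then top each client up to size L with hard
    (previously failed) items."""
    # Beta-controlled share of easy items per client; larger xi_prime
    # concentrates the draws -> lower variance across clients.
    u = rng.beta(xi_prime, xi_prime, size=K)
    frac = u / u.sum()
    easy_counts = np.minimum((frac * len(success)).astype(int), L)
    pool = list(success)
    rng.shuffle(pool)
    clients, pos = [], 0
    for k in range(K):
        m = min(int(easy_counts[k]), len(pool) - pos)
        easy = pool[pos:pos + m]           # Y_k: easy items for client k
        pos += m
        # F_k: fill the remaining L - |Y_k| slots with hard items.
        hard = [fail[i] for i in rng.choice(len(fail), size=L - m, replace=False)]
        clients.append(easy + hard)        # X_k = Y_k  union  F_k
    return clients
```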

Main Results

Performance comparison of Local Training, Centralized Training, and FedAgent across different LLM agents on ALFWorld and WebShop benchmarks.


Qwen2.5-1.5B-Instruct

Method               ALFWorld (Success Rate %)                          WebShop
                     Pick   Look   Clean  Heat   Cool   Pick2  All      Score  Succ.
Local (Client 21)    42.9   25.0   38.5   37.5   14.3   14.3   29.7     69.9   57.0
Local (Client 42)    50.0   37.5   76.9   25.0   42.9   14.3   45.3     75.1   53.1
Local (Client 84)    50.0   37.5   46.2   25.0   28.6    0.0   34.4     72.7   47.7
Centralized          64.3   37.5   69.2   50.0   42.9   28.6   51.6     79.9   57.8
FedAgent             80.0   75.0   53.8   37.5   83.3   50.0   64.1     83.2   61.7

Qwen2.5-3B-Instruct

Method               ALFWorld (Success Rate %)                          WebShop
                     Pick   Look   Clean  Heat   Cool   Pick2  All      Score  Succ.
Local (Client 21)    41.5   12.5   34.9   51.0   18.9   21.2   31.3     59.8   55.0
Local (Client 42)    46.5   37.5   24.4   15.0   33.7   33.3   28.2     61.3   59.3
Local (Client 84)    22.8   27.5   39.1   46.3   48.3   36.5   29.9     77.6   58.6
Centralized          94.1   80.0   64.3   42.9   50.0   22.2   62.5     86.0   63.9
FedAgent             95.5   62.5   49.7   47.5   85.3   45.1   65.2     85.5   63.1

Qwen2.5-7B-Instruct

Method               ALFWorld (Success Rate %)                          WebShop
                     Pick   Look   Clean  Heat   Cool   Pick2  All      Score  Succ.
Local (Client 21)    35.5   25.0   61.0   25.9   35.8   45.2   38.4     70.9   49.2
Local (Client 42)    29.0   45.0   18.8   25.6   15.9   38.0   42.1     85.2   33.6
Local (Client 84)    34.7   47.5   44.4   51.3   40.1   21.8   35.7     60.6   39.3
Centralized          93.7   82.5   71.5   47.9   63.2   31.9   73.3     78.8   64.7
FedAgent             94.5   85.0   56.0   62.5   86.7   42.8   75.5     89.0   68.9

Llama-3.2-3B-Instruct

Method               ALFWorld (Success Rate %)                          WebShop
                     Pick   Look   Clean  Heat   Cool   Pick2  All      Score  Succ.
Local (Client 21)    39.8   50.0   17.9   40.0   20.7   34.0   38.1     65.3   50.5
Local (Client 42)    18.2   55.0   41.9   34.3   41.0   25.0   35.0     67.0   51.0
Local (Client 84)    29.9   32.5   39.0   18.9   18.8   37.6   29.7     70.2   55.7
Centralized          72.4   62.5   59.3   45.2   53.7   27.9   54.9     76.3   56.2
FedAgent             83.7   57.5   60.6   55.9   65.3   24.9   61.2     74.4   57.8

Experiments

FedAgent Matches Centralized Training

FedAgent with 100 clients and 2 selected per round achieves comparable success rates to centralized training on all pooled data, despite never sharing local data.

FedAgent vs Centralized Training

Training Dynamics in Different Decentralized Settings

FedAgent is highly sensitive to the number of clients selected per round and to the number of epochs per client per round, while being relatively insensitive to the number of samples per client.

Training Dynamics of FedAgent in Different Decentralized Settings

Two-Level Heterogeneity Analysis

Task mismatch alone barely affects training, but environment mismatch collapses the shared policy to near-zero, revealing the critical role of environment-level heterogeneity.

(a) WebShop: Task vs. Environment Heterogeneity

(b) ALFWorld: Task vs. Environment Heterogeneity

Theoretical Analysis

Why is FedAgent robust to task heterogeneity but fragile to environment heterogeneity?

Convergence Bound

We derive a convergence bound for FedAgent that decomposes the optimization error into three interpretable terms:

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla J(\theta_t)\|^2 \;\leq\; \underbrace{\mathcal{O}\!\left(\frac{\sigma^2}{MT\tau}\right)}_{\color{#2e7d32}{\text{(I) Stochastic noise}}} \;+\; \underbrace{\mathcal{O}\!\left(\tau^2 L^2 \delta_{\text{data}}^2\right)}_{\color{#1565c0}{\text{(II) Task-level drift}}} \;+\; \underbrace{\mathcal{O}\!\left(\tau^2 L^2 \delta_{\text{env}}^2\right)}_{\color{#c62828}{\text{(III) Env-level drift}}}$$
$\sigma^2$ (gradient noise): how noisy is each gradient?
$\delta_{\text{data}}^2$ (task divergence): how different are clients' tasks?
$\delta_{\text{env}}^2$ (environment divergence): how different are clients' worlds?

$\sigma^2$-Dominance Theory

In LLM agent settings, the exponential action space makes gradient noise $\sigma^2$ enormous. This has a surprising consequence:

$\sigma^2$ (gradient noise): enormous, growing toward $\infty$ with the exponential action space.
$\delta_{\text{data}}^2$ (task divergence): bounded, and absorbed by the noise.
$\delta_{\text{env}}^2$ (environment divergence): unbounded; a systematic conflict that the noise cannot absorb.

Key insight: FedAgent is intrinsically robust to task heterogeneity but fragile to environment heterogeneity.
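A toy calculation makes the dominance argument concrete. The function below evaluates the three bound terms with all hidden constants set to 1 (purely illustrative parameter values, not the paper's actual constants):

```python
def bound_terms(sigma2, delta_data2, delta_env2, M, T, tau, L):
    """Illustrative magnitudes of the three convergence-bound terms,
    with all hidden constants set to 1 (a toy calculation)."""
    noise = sigma2 / (M * T * tau)        # (I) stochastic noise
    task = tau**2 * L**2 * delta_data2    # (II) task-level drift
    env = tau**2 * L**2 * delta_env2      # (III) environment-level drift
    return noise, task, env
```

With an enormous $\sigma^2$, term (I) dwarfs the bounded task-drift term (II), so adding task heterogeneity barely changes the bound; an unbounded $\delta_{\text{env}}^2$, by contrast, can make term (III) exceed everything else.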

Client Distributions under Partitioning Strategies

We visualize the client distributions under our three partitioning strategies across both WebShop and ALFWorld benchmarks.

Preference Preference Heterogeneity

Each client draws items from a client-specific categorical distribution over task types, controlled by jitter parameter ω. A larger ω produces more skewed per-client category mixes, modeling scenarios where different users strongly prefer distinct types of tasks.

WebShop Preference Heterogeneity

WebShop (ω = 0.1 vs. ω = 0.9)

ALFWorld Preference Heterogeneity

ALFWorld (ω = 0.1 vs. ω = 0.9)

Coverage Coverage Heterogeneity

Each client receives a different number of task instructions, controlled by dispersion parameter ξ. A smaller ξ leads to higher variance in client dataset sizes, modeling scenarios where some users have extensive task collections while others have very few.

WebShop Coverage Heterogeneity

WebShop (ξ = 1 vs. ξ = 256)

ALFWorld Coverage Heterogeneity

ALFWorld (ξ = 1 vs. ξ = 256)

Hardness Hardness Heterogeneity

Each client receives a different mix of easy (previously successful) and hard (previously unsuccessful) tasks, controlled by parameter ξ'. A smaller ξ' leads to higher variance in difficulty levels across clients, modeling scenarios where some users face predominantly easy tasks while others encounter mostly hard tasks.

WebShop Hardness Heterogeneity

WebShop (ξ' = 1 vs. ξ' = 256)

ALFWorld Hardness Heterogeneity

ALFWorld (ξ' = 1 vs. ξ' = 256)

Citation

If you find our work useful in your research, please cite:

@article{chen2025fedagent,
  title={Federated Agent Reinforcement Learning},
  author={Canyu Chen and Kangyu Zhu and Zhaorun Chen and Zhanhui Zhou and Shizhe Diao and Yiping Lu and Tian Li and Manling Li and Dawn Song},
  year={2026},
  journal={arXiv preprint},
}