History dependence is not the same as history utility. All policies exhibit detectable history dependence, yet only 43.2% show a significant performance gain from memory.
Probing Dec-POMDP Reasoning in Cooperative MARL
Motivation
As more multi-agent environments are developed, it is important to know whether they test the intended Dec-POMDP properties. High returns can mask a failure to learn the underlying coordination challenge.
Research Question
Do modern cooperative MARL environments truly test the Dec-POMDP properties that make these problems hard, or do they permit success via strategies that bypass them?
TL;DR
In many modern environments, reactive policies (policies with no memory) match the performance of memory-based agents in over half the scenarios, and emergent coordination frequently relies on brittle, synchronous action coupling rather than robust temporal influence.
Contributions
- Diagnostic framework. We introduce information-theoretic probes – measuring history dependence, private information flow, synchronous action coupling, and directed temporal influence – that audit whether learned policies actually exhibit Dec-POMDP reasoning, beyond what raw returns reveal.
- Systematic benchmark audit. We evaluate 37 scenarios across seven benchmark suites, revealing that history dependence is ubiquitous but rarely performance-critical, coordination structures vary qualitatively across domains, and few environments jointly test both partial observability and coordination.
- Open-source tooling and implications. We release diagnostic tools for researchers to audit their own environments, and discuss implications for designing tasks where partial observability and coordination are non-optional.
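To make the flavour of these probes concrete, here is a minimal sketch of one of them: screening for synchronous action coupling by comparing the plug-in mutual information between two agents' simultaneous discrete actions against a time-shuffled null. The function names (`plug_in_mi`, `coupling_flag`) are illustrative, not part of the released package, and the plug-in estimator stands in for the estimators used in the paper.

```python
import numpy as np

def plug_in_mi(x, y, n_actions):
    """Plug-in mutual information (nats) between two discrete action sequences."""
    joint = np.zeros((n_actions, n_actions))
    for a, b in zip(x, y):
        joint[a, b] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def coupling_flag(acts_i, acts_j, n_actions, null_reps=100, seed=0):
    """Flag synchronous coupling if observed MI exceeds the time-shuffled null."""
    rng = np.random.default_rng(seed)
    observed = plug_in_mi(acts_i, acts_j, n_actions)
    nulls = [plug_in_mi(rng.permutation(acts_i), acts_j, n_actions)
             for _ in range(null_reps)]
    return observed, observed > np.quantile(nulls, 0.95)

# Toy data: agent j copies agent i's action, so coupling should be flagged.
rng = np.random.default_rng(1)
a_i = rng.integers(0, 4, size=2000)
a_j = a_i.copy()
mi, flagged = coupling_flag(a_i, a_j, n_actions=4)
```

Shuffling one agent's action sequence in time destroys any per-timestep dependence while preserving marginals, which is what makes it a sensible null for *synchronous* coupling specifically.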
Abstract
Cooperative multi-agent reinforcement learning (MARL) is typically framed as a decentralised partially observable Markov decision process (Dec-POMDP), a setting whose hardness stems from two key challenges: partial observability and decentralised coordination. The Dec-POMDP framing motivates agents that use history to update hidden states and coordinate based on local information. Yet it remains unclear whether policies commonly trained on popular benchmark suites exhibit this reasoning or instead succeed via simpler strategies.
We introduce a diagnostic suite combining statistically grounded performance comparisons and information-theoretic probes to audit the behavioural complexity of baseline policies (IPPO and MAPPO) across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax. Our diagnostics reveal that, under these baseline policy distributions, learned behaviours often show no evidence that success depends on genuine Dec-POMDP reasoning. Reactive policies match the performance of memory-based agents in over half the scenarios, and emergent coordination frequently relies on brittle, synchronous action coupling rather than robust temporal influence.
These findings suggest that, under current training paradigms and baseline policies, some benchmark evaluations may not fully exercise core Dec-POMDP assumptions, potentially leading to over-optimistic assessments of progress. We release our diagnostic tooling to support more rigorous policy-behaviour analysis and evaluation in cooperative MARL.
Methodology
Results
% of scenarios whose tested IPPO/MAPPO policy behaviours are flagged by each diagnostic decision rule. Colour intensity ∝ fraction.
Hidden state and private information are separable. PIF flags 70.3% of scenarios, often different ones from HAR.
Coordination is structurally diverse. AA and DAI dissociate across benchmarks, separating synchronous action coupling from temporal influence.
Few benchmarks jointly test partial observability and coordination. MPE is the only suite in which every scenario satisfies all diagnostic criteria.
Together, these results motivate benchmarks where partial observability and decentralised coordination are non-optional for success, so evaluation reflects how policies solve tasks, not just how high their returns are. See the paper appendix for detailed diagnostic values, thresholds, null baselines, and confidence intervals.
| Diagnostic | MPE | SMAX V1 | SMAX V2 | MaBrax | Hanabi | OC V1 | OC V2 |
|---|---|---|---|---|---|---|---|
| Is partial observability relevant? (ΔMem + HAR) | 100% (3/3) | 100% (9/9) | 100% (3/3) | 20% (1/5) | 0% (0/1) | 0% (0/5) | 0% (0/11) |
| Do agents use hidden teammate information? (PIF) | 100% (3/3) | 67% (6/9) | 67% (2/3) | 40% (2/5) | 0% (0/1) | 20% (1/5) | 82% (9/11) |
| Is synchronous coordination detected? (AA) | 100% (3/3) | 44% (4/9) | 0% (0/3) | 80% (4/5) | 0% (0/1) | 100% (5/5) | 82% (9/11) |
| Is temporal coordination detected? (DAI) | 100% (3/3) | 67% (6/9) | 67% (2/3) | 80% (4/5) | 100% (1/1) | 40% (2/5) | 100% (11/11) |
OC = Overcooked. Under the tested policy behaviours, MPE is the only suite where every scenario is flagged by all four criteria. Full diagnostic tables are in the paper appendix.
Quickstart
Install and run all five diagnostics on your own trajectories:
```shell
pip install -e .
```

```python
import dec_pomdp_diagnostics as dpd

data = dpd.UserData(
    observations={"agent_0": obs0, "agent_1": obs1},   # (N, obs_dim)
    actions={"agent_0": act0, "agent_1": act1},        # (N,) int
    timesteps={"agent_0": ts0, "agent_1": ts1},        # (N,) int
    episode_ids={"agent_0": eps0, "agent_1": eps1},    # (N,) int
    hidden_states={"agent_0": h0, "agent_1": h1},      # optional, RNN only
    env_name="my_env", alg_name="IPPO_RNN", seed=0,
)

result = dpd.compute_diagnostics(data, history_k=3, null_reps=5)
print(result.describe())
# ✓ History dependence > null
# ✓ Do agents use hidden teammate info?
# ✗ Does synchronous coordination emerge?
# ✓ Does temporal coordination emerge?
```
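The flat `(N,)` arrays above can be built from per-episode rollouts. A minimal sketch, assuming each episode is a dict of NumPy arrays; `stack_episodes` is a hypothetical helper for illustration, not part of the released API:

```python
import numpy as np

def stack_episodes(episodes):
    """Flatten one agent's per-episode arrays into the flat format shown above:
    concatenated steps plus matching timestep and episode-id arrays.

    episodes: list of dicts with keys "obs" of shape (T, obs_dim) and "act" of shape (T,).
    """
    obs = np.concatenate([ep["obs"] for ep in episodes], axis=0)
    act = np.concatenate([ep["act"] for ep in episodes], axis=0)
    ts = np.concatenate([np.arange(len(ep["act"])) for ep in episodes])
    eps = np.concatenate([np.full(len(ep["act"]), i)
                          for i, ep in enumerate(episodes)])
    return obs, act, ts, eps

# Two toy episodes of lengths 3 and 2.
episodes = [
    {"obs": np.zeros((3, 4)), "act": np.array([0, 1, 2])},
    {"obs": np.ones((2, 4)), "act": np.array([1, 0])},
]
obs0, act0, ts0, eps0 = stack_episodes(episodes)
```

Timesteps restart at 0 within each episode and `episode_ids` disambiguates them, so variable-length episodes pose no problem.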
Challenges and Limitations
Policy-dependent probes
All diagnostics are expectations under the converged joint policy's trajectory distribution p_π and therefore characterise learned behaviour under IPPO/MAPPO with FF/RNN architectures, not worst-case or best-case properties of the environment. This is deliberate, as we probe behaviours induced by widely used algorithms; however, stronger or weaker algorithms may yield different diagnostic profiles for the same scenario.
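One plausible formalization of the first probe (our notation, an assumption rather than the paper's exact definition): history dependence for agent i is the extra information that the recent k-step history carries about the action beyond the current observation, a conditional mutual information under the policy's trajectory distribution:

```latex
I_{\pi}\!\left(A^{i}_{t};\, H^{i}_{t-k:t-1} \,\middle|\, O^{i}_{t}\right)
  = \mathbb{E}_{p_{\pi}}\!\left[
      \log \frac{p_{\pi}\!\left(a^{i}_{t} \mid h^{i}_{t-k:t-1},\, o^{i}_{t}\right)}
                {p_{\pi}\!\left(a^{i}_{t} \mid o^{i}_{t}\right)}
    \right]
```

A purely reactive policy makes this quantity exactly zero, and the permutation null calibrates how far a finite-sample estimate must exceed zero before the diagnostic fires.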
Estimation noise
Our MI/CMI/DI estimators (kNN and KSG) are biased in finite samples, especially with long histories or large action spaces. We mitigate this via permutation null baselines that account for estimator-specific bias, and report bootstrap confidence intervals throughout. Nonetheless, these probes are diagnostic tools, not hard pass/fail filters, and borderline cases should be interpreted with caution.
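To make the mitigation concrete, here is a minimal sketch of the two devices mentioned above, using a plug-in discrete MI estimator for simplicity (the function names are illustrative, not the released estimator): the mean of a permutation null absorbs finite-sample estimator bias, and a bootstrap over resampled pairs gives a confidence interval on the bias-corrected value.

```python
import numpy as np

def discrete_mi(x, y):
    """Plug-in mutual information (nats) between two discrete sequences."""
    xs = np.unique(x, return_inverse=True)[1]
    ys = np.unique(y, return_inverse=True)[1]
    joint = np.zeros((xs.max() + 1, ys.max() + 1))
    np.add.at(joint, (xs, ys), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def corrected_mi_with_ci(x, y, null_reps=200, boot_reps=200, seed=0):
    """Bias-corrected MI (observed minus mean permutation null) with a
    95% bootstrap confidence interval on the corrected value."""
    rng = np.random.default_rng(seed)
    bias = np.mean([discrete_mi(rng.permutation(x), y)
                    for _ in range(null_reps)])
    n = len(x)
    boots = []
    for _ in range(boot_reps):
        idx = rng.integers(0, n, size=n)  # resample (x, y) pairs jointly
        boots.append(discrete_mi(x[idx], y[idx]) - bias)
    corrected = discrete_mi(x, y) - bias
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return corrected, (lo, hi)

rng = np.random.default_rng(2)
x = rng.integers(0, 3, size=1500)
y = x % 2  # y is a deterministic function of x, so MI is clearly positive
mi_c, (lo, hi) = corrected_mi_with_ci(x, y)
```

If the interval straddles zero, the case is borderline in exactly the sense cautioned about above and should not be read as a hard pass or fail.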
BibTeX
@inproceedings{tessera2026probing,
title={Probing Dec-{POMDP} Reasoning in Cooperative {MARL}},
author={{Kale-ab} Abebe Tessera and Leonard Hinckeldey and Riccardo Zamboni and David Abel and Amos Storkey},
booktitle={The 25th International Conference on Autonomous Agents and Multi-Agent Systems},
year={2026},
url={https://openreview.net/forum?id=gSK8tR7du3}
}