Mark Crowley | Lab Spotlight: Recent Publications by Majid Ghasemi

Majid Ghasemi began his doctoral studies only this past September 2025, but he has already dived into his research on ethical decision making with reinforcement learning with gusto, getting multiple workshop papers accepted and producing initial results.

A side project coming out of one of his graduate courses led to a paper at the recent Canadian AI Conference in Vancouver (Ghasemi & Crowley, 2026). The paper combines analysis and theory to provide the first solid stability analysis of a type of scaling-limit idea for RL, work which could be a foundation for anyone who is interested in the field.

Ghasemi, M., & Crowley, M. (2026). Scaling Limits of Deep Reinforcement Learning: A Stability Analysis with Maximal Update Parametrization. Canadian Conference on Artificial Intelligence, 6.

Majid also submitted an abstract to a unique biannual interdisciplinary conference on Formal Ethics (Ghasemi & Crowley, 2026), which Prof. Crowley will present in Buffalo in July. The talk will explore new dimensions of how we can see ethics through the lens of reinforcement learning, arguing that classic RL and constraint optimization methods have structural limits on the type of ethical reasoning they can perform.

Ghasemi, M., & Crowley, M. (2026, July). Talk: Designing Virtuous Agents via Social Reinforcement Learning. Formal Ethics 2026.

One could argue that even with the use of RLHF to “align” Large Language Models to shared values, there remain a lot of blind spots in modern LLMs. Some of these failures arise from two types of drift away from the best, or at least better, outputs when measured with respect to truth and to ethics. In this workshop paper at ICML (Ghasemi & Crowley, 2026), Majid proposes Social Reinforcement Learning (Social RL) as a way of structurally enforcing feedback integrity. By situating agents in social environments driven by peer critique, reputation, observation, and sanction, Social RL treats alignment as an ongoing negotiation rather than a static specification problem, and offers mechanisms for correcting epistemic errors and stabilizing ethical norms in open-ended environments.

Ghasemi, M., & Crowley, M. (2026). Rethinking AI Alignment: From Static Rewards to Social Reinforcement Learning. Pluralistic Alignment Workshop at ICML 2026. 1. https://openreview.net/forum?id=dm2o7hw14F

In another workshop paper at ICML this year (Ghasemi & Crowley, 2026), Majid revisits an old, seemingly forgotten idea about how to combine rewards, or any scoring metrics, in a multi-agent system. The main idea is that “consensus” isn’t always agreement amongst all the agents if there are “tyrant” agents or certain types of peer-influenced collaborations occurring. This has implications for using social interaction to develop ethical reasoning abilities using RL.

Ghasemi, M., & Crowley, M. (2026). Can Standard MARL Metrics Distinguish Communicative from Strategic Action? ICML 2026 Workshop: Philosophy Meets Machine Learning. 1. https://openreview.net/forum?id=diNaPaat4w

The following paper was accepted last term to a workshop at AAAI, a major international conference.

Ghasemi, M., & Crowley, M. (2026, January). Toward Virtuous Reinforcement Learning: A Critique and Roadmap. Workshop on Machine Ethics: From Formal Methods To Emergent Machine Ethics at AAAI 2026. 1.

He accomplished a lot in his first two terms, getting two short papers into a national conference and two more into international workshops. All this while putting in an outstanding performance in multiple grad courses, TAing a course, and now beginning an internship with an industry partner.

And to cap it all off, he even won a Faculty of Engineering Term Award for his accomplishments in Winter 2026.

References:

Talk: Designing Virtuous Agents via Social Reinforcement Learning

Majid Ghasemi, and Mark Crowley

In Formal Ethics 2026. Buffalo, NY, USA. Jul, 2026.

This extended abstract argues that prevailing deontic and reward-centric approaches to ethical Reinforcement Learning face structural limits. Rule-based methods are brittle under ambiguity, and scalar rewards often compress multiple values into a single objective that invites proxy gaming. We instead treat ethics as policy-level dispositions, relatively stable habits that hold up when incentives, partners, or contexts change. We propose a Social Reinforcement Learning framework to design virtuous agents that acquire character not through solitary optimization, but through moral mimesis and socially mediated feedback (internalizing the stable norms of a multi-agent population).

Can Standard MARL Metrics Distinguish Communicative from Strategic Action?

Majid Ghasemi, and Mark Crowley

In ICML 2026 Workshop: Philosophy Meets Machine Learning. 2026.

Multi-agent Reinforcement Learning (MARL) systems are routinely evaluated using aggregate utility metrics. A population that converges to high reward is often described as having reached "consensus". Drawing on Habermas (1984), we distinguish a justified consensus from strategic action. Standard MARL objectives collapse the distinction: both record as similarly successful. We demonstrate the gap in a minimal foraging environment with one "Tyrant" agent that can unilaterally penalize peers. The system converges to high-reward equilibria nearly indistinguishable from a symmetric baseline by standard metrics. A coercion index, tracking how peers yield to credible threats, exposes what aggregate return hides. Trustworthy MARL evaluation requires diagnostics explicitly sensitive to capability asymmetry.

Rethinking AI Alignment: From Static Rewards to Social Reinforcement Learning

Majid Ghasemi, and Mark Crowley

In Pluralistic Alignment Workshop at ICML 2026. 2026.

Despite the widespread adoption of Reinforcement Learning from Human Feedback, state-of-the-art AI systems remain prone to two persistent failure modes: hallucination (producing fluent but false content) and moral drift (the convergence towards exploitative or harmful equilibria). We argue that these are not distinct phenomena but plausibly arise from a single underlying cause: feedback collapse. This occurs when complex human values are compressed into fixed scores and frozen offline, decoupling the training signal from the true goals of truth and rightness. We argue that optimizing for these proxies tends to misalign the learning process under distribution shift. To address this, we propose Social Reinforcement Learning (Social RL) as a promising route to structurally enforcing feedback integrity. By situating agents in social environments driven by peer critique, reputation, observation, and sanction, Social RL treats alignment as an ongoing negotiation rather than a static specification problem, and offers mechanisms for correcting epistemic errors and stabilizing ethical norms in open-ended environments.

Scaling Limits of Deep Reinforcement Learning: A Stability Analysis with Maximal Update Parametrization

Majid Ghasemi, and Mark Crowley

In Canadian Conference on Artificial Intelligence. Canadian Artificial Intelligence Association (CAIAC), Vancouver, BC, Canada. May, 2026.

While scaling laws have revolutionized supervised learning, their implications for Deep Reinforcement Learning remain under-explored. This paper investigates the theoretical and practical scaling limits of Deep Q-Networks by controlling network parameterization across varying widths. Our empirical results on CartPole-v1 demonstrate that: (1) The standard Feature Learning regime (Mean-Field Theory, α=1) achieves the highest peak performance (Return 79.6) but suffers from catastrophic divergence and rank collapse at large widths; (2) The Lazy Training regime (NTK, α=0) is performant (Return 72.1) but numerically ill-conditioned; and (3) Maximal Update Parametrization (μP, α=0.5) acts as a robust stabilizer, preventing divergence and rank collapse across the entire hyperparameter spectrum, albeit with more conservative learning dynamics (Return 49.7). These findings suggest that while feature learning is necessary for optimal control, naively scaling width without controlling update dynamics leads to optimization instability.
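For readers unfamiliar with width-dependent parameterizations, the following is a minimal sketch of the general idea of scaling a network's output by a power of its width. It is not the paper's code. The function names (`scaled_output`, `output_rms`), the tanh activation, the unit-variance Gaussian initialization, and the fixed 4-dimensional input are all assumptions made for illustration, and the way the paper maps its α values onto named regimes depends on its own initialization and learning-rate conventions, which are not stated in this post.

```python
import math
import random


def scaled_output(width, alpha, x, rng):
    """Output of a random two-layer tanh network, multiplied by width**(-alpha).

    Hypothetical helper for illustration only; not taken from the paper.
    """
    total = 0.0
    for _ in range(width):
        # Hidden unit: unit-variance Gaussian input weights, tanh activation.
        pre = sum(rng.gauss(0.0, 1.0) * xi for xi in x)
        # Output weight is also a unit-variance Gaussian (an assumption).
        total += rng.gauss(0.0, 1.0) * math.tanh(pre)
    return total / (width ** alpha)


def output_rms(width, alpha, trials=200, seed=0):
    """Root-mean-square output at initialization over independent random networks."""
    rng = random.Random(seed)
    # Arbitrary fixed input; 4 entries because CartPole-v1 observations have 4 dimensions.
    x = [1.0, -0.5, 0.25, 2.0]
    squares = [scaled_output(width, alpha, x, rng) ** 2 for _ in range(trials)]
    return math.sqrt(sum(squares) / trials)


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        row = [round(output_rms(n, alpha), 3) for n in (16, 64, 256)]
        print(f"alpha={alpha}: RMS output at widths 16/64/256 = {row}")
```

Under these assumed unit-variance weights, the unscaled sum over hidden units grows like the square root of the width, so an exponent of 0.5 keeps the initial output scale roughly constant, an exponent of 0 lets it grow with width, and an exponent of 1 shrinks it toward zero. This only illustrates why the choice of exponent matters once networks get wide; it says nothing about training stability, which is what the paper studies.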