Cooperative AI competitions without spiteful incentives

Let’s say we want to identify effective strategies for multi-agent games like the Iterated Prisoner’s Dilemma or more complex environments (like the kind of environments in Melting Pot). Then tournaments are a natural approach: let people submit strategies, and then play all these strategies against each other in a round-robin tournament.

There are lots of examples of such tournaments: Most famously, Axelrod ran two competitions for the iterated Prisoner’s Dilemma, identifying tit for tat as a good strategy. Less famously, there’s Alex Mennen’s open-source Prisoner’s Dilemma tournament, and my own little “open-prompt” Prisoner’s Dilemma tournament.

But there’s a question of what to do with these results and in particular how to reward participants for their strategies’ success. The most principled approach would be to reward participants in direct proportion to the utility they obtain. If your strategy earns 100 units of utility across interactions with all opponents, you get $100.

Unfortunately, this principled scheme is not very practical.1 Instead, I think tournaments will usually give ranking-based rewards. Alex Mennen and I gave a monetary reward to the top-ranked strategy in our respective tournaments. Axelrod didn’t hand out monetary rewards, but he dedicated quite a lot of attention in his book to the top-ranking strategy (or strategies), crediting Anatol Rapoport for submitting it.

Unfortunately, ranking-based rewards come with a number of problems. One (known) problem is that if you only reward the top (say, top three) programs, high-risk strategies might become more attractive than their expected utility would suggest. (It might be better to get, say, a utility of 100 for 1st place with 10% probability and a utility of 0 for bottom place with 90% probability, than to reliably get a reward of 90 for a ranking between 10 and 20.)2

I here want to focus on a different issue with ranking-based rewards: they create perverse incentives for spiteful or over-competitive behavior. Another’s failure might be your gain, if it makes the other participant fall below you in the rankings. For example, imagine we’re running an Iterated Prisoner’s Dilemma tournament with the following twist. Let’s say that occasionally players get an opportunity to inflict a large negative payoff (say, -100) on their opponent at no cost to themselves, in a way that’s undetectable to the opponent (and thus unpunishable). In a normal utility-based framework, there’s no reason to take this action. However, under ranking-based rewards, you would always want to take this spiteful action. Decreasing your opponent’s total utility across the tournament might lower their ranking below yours and thus improve your position in the final standings. Perhaps even more troublingly, for small enough epsilon, you’d be willing to sacrifice epsilon units of your own reward just to inflict this large negative payoff on your opponent.3

There is a relatively simple solution to this spite problem, however: Split the participant pool in two, e.g., at random. Call them Group A and Group B. Instead of having every strategy play against every other strategy (round-robin), have strategies from Group A play only against the strategies from Group B. Then, rank participants only within Group A. Do the same with Group B. So in the end you obtain two rankings: a ranking of Group A by performance (utility obtained) against Group B, and a ranking of Group B against Group A.

This approach eliminates the spite incentive entirely. Since you’re not ranked against anyone that you play against, you have no intrinsic interest in your opponent’s utilities.
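
To make the scheme concrete, here is a minimal sketch of the split-tournament scoring in Python. The strategies dict and the play_match function (assumed to return the pair of utilities from one full iterated game) are illustrative placeholders, not the implementation of any of the tournaments mentioned above.

    import random
    from collections import defaultdict

    def split_tournament(strategies, play_match, seed=0):
        """Split the pool at random, play only cross-group matches, and rank
        each group against the other group only.

        `strategies` maps names to strategy objects; `play_match(s1, s2)` is
        assumed to return (utility of s1, utility of s2) for one full
        iterated game between the two strategies.
        """
        names = list(strategies)
        random.Random(seed).shuffle(names)
        half = len(names) // 2
        group_a, group_b = names[:half], names[half:]

        totals = defaultdict(float)
        for a in group_a:              # cross-group matches only
            for b in group_b:
                u_a, u_b = play_match(strategies[a], strategies[b])
                totals[a] += u_a
                totals[b] += u_b

        # Two separate rankings: no strategy is ranked against an opponent it played.
        ranking_a = sorted(group_a, key=lambda n: totals[n], reverse=True)
        ranking_b = sorted(group_b, key=lambda n: totals[n], reverse=True)
        return ranking_a, ranking_b, dict(totals)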

Further considerations: Of course, there’s a cost to this: if we hold the set of submitted strategies fixed (and thereby ignore the problematic incentives of traditional round-robin), then in some sense the two-groups tournament is less informative than the single-group round-robin tournament. This is because we evaluate each strategy against a smaller sample of opponent strategies. For instance, if we have 40 participants, then normally we’d get an estimate of each participant’s expected utility based on 39 samples of opponent strategies. With the two-group split, we only get an estimate based on 19 samples (from the same distribution) of opponent strategies. So we get worse at judging whether the submitted strategies are good or bad. (As the number of submissions increases, the downside of halving the number of samples shrinks; conversely, the spite problem that motivates the split is most pronounced when the number of submissions is small.)

A different possible concern is that outside observers will not abide by the “ranking only within group” rule. If all of the submitted strategies are published, then an observer could simply run a round-robin tournament on the submitted strategies and publish an alternate ranking. This alternate ranking might then confer prestige, and to optimize their position in the alternate ranking, participants are once again incentivized to submit spiteful strategies. To prevent this, one would have to make it difficult enough to reproduce the tournament. (In many cases, one might justifiably hope that nobody will bother publicizing an alternate ranking, though.)

Even if just the overall utility obtained is published, a third-party observer might simply take the highest-scoring strategy across the two groups to be the “real winner”. For small numbers of participants, the two winners are incomparable, because they faced different sets of opponents. But the distribution over opponents is the same. So as the participant pool grows, the comparison becomes valid. If outside parties naively aggregate across groups and the resulting comparisons confer prestige, participants are once more incentivized to submit spiteful programs to fare better in the aggregated rankings.

To combat this, the tournament organizers could try to publish more limited information about the different strategies’ scores.4

Related ideas: The spite incentive problem and the population split idea follow a common pattern in mechanism design: In many cases, your first idea for a mechanism will set some bad incentives on participants, because of some specific downstream effect of the participants’ decisions. To fix the mechanism, you can remove that downstream effect, usually at some cost.

For instance, this is how you might get from a first- to a second-price auction:

  • The most natural idea for an auction is to just ask participants (bidders / people interested in the item you’re looking to sell) what they’d be willing to pay for the item and sell to the highest bidder. How much do you then charge the highest bidder? Well, however much they’re willing to pay, of course! Or perhaps, however much they’re willing to pay minus $1 (since otherwise they wouldn’t gain anything from participating).
  • But now you realize that participants aren’t necessarily incentivized to honestly report how much they’re willing to pay: by reporting a lower number they might get away with paying less for your item.
  • Let’s say that you want bidders to report their willingness to pay honestly (e.g., because you want to make sure to sell to whoever wants it most). Then to fix the incentive issue, you need to determine the price paid by the winner in a way that doesn’t depend on the winner’s bid, while still making sure that the price is at most equal to the winner’s bid. One way of doing this is to charge them the second-highest bid (sketched below). (It could also be the third-highest bid or the average of all the other bids, but, of course, these numbers are all lower than the second-highest bid and so result in less profit. If, in addition to the participants’ bids, you get a signal from a third party telling you that the minimum distance between bids is $20, then you could also charge the second-highest bid plus $20.) Of course, the cost is that the price we charge is not informed by the winning participant’s willingness to pay.
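
For concreteness, here is a toy sketch of the two pricing rules side by side; the bidder names and bid values are made up.

    def first_price(bids):
        """Naive rule: the winner pays their own bid, so bidders are tempted to shade bids down."""
        winner = max(bids, key=bids.get)
        return winner, bids[winner]

    def second_price(bids):
        """Second-price rule: the winner pays the highest losing bid, so the price
        paid does not depend on the winner's own report."""
        winner = max(bids, key=bids.get)
        price = max(bid for name, bid in bids.items() if name != winner)
        return winner, price

    # Made-up example bids:
    bids = {"Alice": 120, "Bob": 90, "Carol": 70}
    print(first_price(bids))   # ('Alice', 120)
    print(second_price(bids))  # ('Alice', 90)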

The population splitting idea can be discovered by an analogous line of thinking: to fix the spite incentive, you need to remove the influence from your opponent’s utility on your ranking. The simplest way of doing this is to only play against opponents against whom you’re not subsequently ranked.

Of course, the details of the auction case are quite different from the present setting.

The most closely related idea that I’m aware of is about eliminating similar bad incentives in peer review. In peer review, the same people often serve as reviewers and as authors. Conferences often select papers in part based on rankings (rather than just absolute scores) of the papers. In particular, they might choose based on rankings within specific areas. Therefore, if you serve as a reviewer on a paper in your area, you might be incentivized to give the paper a lower score in order to improve the ranking of your paper. The obvious idea to avoid this problem is to make sure that a reviewer isn’t assigned to a paper that they’re competing against. For a brief introduction to this problem and solution, along with some references, see Shah (2022, Section “Dishonest Behavior”, Subsection “Lone Wolf”). I think the fundamental idea here is the same as the idea behind the above partition-based approach to cooperative AI competitions. Some of the details importantly differ, of course. For instance, each reviewer can only review a small number of papers anyway (whereas in the cooperative AI competition one can evaluate each program against all other programs).

Footnotes:

  1. In many cases, the rewards aren’t primarily monetary. Rather, a high ranking is rewarded with attention and prestige, and that prestige and attention are hard for the organizers to control at all. But even distributing monetary rewards in proportion to utility is impractical. For example, you would then effectively reward people merely for entering the competition. Requiring a submission fee to counter this has its own issues (such as presumably becoming subject to gambling regulations). ↩︎
  2. There is some literature on this phenomenon in the context of forecasting competitions. That is, in a forecasting competition with ranking-based rewards, forecasters typically aren’t incentivized to submit predictions honestly, because they want to increase (and sometimes decrease!) the variance of their score. See, e.g., Lichtendahl and Winkler (2007) and Monroe et al. (2025). ↩︎
  3. How big is the spite issue? Well, for one this depends, of course, on the game that is being played. My guess is that in the simple Prisoner’s Dilemma games that most of the existing tournaments have focused on, these spiteful incentives are relatively insignificant. There just aren’t any opportunities to hurt opponents cheaply. But I’d imagine that in more complex games, all kinds of opportunities might come up, and so in particular there might be opportunities for cheap spite.

    Second, the importance of spite depends on the number of participants in the tournament. In the extreme case, if there are only two participants, then the interaction between the two is fully zero-sum. As the number of participants grows, the benefits of harming a randomly selected opponent decrease, as a randomly selected opponent is unlikely to be a relevant rival.

    The exact dynamics are complicated, however. For instance, if there are only two participants with a realistic chance at first place (perhaps due to technical implementation challenges), then encounters between these top contenders become effectively zero-sum from each player’s perspective – if they can “recognize” each other. ↩︎
  4. A natural first idea is to only publish relative scores for each group. For instance, in each group separately, subtract from each strategy’s utility the utility of the top strategy. This way the published top score in each group will be 0. This will work against a naive third party. But a more sophisticated third party might be able to infer the original scores if there are pairs of very similar strategies between the groups. For instance, in the Iterated Prisoner’s Dilemma, we might imagine that both groups will contain a DefectBot, i.e., a strategy of defecting all the time, regardless of the opponent’s actions. So in some sense this approach obscures too little information. For better protection, one might have to obscure information even more. The question of how to do this is similar to the questions studied in the Differential Privacy framework.
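
    As a toy illustration, the naive relative-score publication described above might look like this (totals is a hypothetical mapping from strategy names to tournament utilities, as in the sketch earlier in the post):

        def published_scores(totals, group):
            """Publish each strategy's utility relative to the best score in its
            own group, so the top published score in each group is 0."""
            best = max(totals[name] for name in group)
            return {name: totals[name] - best for name in group}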

    One radical approach would be to only publish the ranking and to never publish numeric scores. Presumably this makes it impossible to compare across groups. This has other downsides, though. If, say, our top-three strategies achieve similar scores, they deserve similar amounts of attention – they are similarly promising approaches to cooperation. If we publish scores, outside observers will be able to tell that this is the case. If we publish just the ranking, observers won’t know whether, say, the second-best strategy is competitive with the best or not. So we’re making it more difficult for outside observers to draw conclusions about the relative promise of different research directions. We might also worry that publishing only the ranking will worsen the incentives toward high-variance strategies, because the reward for coming in at a close second place is lower. ↩︎

5 thoughts on “Cooperative AI competitions without spiteful incentives”

  1. Lukas Finnveden

    Nice post.

    Here’s an idea for crowning a single winner (that maybe protects you from having to keep all that information secret):

    • Randomly sort into group A and group B and see which 2 strategies come out on top of each ranking.
    • Re-randomize into group A and group B and repeat N times. (Recording statistics of all the rankings.)
    • Crown as winner whatever strategy had the highest probability of coming out on top (within their own group) across all N rounds.

    (You could also crown the winner based on any other statistic computed just from the rankings within each group. E.g.: Highest average ranking across all N rounds.)

    This would defeat uniform spitefulness. However, it could still incentivize strategic targeted spitefulness/helpfulness — where you spite strategies that seem more likely to threaten your victory, and help strategies that seem unthreatening.
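
    Here’s a rough sketch of this repeated-splitting procedure (the strategies dict and play_match function are illustrative placeholders; play_match is assumed to return the pair of utilities from one match):

        import random
        from collections import Counter

        def repeated_split_champion(strategies, play_match, n_rounds=1000, seed=0):
            """Re-randomize the A/B split n_rounds times and crown whichever
            strategy most often tops its own group's ranking. (Each pairwise
            match could be played once and cached, per the note further down.)"""
            rng = random.Random(seed)
            names = list(strategies)
            tops = Counter()
            for _ in range(n_rounds):
                rng.shuffle(names)
                half = len(names) // 2
                group_a, group_b = names[:half], names[half:]
                for group, opponents in ((group_a, group_b), (group_b, group_a)):
                    totals = {s: sum(play_match(strategies[s], strategies[o])[0]
                                     for o in opponents)
                              for s in group}
                    tops[max(group, key=totals.get)] += 1
            champion, _ = tops.most_common(1)[0]
            return champion, tops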

    ——————

    Here’s an idea that seems more principled: You can do pair-wise comparisons of two strategies Si and Sj by playing them both off against each other strategy in the pool (but not against each other). If Si gets more utility than Sj, then we say that Si beats Sj. If you have a large pool of entrants, I think you’re pretty likely to get a Condorcet winner, because all match-ups are very similar (you’re just removing and adding one strategy from the mix).

    I’m not sure what you do if you get cycles though, and if that could introduce incentives for spitefulness. But people have devised lots of voting rules for this sort of thing. Maybe one of them could fully defeat spitefulness? (You can maybe also bring in the Si vs. Sj fight as a tie-breaker.)

    ——————

    (Note that these sorts of tournaments aren’t significantly more computationally expensive than a normal round-robin tournament, because you can run each 1v1 fight just once, cache the result, and then reuse it for each simulated re-match.)
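
    And a rough sketch of the pairwise-comparison scheme, caching each 1v1 result as just described (same illustrative play_match and strategies placeholders as above):

        def pairwise_champions(strategies, play_match):
            """For each ordered pair (Si, Sj), compare their total utilities against
            every other strategy in the pool, excluding the Si-vs-Sj game itself.
            Returns the strategies that beat all others in this sense (the
            Condorcet winner(s); empty if there is a cycle at the top)."""
            names = list(strategies)
            cache = {}  # each 1v1 fight is run once and reused

            def utility(i, j):
                key = (min(i, j), max(i, j))
                if key not in cache:
                    cache[key] = play_match(strategies[key[0]], strategies[key[1]])
                u = cache[key]
                return u[0] if i == key[0] else u[1]

            def beats(i, j):
                others = [k for k in names if k not in (i, j)]
                return sum(utility(i, k) for k in others) > sum(utility(j, k) for k in others)

            return [i for i in names if all(beats(i, j) for j in names if j != i)]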

    1. Lukas Finnveden

      Further thoughts:

      • Maybe “champion” is a better term than “condorcet winner” here.
        • Claude says “champion” is an established term from “tournament theory”. We probably want to use ideas from tournament theory rather than voting theory since this is more like a tournament.
      • I’m pretty sure that your impact on opponents’ utilities can’t change whether you become a champion (/condorcet winner) or not, in the above scheme. So if the pay-off structure was just “you win a prize if you’re the champion, otherwise no one wins anything”, I think that wouldn’t incentivize spite.
        • (Though if you’re not a champion yourself, you might be able to affect whether anyone else becomes a champion. So e.g. “you win a prize if you’re the champion, otherwise it gets split equally among everyone in the top cycle” probably incentivizes strategic spite/helpfulness sometimes.)
      • This parenthetical was a terrible idea: “You can maybe also bring in the Si vs. Sj fight as a tie-breaker”. This directly introduces the incentive for spite in a maximal, zero-sum fashion.
      • I feel somewhat optimistic that it’s possible to find a principled algorithm that picks a single winner (or tie when that’s appropriate) without introducing incentives for spite, because it kinda feels like you should be able to keep repeating the trick “specifically exclude games between Si and Sk when you’re trying to determine if Si or Sj should be the winner”. But it seems like this could quickly get pretty complex, and my intuition could totally be wrong.

      1. Lukas Finnveden

        Oops.

        because it kinda feels like you should be able to keep repeating the trick “specifically exclude games between Si and Sk when you’re trying to determine if Si or Sj should be the winner”.

        This should say “specifically exclude games between Si and Sj when you’re trying to determine if Si or Sj should be the winner”.

      2. (I started writing this in response to your initial comment, then only saw your newer comments a bit later. So this is a bit less coherent than it might otherwise be.)

        These are both very nice ideas!

        I agree with your analysis of the first proposal. It’s interesting that the problematic dynamics here are quite different, especially the targeted helpfulness point.

        One could debate how much this buys us relative to the naive single-ranking approach. In principle, if “recognition” and “helping” are both easy/cheap while spite is difficult/expensive, then the naive ranking might even be better, right? So probably it depends a bit on the environment and so on. Intuitively, it does seem better, though. Recognizing specific opponent strategies seems quite difficult.

        I also like the second method. If we could be sure that there’ll be a Condorcet winner, this would obviously be much better than my proposed method (single winner; throwing away much less/no data).

        My guess is that it’s difficult to fill in the “no Condorcet winner” gaps. I don’t have a rigorous argument for why this is impossible, but it seems difficult to fill the gap without giving you some sort of preference over the relative rankings of your opponents. E.g., if you know that you’ll beat Alice and you know that Bob will beat you, then it seems that these Condorcet-winner-type approaches will always make you want to help Alice and hurt Bob, in order to ensure that Bob is not a Condorcet winner, and to trigger a different kind of evaluation, which you then have some chance of winning. (E.g., presumably the following doesn’t work for this reason: If there is a Condorcet winner, the Condorcet winner is the champion. Otherwise, take the “winning Condorcet cycle” and rank its members against the rest of the population.) But I’m unsure.

        I think it’s probably true that as the number of participants grows, the probability that there is a Condorcet winner approaches 1, at least under some reasonable assumptions. (Maybe if many people submit extremely similar strategies, then it remains likely that there’s no Condorcet winner?) But in general it seems that things get easier if the number of participants grows.

        I also like the “just don’t fill in the gap” version (i.e., if there is a Condorcet winner, have it be the winner (champion); otherwise, just don’t have a winner). I agree that it probably doesn’t have the spite issue. Perhaps sometimes crowning no winner is about as good as always having two winners. (It depends on how likely the “no winner” outcome is, of course.)

        >This parenthetical was a terrible idea: “You can maybe also bring in the Si vs. Sj fight as a tie-breaker”. This directly introduces the incentive for spite in a maximal, zero-sum fashion.

        I agree. 🙂

        One other argument against both of these methods is that the “output” of the method is not (based on) an estimator of the expected utility of the submitted strategies against the distribution of submitted opponent strategies. It’s a little bit unclear how much this matters, given that one of the premises of the setting is that participants might not care about their utilities, and that risk attitudes are messed up by ranking considerations. But one thing that’s nice about the round-robin-type stuff (including the split version) is that the utilities are still part of the final output. So you’re at least leaving space for people to care about these numbers. For instance, the best team (the team that expects to win the competition) might care about by how much they win, in part because third parties looking at the results will see these numbers. If the primary output of your method is based on lots of pairwise comparisons, you’re pushing people toward caring about pretty different stuff. For instance, the best team might now only care about making sure that it’s better by some safe margin than every strategy that they think competitors might submit. Not clear whether this matters much. E.g., in principle you could publish both a ranking with utilities obtained and determine the winner of a cash reward based on pairwise comparisons.

        1. Lukas Finnveden

          One could debate how much this buys us relative to the naive single-ranking approach. In principle, if “recognition” and “helping” are both easy/cheap while spite is difficult/expensive, then the naive ranking might even be better, right? Intuitively, it does seem better, though. Recognizing specific opponent strategies seems quite difficult.

          Yep, agree it’s unclear.

          if you know that you’ll beat Alice and you know that Bob will beat you, then it seems that these Condorcet-winner-type approaches will always make you want to help Alice and hurt Bob, in order to ensure that Bob is not a Condorcet winner, and to trigger a different kind of evaluation, which you then have some chance of winning

          Yeah it does seem tricky. Agree that the example strategy probably doesn’t work for the stated reason.

          One other argument against both of these methods is that the “output” of the method is not (based on) an estimator of the expected utility of the submitted strategies against the distribution of submitted opponent strategies.

          Yeah that makes sense as a desideratum. It would at the very least be nice to get a complete ranking that one can publish, rather than just a single winner.

          If it is somehow possible to crown a single winner without weird incentives (using some weird solution in the cases without a Condorcet winner), then I guess the next question would be whether there’s a way to create a whole ranking without weird incentives. And if that is also somehow possible, then the next question after that would be whether there’s some cardinal statistic capturing roughly what utility each strategy gets, while also corresponding to a ranking without weird incentives.

          Not questions for today 🙂
