In this research we study the interactions of multiple self-interested agents when the agents can learn about the environment and each others strategies. We have analyzed how the Q-learning algorithm from reinforcement learning can be used in such a setting. We have also discovered several problems - discussed in the papers - in applying reinforcement learning to multiagent settings.
Reinforcement learning (RL) is based on the idea that the tendency to produce an action should be strengthened (reinforced) if it produces favorable results, and weakened if it produces unfavorable results. Q-learning is a recent RL algorithm that does not need a model of its environment and can be used on-line. Therefore it is well-suited for use in repeated games against an unknown opponent. Most RL research has been confined to single agent settings or to multiagent settings where the agents have totally positively correlated payoffs (team problems) or totally negatively correlated payoffs (zero-sum games). We have empirically studied reinforcement learning in the iterated prisoner's dilemma (IPD), where the agents' payoffs are neither totally positively nor totally negatively correlated. RL is considerably more difficult in such a domain. We have investigated the ability of a variety of Q-learning agents to play the IPD game against an unknown opponent. In some experiments, the opponent is the fixed strategy Tit-for-Tat, while in others it is another Q-learner. All the Q-learners learned to play optimally against Tit-for-Tat. Playing against another learner was more difficult because the adaptation of the other learner created a nonstationary environment, and because the other learner was not endowed with any a priori knowledge about the IPD game such as a policy designed to encourage cooperation. The learners that were studied varied along three dimensions: the length of history they received as context, the type of memory they employed (lookup tables based on restricted history windows or recurrent neural networks that can theoretically store features from arbitrarily deep in the past), and the exploration schedule they followed. Although all the learners faced difficulties when playing against other learners, agents with longer history windows, lookup table memories, and longer exploration schedules fared best in the IPD games.
Questions about this research may be directed to sandholm@cs.wustl.edu.Sandholm, T. and Crites, R. 1996. On Multiagent Q-Learning in a Semi-competitive Domain. In Adaptation and Learning in Multi-Agent Systems, Weiß, G. and Sen, S., eds. Lecture Notes in Artificial Intelligence 1042 of Lecture Notes in Computer Science, Springer-Verlag, pp. 191-205.
Sandholm. T., Brodley, C., Vidovic, A. and Sandholm, M. 1996. Comparison of Regression Methods, Symbolic Induction Methods and Neural Networks in Morbidity Diagnosis and Mortality Prediction in Equine Gastrointestinal Colic. Working Notes of the AAAI Spring Symposium Series, Artificial Intelligence in Medicine: Applications of Current Technologies, pp. 154-159, Stanford University, CA.
Sandholm, M., Sandholm. T., Brodley, C., and Vidovic, A. 1996. Linear and logistic regression, symbolic induction methods, and neural networks in morbidity diagnosis and mortality prediction in equine gastrointestinal colic: An extended abstract. The Second Annual SEPSIS / SIRS Conference: Reducing Mortality to Patients & Suppliers, Washington, D.C. Poster presentation.
Sandholm, T. and Crites, R. 1995. Multiagent Reinforcement Learning in the Iterated Prisoner's Dilemma. Biosystems 37: 147-166, Special Issue on the Prisoner's Dilemma. To appear.
Sandholm, T. and Crites, R. 1995. On Multiagent Q-Learning in a Semi-competitive Domain. 14th International Joint Conference on Artificial Intelligence (IJCAI-95), Workshop on Adaptation and Learning in Multiagent Systems, Montreal, Canada, pp. 71-77.
Berkman, N. and Sandholm, T. 1995. What should be minimized in a decision tree: A re-examination. University of Massachusetts at Amherst, Computer Science Technical Report TR 95-20.