At a high level, bandits are problems which model the explore-exploit tradeoff: Do you exploit the knowledge you have now to gain reward, or explore another option which is potentially worse but potentially better?

As the story goes, the term bandit comes from slot machines. A single-slot machine is a “one-armed bandit” since pulling the arm tends to rob you of your money. The multi-armed bandits arises from a gambler facing a set of slot machines, wondering which arm to pull next.

There is a now a huge number of formulations of the bandit problem, including the multi-armed bandit, the contextual bandit, restless bandits, Kernel bandits, etc. And a tremendous amount of work has gone into each, way more than I dare to summarize here. For intro textbooks see Lattimore or Slivkins.

There are two main tasks one considers in bandit problems:

  • best-arm identification, in which one is asked to identify the arm with the highest reward
  • regret minimization, in which one maximizes cumulative reward and compares the result to some best-in-hindsight policy.

We are typically interested in results in expectation, where the expectation is taken over the pulls of the arms. In Bayesian bandits, however, the expectation is taken over the randomness of the arms and the prior distribution, and one is looking for the optimal policy with respect to this prior.