In probability theory and machine learning, the multi-armed bandit problem
(sometimes called the K- or N-armed bandit problem) takes its name from the
image of a gambler at a row of slot machines (sometimes known as "one-armed
bandits") who has to decide which machines to play, how many times to play
each machine, in which order to play them, and whether to continue with the
current machine or try a different one.
More generally, it is a problem in which a decision maker iteratively selects
one of a fixed set of choices (i.e., arms or actions) when the properties of
each choice are only partially known at the time of allocation and may become
better understood as time passes.
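
As a rough illustration of this setting, the following Python sketch simulates a decision maker who repeatedly chooses among a fixed set of arms and refines an estimate of each arm's payoff from the rewards it observes. Everything specific here is an assumption made for the example and not part of the definition above: the arms pay Bernoulli (0 or 1) rewards, the selection rule is the simple epsilon-greedy heuristic (one of many possible strategies), and the names `run_epsilon_greedy`, `true_means`, `n_rounds`, and `epsilon` are hypothetical.

```python
import random


def run_epsilon_greedy(true_means, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy player on a Bernoulli bandit.

    true_means: hidden success probability of each arm (assumed for the demo;
                the player never reads these directly).
    Returns (pull counts, estimated means, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm has been played
    estimates = [0.0] * n_arms   # running average reward of each arm
    total_reward = 0.0

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # The environment reveals a reward only for the chosen arm.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0

        # Incremental update of the chosen arm's sample mean.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, estimates, total_reward


if __name__ == "__main__":
    counts, estimates, total = run_epsilon_greedy([0.2, 0.5, 0.8])
    print("pulls per arm:  ", counts)
    print("estimated means:", [round(e, 3) for e in estimates])
    print("total reward:   ", total)
```

The `epsilon` parameter controls how often the player tries an arm at random instead of playing the arm that currently looks best, which is the "continue with the current machine or try a different one" decision from the gambler's story in its simplest form.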