## The Explore-Exploit Dilemma

There are two completely different goals you need to address when you try to optimize a website: you need to (A) learn about new ideas (which we’ll always call **exploring** from now on), while you also need to (B) take advantage of the best of your old ideas (which we’ll always call **exploiting** from now on).

## What does “optimizing a website mean”?

“Optimizing a website” means a step-by-step process in which a web developer makes a series of changes to a website, each of which is meant to increase the **success** of that site.The word “success” will be used to describe measurable achievements like Traffic - Did a change increase the amount of traffic to a site’s landing page?

## A/B testing and its problems

To scientifically identify the changes that might lead to success, **controlled experimentation** is required. Websites resort to **A/B testing** which typically involves randomly assigning an incoming web user to one of two groups: Group A or Group B. This random assignment of users to groups continues on for a while until the web developer becomes convinced that either Option A is more successful than Option B or, vice versa. After that, the web developer assigns all future users to the more successful version of the website and closes out the inferior version of the website. Though the experimental approach to trying out new ideas has been extremely successful in the past and will continue to be successful in many contexts, it has a few problems.

Standard A/B testing consists of:

A short period of pure exploration, in which you assign equal numbers of users to Groups A and B.

A long period of pure exploitation, in which you send all of your users to the more successful version of your site and never come back to the option that seemed to be inferior.

Why might this be a bad strategy?

It jumps discretely from exploration into exploitation, when you might be able to smoothly transition between the two.

During the purely exploratory phase, it wastes resources exploring inferior options in order to gather as much data as possible. But you shouldn’t want to gather data about strikingly inferior options.

## Bandit Algorithms vs A/B Testing

Bandit Algorithms smoothly decrease the amount of exploring they do over time instead of requiring you to make a sudden jump.

They focus your resources during exploration on the better options instead of wasting time on the inferior options that are over-explored during typical A/B testing.

In fact, bandit algorithms address both of those concerns in the same way because they gradually fixate on the best available options over time - the technical term for this is **convergence**.

Bandit algorithms subsume A/B testing as a special case. Standard A/B testing describes one extreme case in which you jump from pure exploration to pure exploitation. Bandit algorithms let you operate in the much larger and more interesting space between those two extreme states.

## Bandit Problem Abstraction

**Arm** - In general, we’re going to assume that we have a fixed set of \(N\) different options and that we can enumerate them.

**Reward** - A reward is simply a measure of success: it might tell us whether a customer clicked on an ad or signed up as a user. What matters is simply that (A) a reward is something quantitative that we can keep of track of mathematically and that (B) larger amounts of reward are better than smaller amounts.

We’re facing a complicated slot machine, called a bandit, that has a set of \(N\) arms that we can pull on.

When pulled, any given arm will output a reward. But these rewards aren’t reliable, which is why we’re gambling. Any specific pull of any specific arm is risky.

Not only is each pull of an arm risky, we also don’t start off knowing what the reward rates are for any of the arms. We have to figure this out experimentally by actually pulling on the unknown arms.

We only find out about the reward that was given out by the arm we actually pulled. Whichever arm we pull, we miss out on information about the other arms that we didn’t pull.

Every time we experiment with an arm that isn’t the best arm, we lose reward because we could, at

**least in principle**, have pulled on a better arm.

The full Multiarmed Bandit Problem is defined by the five features above. Any algorithm that offers you a proposed solution to the Multiarmed Bandit Problem must give you a rule for selecting arms in some sequence. And this rule has to balance out your competing desires to (A) learn about new arms and (B) earn as much reward as possible by pulling on arms you already know are good choices.

## Bandit Algorithm Abstraction

**select_arm** - Every time we have to make a choice about which arm to pull, we want to be able to simply make a call to our favourite algorithm and have it tell us the numeric name of the arm we should pull. This method doesn’t take any arguments and returns the index of the next arm to pull.

**update** - After we pull an arm, we get a reward signal back from our system. We want to update our algorithm’s beliefs about the quality of the arm we just chose by providing this reward information. The update function takes as arguments (1) an algorithm object, (2) the numeric index of the most recently chosen arm and (3) the reward received from choosing that arm. The update method will take this information and make the relevant changes to the algorithm’s evaluation of all of the arms.

**counts** - A vector of integers of length \(N\) that tells us how many times we’ve played each of the \(N\) arms available to us in the current bandit problem.

**values** - A vector of floating point numbers that defines the average amount of reward we’ve gotten when playing each of the \(N\) arms available to us.

### Python code

```
from abc import ABC, abstractmethod
class BanditAlgorithm(ABC):
def initialize(self, n_arms):
self.counts = [0 for col in range(n_arms)]
self.values = [0.0 for col in range(n_arms)]
return
@abstractmethod
def select_arm(self):
pass
@abstractmethod
def update(self, chosen_arm, reward):
pass
```

## Arm abstraction

### Python code

```
from abc import ABC, abstractmethod
class Arm(ABC):
@abstractmethod
def draw(self):
pass
class BernoulliArm(Arm):
def __init__(self, p):
self.p = p
def draw(self):
if random.random() > self.p:
return 0.0
else:
return 1.0
```

## Epsilon-Greedy Algorithm

### Algorithm

With probability 1 – \(\epsilon\), the epsilon-Greedy algorithm exploits the best known option.

With probability \(\epsilon / 2\), the epsilon-Greedy algorithm explores the best known option.

With probability \(\epsilon / 2\), the epsilon-Greedy algorithm explores the worst known option.

**epsilon(\(\epsilon\))** - this is a floating point number that tells us the frequency with which we should explore one of the available arms.

### Python code

```
import random
def ind_max(x):
= max(x)
m return x.index(m)
class EpsilonGreedy(BanditAlgorithm):
def __init__(self, epsilon, counts, values):
self.epsilon = epsilon
return
def select_arm(self):
if random.random() > self.epsilon:
return ind_max(self.values)
else:
return random.randrange(len(self.values))
def update(self, chosen_arm, reward):
self.counts[chosen_arm] = self.counts[chosen_arm] + 1
= self.counts[chosen_arm]
n = self.values[chosen_arm]
value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
new_value self.values[chosen_arm] = new_value
return
```

### Analysis

If we set \(\epsilon = 1.0\), the algorithm will always choose among the different arms completely at random. This will provide you with a lot of data and will produce very clean data, because all of the arms will have equal amounts of data and there won’t be any lurking confounds that make it hard to understand why you got the results you got.

If you eventually set \(\epsilon = 0.0\), it’s true that you’ll stop wasting time on bad options. But you’ll never be able to learn about new options ever again.

There’s no reason you need to operate at either of these two extremes. Instead of moving from one period of completely random experimentation to another period of absolutely greedy selection of the so-called best option, the epsilon-Greedy algorithm lets you be more gradual.

### Weaknesses

The first weakness is that, as you get more certain which of the options is best, this tendency to explore the worse design a full \(100ϵ/2\)% of the time will become more wasteful. In the jargon of bandit algorithms, you’ll be over-exploring.

And there’s another problem with a fixed \(100ϵ\)% exploration rule: at the start of your experimentation, you’ll choose options that you don’t know much about far more rarely than you’d like to because you only try new options \(100\epsilon\)% of the time.

## The Softmax Algorithm

### Algorithm

We need to make our bandit algorithm care about the known differences between the estimated values of the arms when our algorithm decides which arm to explore. The Softmax algorithm tries to cope with arms differing in estimated value by explicitly incorporating information about the reward rates of the available arms into its method for choosing which arm to select when it explores.

Let \(\tau\) be the temperature parameter. The Softmax algorithm is as follows:

At time \(T\), select one of the arms with probabilities computed as follows:

- \(\mathbb{P}[A_i] = (e^{rA_i} / \tau) / \sum_{i=1}^n (e^{rA_i}/\tau)\) where \(rA_i\) is the mean reward for arm \(A_i\) till the time \(T-1\)

For whichever arm you picked, update your estimate of the mean using the same update rule we used for the epsilon-Greedy algorithm.

### Python code

```
import random
def categorical_draw(probs):
= random.random()
z = 0.0
cum_prob for i in range(len(probs)):
= probs[i]
prob += prob
cum_prob if cum_prob > z:
return i
return len(probs) - 1
class Softmax(BanditAlgorithm):
def __init__(self, temperature):
self.temperature = temperature
return
def select_arm(self):
= sum([math.exp(v / self.temperature) for v in self.values])
z = [math.exp(v / self.temperature) / z for v in self.values]
probs return categorical_draw(probs)
def update(self, chosen_arm, reward):
self.counts[chosen_arm] = self.counts[chosen_arm] + 1
= self.counts[chosen_arm]
n = self.values[chosen_arm]
value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
new_value self.values[chosen_arm] = new_value
return
```

### Analysis

It’s easiest to think of \(\tau\) as letting us shift the behaviour of the Softmax algorithm along a continuum defined by two extreme ways to select arms. At one extreme, we set \(\tau = 0.0\). This will give us a fully deterministic choice of the arm that has the highest estimated value.

At the other extreme, we set \(\tau = \infty\), which gives us purely random exploration like we got out of the epsilon-Greedy algorithm. The reason this \(\tau\) parameter is called a temperature parameter is that its effect on the selection of arms is like the effect of temperatures on atoms in traditional physics: at low temperatures, atoms will behave orderly and produce solids, but at high temperatures, they behaviour randomly and will produce gases.

## The Annealing Algorithm

The algorithms so far have one systematic weakness: they don’t keep track of how much they know about any of the arms available to them. They pay attention only to how much reward they’ve gotten from the arms. This means that they’ll under explore options whose initial experiences were not rewarding, even though they don’t have enough data to be confident about those arms.

Both the epsilon-Greedy algorithm and the Softmax algorithm share the following broad properties:

The algorithm’s default choice is to select the arm that currently has the highest estimated value

The algorithm sometimes decides to explore and chooses an option that isn’t the one that currently seems best:

The epsilon-Greedy algorithm explores by selecting from all of the arms completely at random. It makes one of these random exploratory decisions with probability epsilon.

The Softmax algorithm explores by randomly selecting from all of the available arms with probabilities that are more-or-less proportional to the estimated value of each of the arms. If the other arms are noticeably worse than the best arm, they’re chosen with very low probability. If the arms all have similar values, they’re each chosen nearly equally often.

In order to achieve better performance by making an effort to have these two algorithms explore less over time, both algorithms can be set up to modify their basic parameters dynamically over time. We call this modification

**annealing**.

For bandit algorithms, we can metaphorically decrease the temperature by changing the temperature parameter. This will make our algorithm exploit the best arm more often and settle into its final deterministic strategy for choosing an arm.

### Python code for Softmax Annealing

```
import math
import random
class AnnealingSoftmax(Softmax):
def select_arm(self):
''' Implementation of annealing'''
= sum(self.counts) + 1
t = 1 / math.log(t + 0.0000001)
temperature
super().select_arm()
```

### Python code for EpsilonGreedy Annealing

```
import math
import random
class AnnealingEpsilonGreedy(EpsilonGreedy):
def select_arm(self):
''' Implementation of annealing'''
= sum(self.counts) + 1
t = 1 / math.log(t + 0.0000001)
temperature
super().select_arm()
```

## UCB1 Algorithm

UCB can improve upon the epsilon-Greedy and Softmax algorithms: it can make decisions to explore that are driven by our confidence in the estimated value of the arms we’ve selected.

Why is it important to keep track of our confidence in the values of the arms?

The reason has to do with the nature of the rewards we receive from the arms: they’re noisy. If we use our past experiences with an arm, then the estimated value of any arm is always a noisy estimate of the true return on investment we can expect from it. The epsilon-Greedy and Softmax algorithms aren’t robust to this noise during their first experiences with things.The epsilon-Greedy and Softmax algorithms are ‘gullible’. They are easily misled by a few negative experiences. Because of their use of randomness, they can make up for this later.

UCB takes a very different approach. Instead, UCB avoids being gullible by requiring us to keep track of our confidence in our assessments of the estimated values of all of the arms. To do that, we need to have some metric of how much we know about each arm. We’ve been explicitly keeping track of the number of times we’ve pulled each arm for both of the algorithms we’ve used so far. We just need to find a way to take advantage of that information. The UCB family of algorithms does just that. In fact, their focus on confidence is the source of the name UCB, which is an acronym for Upper Confidence Bounds.

UCB1 algorithm, unlike its siblings, makes a couple of assumptions that you may need to be cautious about. Foremost of these is the assumption that the maximum possible reward has value \(1\). If that’s not true in your setting, you need to rescale all of your rewards to lie between \(0\) and \(1\) before using the UCB1 algorithm.

In addition to explicitly keeping track of our confidence in the estimated values of each arm, the UCB algorithm is special for two other reasons:

- UCB doesn’t use randomness at all. Unlike epsilon-Greedy or Softmax, it’s possible to know exactly how UCB will behave in any given situation. This can make it easier to reason about at times.
- UCB doesn’t have any free parameters that you need to configure before you can deploy it. This is a major improvement if you’re interested in running it in the wild, because it means that you can start to use UCB without having a clear sense of what you expect the world to behave like.

### Python Code

```
class UCB1(BanditAlgorithm):
def select_arm(self):
= len(self.counts)
n_arms for arm in range(n_arms):
if self.counts[arm] == 0:
return arm
= [0.0 for arm in range(n_arms)]
ucb_values = sum(self.counts)
total_counts for arm in range(n_arms):
= math.sqrt((2 * math.log(total_counts)) / float(self.counts[arm]))
bonus = self.values[arm] + bonus
ucb_values[arm] return ind_max(ucb_values)
def update(self, chosen_arm, reward):
self.counts[chosen_arm] = self.counts[chosen_arm] + 1
= self.counts[chosen_arm]
n = self.values[chosen_arm]
value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
new_value self.values[chosen_arm] = new_value
return
```

### Analysis

The UCB algorithm ensures that it has played every single arm available to it at least once. This is UCB’s clever trick for insuring that it doesn’t have a total cold start before it starts to apply its confidence-based decision rule.

It’s important to keep this initialization step in mind when you consider deploying UCB1: if you will only let the algorithm run for a small number of plays (say \(M\)) and you have many arms to explore (say \(N\)), it’s possible that UCB1 will just try every single arm in succession and not even make it to the end. If \(M < N\), this is definitely going to occur. If \(M\) is close to \(N\), you’ll still spend a lot of time just doing this initial walkthrough.

But, if you have a lot of plays ahead of you, this initial pass through all of the arms is a very good thing. It ensures that the UCB algorithm knows a little bit about all available options, which makes it very effective when there are clearly inferior arms that can be essentially ignored right from start.

The select_arm method for UCB1 uses a special type of value that we’ve called a ucb_value in this code. The ucb_value combines the simple estimated value of each arm with a special bonus quantity, which is __math.sqrt((2*math.log(total_counts)) / float(self.counts[arm]))__. The most basic statement that can be made about it is that it augments the estimated value of any arm with a measure of how much less we know about that arm than we know about the other arms.

The effect of that is that UCB is a explicitly curious algorithm that tries to seek out the unknown. The other factors around this core unit of curiosity are essentially rescaling terms that make UCB work properly. These rescaling terms allow the algorithm to define a confidence interval that has a reasonable chance of containing the true value of the arm inside of it.

UCB creates its ucb_values by replacing every arm’s estimated value with the upper bound on the confidence interval for its value. This is why the algorithm is the Upper Confidence Bound algorithm. But, setting aside issues of confidence bounds, the big idea that drives UCB is present in just dividing math.log(total_counts)) by float(self.counts[arm]). As we said above, this quantity becomes a big boost in the effective value of the arm for arms that we know little about. That means that we try hard to learn about arms if we don’t know enough about them, even if they seem a little worse than the best arm. In fact, this curiosity bonus means we’ll even occasionally visit the worst of the arms we have available.

## Testing algorithms using Monte Carlo

```
def test_algorithm(algo, arms, num_sims, horizon):
= []
results for sim in range(num_sims):
len(arms))
algo.initialize(for t in range(horizon):
= algo.select_arm()
chosen_arm = arms[chosen_arm].draw()
reward
algo.update(chosen_arm, reward)"Sim No": sim, "Time": t, "Chosen Arm": chosen_arm,
results.append({"Reward": reward})
return pd.DataFrame(results)
```

How does this function work?

We pass in a few objects:

- A bandit algorithm that we want to test.
- An array of arms we want to simulate draws from.
- A fixed number of simulations to run to average over the noise in each simulation.
- The number of times each algorithm is allowed to pull on arms during each simulation.

The framework then uses these objects to run many independent simulations. For each of these, it:

- Initializes the bandit algorithm’s settings from scratch so that it has no prior knowledge about which arm is best
- Loops over opportunities to pull an arm. On each step of this loop, it:
- Calls select_arm to see which arm the algorithm chooses
- Calls draw on that arm to simulate the result of pulling that arm.
- Records the amount of reward received by the algorithm and then calls update to let the algorithm process that new piece of information

Finally, the testing framework returns a data set that tells us for each simulation which arm was chosen and how well the algorithm did at each point in time.

### Sample code for testing algorithm performance

```
1)
random.seed(= [0.1, 0.1, 0.1, 0.1, 0.9]
means = len(means)
n_arms
random.shuffle(means)= [BernoulliArm(mu) for mu in means]
arms print("Best arm is " + str(ind_max(means)))
= {}
results for epsilon in [0.1, 0.2, 0.3, 0.4, 0.5]:
= EpsilonGreedy(epsilon)
algo "EpsGreedy",epsilon)] = test_algorithm(algo, arms, 5000, 250)
results[("EpsGreedy", epsilon)].groupby("Time").mean()["Reward"].plot(kind='line') results[(
```