Fixed mistake in find_mu_min() implementation
I noticed a mistake in the implementation of find_mu_min() (in whittle_computation.py), that affects cases of equal Whittle indexes. In more detail, when considering computed mu_i_k values (defined on line 13, page 22 of the paper), it skips all the next mu_i_k values that are roughly equal to the current mu. So, when calculating equal (or very close) mu values, it produces unexpected results, leading to false Whittle indexes and false positives regarding indexability.
Examples :
Input :
eps = 0.01
# transition matrix for active action
active_mat = np.array([[0, 1, 0],
[0, 1-eps, eps],
[0, 0, 1]])
active_reward = np.array([0, 0, 0])
# transition matrix for passive action
passive_mat = np.array([[0, 0, 1],
[0, 0, 1],
[0, 0, 1]])
passive_reward = np.array([0, 0, -1 ])
Expected result : Not indexable, returns [0, 0, 1]
Actual result : Indexable, returns [0, 1, 1]
Input :
eps = 0.01
# transition matrix for active action
active_mat = np.array([[1-eps, 0, eps],
[0, 1-eps, eps],
[0, 0, 1]])
active_reward = np.array([0, 0, 0])
# transition matrix for passive action
passive_mat = np.array([[1-eps, 0, eps],
[0, 1-eps, eps],
[0, 0, 1]])
passive_reward = np.array([0, 0, -1 ])
Expected result : Indexable, returns [0, 0, 1]
Actual result : Indexable, returns [0, 1, 1]
Code problem description :
Apart from the bad practice of catching the division-by-zero case by a try-except statement, the call to np.where() has no reason to be. Hence I implemented the following :
def find_mu_min(y, z, current_mu, atol):
"""
Find the smallest mu_i^k
"""
nb_elems = z.shape[0]
mu_i_k = np.empty(nb_elems)
for i in range(nb_elems):
if abs(z[i]) < atol: # if z[i] == 0
mu_i_k[i] = current_mu
elif z[i] > atol and 1.0 - y[i] > atol: # if z[i] and 1 - y[i] are > 0
mu_i_k[i] = current_mu + z[i]/(1.0-y[i])
else:
mu_i_k[i] = np.inf
argmin = mu_i_k.argmin()
return argmin, mu_i_k[argmin]
When generating randomly hundreds of thousands of MDPs of different sizes and using different methods, this new implementation produces the same Whittle indexes when both implementations find that MDPs are indexable and their Whittle indexes are distinct. Moreover, when the new implementation finds that an arm is not indexable, so does the old one.