Linear bandit agents (LinTS, LinUCB) get stuck always selecting the first action with tikhonov_weight=0
Hey everyone!
I've recently been experimenting with linear bandits, and during one of my experiments I found that neither the LinTS nor the LinUCB agent works with tikhonov_weight=0. There is no error message, but the (collect) policy of these bandits gets stuck selecting action_0 (the first action) no matter the context. And since the (collect) policy never selects any other action, only the oracle for action_0 can ever be trained.
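For reference, here is a minimal sketch of a repro. The spec construction and hyperparameters are illustrative assumptions on my part; only tikhonov_weight=0.0 is essential:

```python
import tensorflow as tf
from tf_agents.bandits.agents import lin_ucb_agent
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

context_dim = 4
num_actions = 3

# A plain linear bandit over a continuous context.
observation_spec = tensor_spec.TensorSpec([context_dim], tf.float32)
time_step_spec = ts.time_step_spec(observation_spec)
action_spec = tensor_spec.BoundedTensorSpec(
    shape=(), dtype=tf.int32, minimum=0, maximum=num_actions - 1)

agent = lin_ucb_agent.LinearUCBAgent(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    tikhonov_weight=0.0)  # <-- the problematic setting

# Query the untrained collect policy with random contexts:
# the chosen action is 0 every time, regardless of the context.
for _ in range(5):
  observation = tf.random.normal([1, context_dim])
  step = agent.collect_policy.action(ts.restart(observation, batch_size=1))
  print(step.action.numpy())  # always [0]
```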
I have traced the origin of the error to LinearBanditVariableCollection, where each cov_matrix is initialized as a zero matrix:
https://github.com/tensorflow/agents/blob/69f5ba2e76ba8c3a1529fa0e3bf08a2e5cd3d68d/tf_agents/bandits/agents/linear_bandit_agent.py#L81-L84
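Paraphrased (see the link above for the exact code; variable names here are illustrative), every per-arm covariance variable starts out as all zeros:

```python
import tensorflow as tf

context_dim = 4  # illustrative value

# Paraphrase of the linked initialization, not a verbatim copy:
# the per-arm covariance starts as the zero matrix.
cov_matrix = tf.compat.v2.Variable(
    tf.zeros([context_dim, context_dim]), name='cov_matrix_0')
```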
When we use the (collect) policy before the first training step, linear_bandit_policy._distribution tries to solve for beta as follows:
```
(cov_matrix + tikhonov_weight * tf.eye) * beta = current_context
```
https://github.com/tensorflow/agents/blob/69f5ba2e76ba8c3a1529fa0e3bf08a2e5cd3d68d/tf_agents/bandits/policies/linear_bandit_policy.py#L266-L269
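A scalar analogue shows why this blows up. This is illustrative arithmetic only; the linked code uses a matrix solver, but it ends up dividing by the same vanishing quantities:

```python
import numpy as np

cov_matrix = np.zeros(())      # scalar stand-in for the all-zero covariance
tikhonov_weight = 0.0
current_context = np.float64(1.0)

with np.errstate(divide='ignore'):
    beta = current_context / (cov_matrix + tikhonov_weight)
print(beta)  # inf -- no finite solution exists
```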
With cov_matrix being a zero matrix and tikhonov_weight=0, the LHS is identically zero while the RHS clearly is not, so the system has no solution and the solver produces infinite values. These propagate up to the tf.argmax over the expected rewards for each action, which receives NaN values; since every comparison with NaN evaluates to false, the argmax never moves off the first position.
https://github.com/tensorflow/agents/blob/69f5ba2e76ba8c3a1529fa0e3bf08a2e5cd3d68d/tf_agents/bandits/policies/linear_bandit_policy.py#L306-L309
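The argmax behaviour is easy to check in isolation (hypothetical reward values; on CPU in eager mode this prints 0):

```python
import numpy as np
import tensorflow as tf

# Before any training, every arm's expected-reward estimate is NaN.
# Every comparison against NaN is False, so argmax keeps index 0.
expected_rewards = tf.constant([np.nan, np.nan, np.nan])
print(tf.argmax(expected_rewards).numpy())  # 0
```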
After the first training step, the cov_matrix for action_0 is nonzero, so the equation `(cov_matrix + tikhonov_weight * tf.eye) * beta = context` has a real-valued solution for that arm, while it still has none (the result is infinity) for all other arms. This once again propagates to the tf.argmax, which now receives a tensor with one real value (for action_0) and NaN for every other action, and thus still selects the first position.
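The same check for the post-training state (again with hypothetical values):

```python
import numpy as np
import tensorflow as tf

# One finite estimate (action_0) and NaN for every untrained arm:
# argmax still returns 0, so action_0 keeps winning forever.
expected_rewards = tf.constant([0.7, np.nan, np.nan])
print(tf.argmax(expected_rewards).numpy())  # 0
```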
This is how the linear bandit agents get stuck always selecting just the first action.
I believe the most straightforward solution is to initialise the cov_matrix_list in LinearBanditVariableCollection with tf.eye() instead of tf.zeros(), which corresponds to the algorithm description in both of the original papers:

- LinUCB: Li et al., "A Contextual-Bandit Approach to Personalized News Article Recommendation" (WWW 2010)
- LinTS: Agrawal and Goyal, "Thompson Sampling for Contextual Bandits with Linear Payoffs" (ICML 2013)
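As a sketch (same paraphrased shape as the initialization snippet above, not a tested patch):

```python
import tensorflow as tf

context_dim = 4  # illustrative value

# Seeding with the identity keeps cov_matrix + tikhonov_weight * eye
# invertible even before training and even when tikhonov_weight == 0.
cov_matrix = tf.compat.v2.Variable(
    tf.eye(context_dim), name='cov_matrix_0')
```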
I'm happy to open a PR, but I wanted to discuss the change first, and I wasn't sure whether a PR is even needed for such a simple change.
Best regards, Michal