Neural Contextual Bandits with Deep Representation and Shallow Exploration

We study a general class of contextual bandits, where each context-action pair is associated with a raw feature vector, but the reward generating function is unknown. We propose a novel learning algorithm that transforms the raw feature vector using the last hidden layer of a deep ReLU neural network (deep representation learning), and uses an upper confidence bound (UCB) approach to explore in the last linear layer (shallow exploration). We prove that, under standard assumptions, our proposed algorithm achieves $\tilde{O}(\sqrt{T})$ finite-time regret, where $T$ is the learning time horizon. Compared with existing neural contextual bandit algorithms, our approach is much more computationally efficient, since it only needs to explore in the last layer of the deep neural network.
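
The idea of exploring only in the last linear layer can be illustrated with a minimal sketch. It is not the paper's algorithm: the hidden ReLU layers here are fixed at a random initialization instead of being trained, the simulated reward is assumed to be exactly linear in the last-hidden-layer features, and every name (`relu_features`, `LastLayerUCB`, `run`, `lam`, `alpha`) is hypothetical and chosen for illustration only.

```python
import numpy as np


def relu_features(x, weights):
    """Map a raw feature vector to the output of the last hidden ReLU layer."""
    h = x
    for W in weights:
        h = np.maximum(W @ h, 0.0)
    return h


class LastLayerUCB:
    """Ridge regression plus a UCB bonus on the last-layer features only."""

    def __init__(self, dim, lam=1.0, alpha=1.0):
        self.A = lam * np.eye(dim)  # regularized Gram matrix of chosen features
        self.b = np.zeros(dim)      # reward-weighted sum of chosen features
        self.alpha = alpha          # exploration weight (assumed constant here)

    def select(self, feats):
        """Pick the action with the largest upper confidence bound.

        feats has shape (K, dim): one feature row per candidate action.
        """
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        mean = feats @ theta
        # Per-action confidence width sqrt(phi^T A^{-1} phi).
        width = np.sqrt(np.einsum("kd,de,ke->k", feats, A_inv, feats))
        return int(np.argmax(mean + self.alpha * width))

    def update(self, feat, reward):
        self.A += np.outer(feat, feat)
        self.b += reward * feat


def run(T=200, K=4, d=8, m=16, seed=0):
    """Simulate T rounds with K actions; return cumulative regret and the policy."""
    rng = np.random.default_rng(seed)
    # Assumption: two hidden layers frozen at random initialization (no training).
    weights = [
        rng.normal(scale=np.sqrt(2.0 / d), size=(m, d)),
        rng.normal(scale=np.sqrt(2.0 / m), size=(m, m)),
    ]
    # Assumption: the true reward is linear in the last-layer features.
    theta_star = rng.normal(size=m) / np.sqrt(m)
    policy = LastLayerUCB(dim=m)
    regret = 0.0
    for _ in range(T):
        X = rng.normal(size=(K, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm raw features
        feats = np.stack([relu_features(x, weights) for x in X])
        expected = feats @ theta_star
        a = policy.select(feats)
        reward = expected[a] + 0.1 * rng.normal()
        policy.update(feats[a], reward)
        regret += expected.max() - expected[a]
    return regret, policy
```

Calling `run()` returns the cumulative regret of this toy simulation together with the fitted policy. The per-round cost is dominated by inverting an `m × m` matrix, where `m` is the width of the last hidden layer, which is the sense in which exploration restricted to the last layer stays cheap compared with maintaining confidence sets over all network parameters.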
