Deep Modular Co-Attention Networks for Visual Question Answering