MiniVLM: A Faster and Smaller Vision-Language Model

Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer