Went back to BLIP (https://arxiv.org/abs/2201.12086) last night. When I first skimmed it, I mostly paid attention to the caption-bootstrapping part, but the "Multimodal mixture of Encoder-Decoder" (MED) architecture is pretty cool too.
It packs multiple encoders and decoders into one model, and some of the training objectives piggyback on others (e.g. the similarities from the contrastive loss are used to mine hard negatives for the image-text matching loss).
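Roughly, that hard-negative trick looks something like the sketch below. This is a minimal sketch under my own assumptions, not the paper's code: the tensor names and the `sample_hard_negatives` helper are placeholders, and I'm assuming the embeddings come from the contrastive (ITC) branch already normalized.

```python
# Minimal sketch (not the BLIP authors' code): use in-batch contrastive
# similarities to pick hard negative captions for the image-text matching
# (ITM) objective. `image_feats` / `text_feats` are assumed to be
# L2-normalized [B, D] embeddings from the contrastive branch.
import torch

def sample_hard_negatives(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """For each image, return the index of a non-matching caption in the
    batch, sampled with probability proportional to its contrastive
    similarity to that image (so harder negatives are picked more often)."""
    sim = image_feats @ text_feats.t() / temperature        # [B, B] similarities
    weights = torch.softmax(sim, dim=1)
    weights = weights.clone()
    weights.fill_diagonal_(0)                                # exclude the true (positive) pair
    neg_text_idx = torch.multinomial(weights, num_samples=1)
    return neg_text_idx.squeeze(1)                           # [B] indices into the batch of texts
```

The sampled (image, negative caption) pairs then go through the ITM head alongside the true pairs, so the matching loss spends its capacity on exactly the negatives the contrastive branch already finds confusing.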