Unified Visual Relationship Detection with Vision and Language Models
VLM for scene understanding via visual relationship detection (VRD). Uses a DETR-like object detector (with bounding box prediction) and a Perceiver Resampler as the relationship decoder.
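Rough sketch of the resampler idea, not the paper's implementation: a fixed set of learned queries cross-attends to a variable number of object embeddings, so the decoder output size does not depend on how many objects were detected. All names and sizes here (`relationship_queries`, `NUM_QUERIES`, `DIM`) are made up for illustration, and random vectors stand in for learned parameters and detector outputs. Learned projections, multiple heads, residuals, and feed-forward layers are left out.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Single-head scaled dot-product cross-attention, no learned projections.

    queries: Q vectors (hypothetical learned relationship queries)
    tokens:  N vectors (hypothetical object embeddings from the detector)
    Returns Q vectors, each a weighted average of the tokens.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, t)) for t in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


def random_vectors(count, dim, rng):
    """Stand-in for learned parameters or detector outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8          # embedding width (assumed value)
NUM_QUERIES = 4  # fixed number of relationship slots (assumed value)
relationship_queries = random_vectors(NUM_QUERIES, DIM, rng)

# The number of detected objects varies per image; the output size does not.
few_objects = random_vectors(3, DIM, rng)
many_objects = random_vectors(10, DIM, rng)
out_few = cross_attend(relationship_queries, few_objects)
out_many = cross_attend(relationship_queries, many_objects)
print(len(out_few), len(out_many))
```

Both calls return `NUM_QUERIES` vectors of width `DIM`, whether the image has 3 objects or 10. In a real model, each output slot would then be decoded into a subject, predicate, and object prediction.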
My summary on HFPapers: https://huggingface.co/papers/2303.08998#64ff22002597506d5adf7966
arXiv: https://arxiv.org/abs/2303.08998