Researchers at Stanford University Propose Locality Alignment: A New Post-Training Stage for Vision Transformers (ViTs)
Vision-Language Models (VLMs) struggle with spatial reasoning tasks like object localization, counting, and relational question-answering. This issue stems from Vision Transformers (ViTs) trained with image-level supervision, which often fail to…