VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Extensive experiments show that VALOR learns strong multimodal correlations and generalizes to various downstream tasks with different input modalities (e.g., vision-language, audio-language, and audiovisual-language), achieving new state-of-the-art performance on a series of public cross-modality benchmarks.