VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Extensive experiments show that VALOR learns strong multimodal correlations and generalizes well to various downstream tasks with different input modalities (e.g., vision-language, audio-language, and audiovisual-language), achieving new state-of-the-art performance on a series of public cross-modality benchmarks.

April 17, 2023
Citations: 40
Sihan Chen, Xingjian He, et al.