Image quality assessment (IQA) mainly focuses on how distortions and other quality issues in images affect human perception.
In comparison with IQA, image aesthetic assessment (IAA) (Murray et al., 2012) is a more complicated visual scoring task. While visual quality is also considered to influence visual aesthetics, higher-level visual attributes, such as content, lighting, color, and composition (Kong et al., 2016), are considered more important for IAA.
Built on LMMs with rich prior knowledge, the proposed Q-Align remarkably outperforms CLIP-based approaches without extra pre-training.
Named video quality assessment (VQA), this task is also complicated: several studies have shown that scores are affected not only by quality issues but also by content (Li et al., 2019) and even aesthetics (Wu et al., 2023d).
Nevertheless, while the goal of VQA is similar to that of IQA (or IAA), the need to take videos as input has prevented this task from being tackled with the same model structure as image scoring approaches.
A typical example is CLIP-based attempts: as CLIP is image-based, it can achieve good zero-shot VQA capabilities through frame-by-frame inference (Wu et al., 2023b), yet training CLIP-based methods on VQA datasets is extremely challenging (Wu et al., 2023c) and performs worse than specially-designed VQA models.
In the proposed Q-Align, we utilize the language decoder to assemble videos as sequences of frames, unifying VQA with IQA/IAA under one structure and outperforming complicated, specifically-designed architectures.
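As a rough sketch of how this assembly can look in code (the names `visual_encoder` and `projector` below are hypothetical stand-ins for the LMM's vision tower and its vision-to-language projection, not the exact implementation), a video is reduced to sampled frames whose visual tokens are concatenated into one sequence for the language decoder:

```python
import torch

def assemble_video_tokens(frames, visual_encoder, projector):
    """Encode sampled frames and concatenate their visual tokens, so the
    language decoder sees the video simply as a sequence of frames.
    `visual_encoder` and `projector` are placeholders for the LMM's vision
    tower and its vision-to-language projection layer."""
    frame_tokens = []
    for frame in frames:                             # frame: (3, H, W) tensor
        feats = visual_encoder(frame.unsqueeze(0))   # (1, N, D_vis)
        tokens = projector(feats)                    # (1, N, D_lm)
        frame_tokens.append(tokens)
    # Concatenating along the sequence dimension turns the video into one
    # long sequence of frame tokens, analogous to a multi-image input.
    return torch.cat(frame_tokens, dim=1)            # (1, T*N, D_lm)
```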
Some recent investigations have discussed the possibility of adopting Large Multi-modality Models (LMMs) for visual scoring.
[Figure: inference-time query "<img> Rate the quality of the image." answered with the <LEVEL> token; the prediction on the <LEVEL> token of LMMs is the probability distribution over the rating levels.]

The conversation formats used for training on the three tasks are as follows:

#User: <img> Can you evaluate the quality of the image?
#Assistant: The quality of the image is <level>.
#User: <img> How is the aesthetics of the image?
#Assistant: The aesthetics of the image is <level>.
#User: <img> Rate the quality of the video.
#Assistant: The quality of the video is <level>.
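As noted above, the prediction on the <LEVEL> token is a probability distribution over the rating levels. The sketch below, which assumes five illustrative level words mapped to the weights 5 through 1, shows how such a distribution could be converted back to a scalar score via a close-set softmax and a weighted average:

```python
import torch

# Illustrative rating levels and numeric weights (assumed, not prescribed here).
LEVELS = ["excellent", "good", "fair", "poor", "bad"]
WEIGHTS = torch.tensor([5.0, 4.0, 3.0, 2.0, 1.0])

def level_logits_to_score(logits_at_level_position, level_token_ids):
    """Close-set softmax over the level tokens, followed by a weighted average.
    `logits_at_level_position`: (vocab_size,) logits at the <LEVEL> position.
    `level_token_ids`: token ids of the level words in the LMM vocabulary."""
    level_logits = logits_at_level_position[level_token_ids]   # (5,)
    probs = torch.softmax(level_logits, dim=-1)                # distribution over levels
    return (probs * WEIGHTS).sum().item()                      # scalar score in [1, 5]
```

For instance, probabilities of 0.6, 0.3, and 0.1 on the first three levels would yield 5×0.6 + 4×0.3 + 3×0.1 = 4.5.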
The user queries are randomly chosen from a group of paraphrases as an augmentation. Following Zheng et al. (2023), only the LMM responses (after #Assistant:) are supervised.
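One common way to realize this response-only supervision is to mask every token before the assistant response when building the language-modeling labels; the following is a simplified sketch under that assumption (not the exact training code), where `response_start` marks the first token after "#Assistant:":

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def build_labels(input_ids, response_start):
    """Copy input_ids as labels, but mask all tokens before the assistant
    response so that only the response tokens are supervised."""
    labels = list(input_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX
    return labels
```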
* This paper guides LMMs with discrete text-defined rating levels, achieving significant progress in the field of visual scoring.
* The proposed syllabus, named Q-Align, brings notable improvements on three tasks and is further unified into one OneAlign model.
* Q-Align shows great potential and opens up a new direction.