In recent years, Vision Transformer-based approaches for low-level vision tasks have achieved widespread success. Unlike CNN-based models, Transformers are more adept at capturing long-range dependencies, enabling image reconstruction that draws on non-local information. In super-resolution, Swin-Transformer-based models have become mainstream owing to their capability for global spatial information modeling and a shifted-window attention mechanism that facilitates the exchange of information between different windows. Many researchers have enhanced model performance by expanding receptive fields or designing meticulous networks, yielding commendable results. However, we observe a general phenomenon: feature-map intensities are abruptly suppressed to small values toward the end of the network. This implies an information bottleneck and a loss of spatial information, implicitly limiting the model's potential. To address this, we propose the Dense-residual-connected Transformer (DRCT), which mitigates the loss of spatial information and stabilizes the information flow through dense-residual connections between layers, thereby unleashing the model's potential and keeping it clear of the information bottleneck. Experimental results indicate that our approach surpasses state-of-the-art methods on benchmark datasets and performs commendably in the NTIRE-2024 Image Super-Resolution (x4) Challenge. Our source code is available at https://github.com/ming053l/DRCT
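To make the dense-residual pattern concrete, here is a minimal NumPy sketch of a dense-residual group: every layer consumes the concatenation of the group input and all earlier outputs, and a final fusion is added back to the input. The layer type, growth width, and fusion are illustrative assumptions of ours, not the DRCT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(c_in, c_out):
    """Toy per-pixel linear layer with ReLU (stand-in for a Swin block)."""
    w = rng.normal(scale=0.02, size=(c_in, c_out))
    return lambda t: np.maximum(t @ w, 0.0)

def dense_residual_group(x, n_layers=4, growth=16):
    """Each layer sees the channel-wise concatenation of the group input
    and all previous outputs (dense connections), so early spatial
    information keeps flowing; the residual sum stabilizes the flow."""
    c = x.shape[-1]
    feats = [x]
    for i in range(n_layers):
        layer = make_layer(c + i * growth, growth)
        feats.append(layer(np.concatenate(feats, axis=-1)))
    fusion = rng.normal(scale=0.02, size=(c + n_layers * growth, c))
    return x + np.concatenate(feats, axis=-1) @ fusion  # residual connection

x = rng.normal(size=(8, 8, 32))       # H x W x C feature map
print(dense_residual_group(x).shape)  # (8, 8, 32)
```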
This paper introduces the concept of accessibility from the field of transportation planning and adapts it to the context of Information Retrieval (IR). An analogy is drawn between the two fields, which motivates the development of document accessibility measures for IR systems. Considering the accessibility of documents within a collection, given an IR system, provides a different perspective on the analysis and evaluation of such systems, one that could inform the design, tuning, and management of current and future IR systems.
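One way to picture such a measure is the hedged sketch below: a document's accessibility grows each time the system retrieves it across a sample of queries, discounted by rank. The 1/rank discount and the rank cutoff are illustrative choices of ours, not a measure prescribed by the paper.

```python
from collections import defaultdict

def accessibility(runs, cutoff=100):
    """Cumulative rank-discounted accessibility over a query sample:
    runs is a list of ranked document-id lists, one per query."""
    score = defaultdict(float)
    for ranking in runs:
        for rank, doc in enumerate(ranking[:cutoff], start=1):
            score[doc] += 1.0 / rank   # deeper ranks contribute less
    return dict(score)

runs = [['d1', 'd2', 'd3'], ['d2', 'd1'], ['d2']]
print(accessibility(runs))  # d2 is the most accessible under this system
```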
Transformer-based methods have recently achieved great success in image inpainting. However, we find that these solutions regard each pixel as a token and thus suffer from information loss in two respects: 1) they downsample the input image to much lower resolutions for efficiency; 2) they quantize $256^3$ RGB values to a small number (such as 512) of quantized color values. The indices of the quantized pixels are then used as both the inputs and the prediction targets of the transformer. To mitigate these issues, we propose a new transformer-based framework called "PUT". Specifically, to avoid input downsampling while maintaining computational efficiency, we design a patch-based auto-encoder, P-VQVAE. The encoder converts the masked image into non-overlapping patch tokens, and the decoder recovers the masked regions from the inpainted tokens while keeping the unmasked regions unchanged. To eliminate the information loss caused by input quantization, an Un-quantized Transformer is applied: it directly takes features from the P-VQVAE encoder as input without any quantization and regards the quantized tokens only as prediction targets. Furthermore, to make the inpainting process more controllable, we introduce semantic and structural conditions as extra guidance. Extensive experiments show that our method greatly outperforms existing transformer-based methods in image fidelity and achieves much higher diversity and better fidelity than state-of-the-art pluralistic inpainting methods on complex large-scale datasets (e.g., ImageNet). Code is available at https://github.com/liuqk3/PUT.
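The input/target split can be sketched in a few lines of NumPy: non-overlapping patch vectors stand in for the encoder's un-quantized features (the transformer inputs), while nearest-codebook indices stand in for the quantized prediction targets. The patch size, feature form, and codebook size here are our own assumptions, not the P-VQVAE design.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_patch_tokens(img, patch=8):
    """Split an H x W x 3 image into non-overlapping patch vectors
    (a crude stand-in for the P-VQVAE encoder)."""
    h, w, c = img.shape
    p = img.reshape(h // patch, patch, w // patch, patch, c)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def nearest_code(tokens, codebook):
    """Indices of the nearest codebook entries: used only as prediction
    TARGETS, while the un-quantized tokens remain the transformer inputs."""
    d = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

img = rng.random((32, 32, 3))
tokens = to_patch_tokens(img)              # un-quantized inputs
codebook = rng.random((512, tokens.shape[1]))
targets = nearest_code(tokens, codebook)   # discrete targets
print(tokens.shape, targets.shape)         # (16, 192) (16,)
```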
In this chapter, we consider generative information retrieval evaluation from two distinct but interrelated perspectives. First, large language models (LLMs) themselves are rapidly becoming tools for evaluation, with current research indicating that LLMs may be superior to crowdsourced workers and other paid assessors on basic relevance judgement tasks. We review past and ongoing related research, including speculation on the future of shared task initiatives, such as TREC, and a discussion on the continuing need for human assessments. Second, we consider the evaluation of emerging LLM-based generative information retrieval (GenIR) systems, including retrieval augmented generation (RAG) systems. We consider approaches that focus both on the end-to-end evaluation of GenIR systems and on the evaluation of a retrieval component as an element in a RAG system. Going forward, we expect the evaluation of GenIR systems to be at least partially based on LLM-based assessment, creating an apparent circularity, with a system seemingly evaluating its own output. We resolve this apparent circularity in two ways: 1) by viewing LLM-based assessment as a form of "slow search", where a slower IR system is used for evaluation and training of a faster production IR system; and 2) by recognizing a continuing need to ground evaluation in human assessment, even if the characteristics of that human assessment must change.
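As a hedged illustration of the first perspective, an LLM-based relevance assessor can be as simple as the sketch below; the `call_llm` stub, the prompt wording, and the 0-3 graded scale are placeholder assumptions of ours, not a system described in this chapter.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; wiring it to a real
    API is left to the reader (this stub is an assumption, not an API)."""
    raise NotImplementedError

def judge_relevance(query: str, passage: str) -> int:
    """Ask an LLM for a graded relevance judgement on a 0-3 scale, the
    kind of basic assessment task discussed above."""
    prompt = (
        "Rate how well the passage answers the query on a 0-3 scale "
        "(0 = not relevant, 3 = perfectly relevant). Reply with one digit.\n"
        f"Query: {query}\nPassage: {passage}\nRating:"
    )
    reply = call_llm(prompt).strip()
    return int(reply[0]) if reply[:1].isdigit() else 0
```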
In Pliable Private Information Retrieval (PPIR) with a single server, messages are partitioned into $\Gamma$ non-overlapping classes \cite{ref5}. The user wants to retrieve a message from its desired class without revealing the identity of that class to the server. In \cite{ref6}, Obead et al. consider the problem of PPIR with Side Information (PPIR-SI), where the user now has side information. The user wants to retrieve any new message (one not included in the side information) from its desired class without revealing the identity of the desired class or its side information. A scheme for PPIR-SI is given in \cite{ref6} for the case when the user's side information is unidentifiable, a case referred to as PPIR with Unidentifiable SI (PPIR-USI). In this paper, we study the problem of PPIR for the single-server case when the side information is partially identifiable, and we term this case PPIR with Identifiable Side Information (PPIR-ISI). The user knows the identity of the side information belonging to $\eta$ classes, where $1\leq \eta \leq \Gamma$. We give a scheme for PPIR-ISI, and we prove that having identifiable side information is advantageous by comparing the rate of the proposed scheme to the rate of the PPIR-USI scheme of \cite{ref6} in some cases. Further, we extend the problem of PPIR-ISI to the multi-user case, where users can collaboratively generate the query sets, and we give a scheme for this problem.
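For the rate comparison invoked above, a hedged reminder of the standard convention in the PIR literature (our generic notation, not necessarily the paper's): the rate of a retrieval scheme is the ratio of the desired message length to the total download,
$$R = \frac{L}{D},$$
where $L$ is the number of symbols in the retrieved message and $D$ is the total number of symbols downloaded from the server; identifiable side information is advantageous precisely when the PPIR-ISI scheme achieves a strictly larger $R$ than the PPIR-USI scheme of \cite{ref6}.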
Skeleton-based zero-shot action recognition aims to recognize unknown human actions from the learned priors of known skeleton-based actions and a semantic descriptor space shared by both known and unknown categories. However, previous works focus on building bridges between the known skeleton representation space and the semantic description space only at the coarse-grained level, ignoring the fine-grained alignment of these two spaces and yielding suboptimal performance in distinguishing highly similar action categories. To address these challenges, we propose a novel method, side information and dual-prompt learning for skeleton-based zero-shot action recognition (STAR), which operates at the fine-grained level. Specifically, 1) we decompose the skeleton into several parts based on its topology and introduce side information consisting of multi-part descriptions of human body movements to align the skeleton and semantic spaces at the fine-grained level; 2) we design visual-attribute and semantic-part prompts to improve intra-class compactness within the skeleton space and inter-class separability within the semantic space, respectively, so as to distinguish highly similar actions. Extensive experiments show that our method achieves state-of-the-art performance in ZSL and GZSL settings on the NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets.
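The fine-grained alignment idea can be pictured with a small NumPy sketch: score each action class by averaging cosine similarities between per-part skeleton features and that class's per-part text embeddings. The part count, dimensions, and simple averaging are illustrative assumptions of ours, not the STAR architecture.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def part_level_scores(part_feats, part_text_embs):
    """part_feats: (P, d) one feature per body part for a skeleton sample.
    part_text_embs: (K, P, d) per-class, per-part description embeddings.
    Returns one score per class from averaged part-wise cosine similarity."""
    sims = np.einsum('pd,kpd->kp', l2norm(part_feats), l2norm(part_text_embs))
    return sims.mean(axis=1)

rng = np.random.default_rng(0)
scores = part_level_scores(rng.normal(size=(5, 64)),      # 5 body parts
                           rng.normal(size=(10, 5, 64)))  # 10 classes
print(scores.argmax())   # predicted (possibly unseen) action class
```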
The Information Bottleneck (IB) method is an information-theoretic framework for designing a parsimonious and tunable feature-extraction mechanism, such that the extracted features are maximally relevant to a specific learning or inference task. Despite its theoretical value, the IB rests on a functional optimization problem that admits a closed-form solution only in specific cases (e.g., Gaussian distributions), making it difficult to apply in most settings, which must resort to complex, approximate variational implementations. To overcome this limitation, we propose an approach that adapts the closed-form solution of the Gaussian IB to a general task. Whatever inference task a (possibly deep) neural network must perform, the key idea is to opportunistically design a regression sub-task, embedded in the original problem, for which we can safely assume (joint) multivariate normality between the sub-task's inputs and outputs. In this way, we can exploit a fixed, pre-trained neural network to process the input data with a tunable number of features, trading data size and complexity for accuracy. This approach is particularly useful whenever a device needs to transmit data (or features) to a server that has to fulfil an inference task, as it provides a principled way to extract the most relevant features for the task at hand while seeking the best trade-off between the size of the feature vector to be transmitted, inference accuracy, and complexity. Extensive simulation results testify to the effectiveness of the proposed method and encourage further investigation of this research line.
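The closed-form Gaussian solution the approach builds on can be sketched directly in NumPy: for jointly Gaussian $X, Y$, the optimal linear features are left eigenvectors of $\Sigma_{x|y}\Sigma_x^{-1}$ with the smallest eigenvalues, and keeping $k$ of them gives the tunable feature count mentioned above. This is a generic rendering of the known Gaussian-IB result, not the paper's code; names and shapes are our own.

```python
import numpy as np

def gaussian_ib_projection(Sx, Sxy, Sy, k):
    """Left eigenvectors of Sigma_{x|y} Sigma_x^{-1} (computed as right
    eigenvectors of its transpose) with the smallest eigenvalues are the
    most task-relevant directions; keep k of them as the projector rows."""
    Sx_given_y = Sx - Sxy @ np.linalg.inv(Sy) @ Sxy.T  # conditional cov.
    evals, evecs = np.linalg.eig(np.linalg.inv(Sx) @ Sx_given_y)
    order = np.argsort(evals.real)                     # small = relevant
    return evecs.real[:, order[:k]].T                  # (k, dim_x)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
Y = X @ rng.normal(size=(6, 3)) + 0.1 * rng.normal(size=(1000, 3))
S = np.cov(np.hstack([X, Y]).T)
A = gaussian_ib_projection(S[:6, :6], S[:6, 6:], S[6:, 6:], k=2)
print((X @ A.T).shape)   # (1000, 2) compressed features to transmit
```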
Personalized fairness in recommendations has been attracting increasing attention from researchers. Existing works often treat a fairness requirement, represented as a collection of sensitive attributes, as a hyper-parameter and pursue extreme fairness by completely removing sensitive-attribute information from the learned fair embedding, an approach that suffers from two challenges: the huge training cost incurred by the explosion of attribute combinations, and a suboptimal trade-off between fairness and accuracy. In this paper, we propose a novel Adaptive Fair Representation Learning (AFRL) model, which achieves truly personalized fairness by training only one model that adaptively serves different fairness requirements during the inference phase. In particular, AFRL treats fairness requirements as inputs and learns an attribute-specific embedding for each attribute from the unfair user embedding, which endows it with the adaptability, at inference time, to determine the non-sensitive attributes under the guidance of each user's unique fairness requirement. To achieve a better trade-off between fairness and accuracy in recommendations, AFRL conducts a novel Information Alignment to exactly preserve the discriminative information of non-sensitive attributes and incorporates a debiased collaborative embedding into the fair embedding to capture attribute-independent collaborative signals, without loss of fairness. Finally, extensive experiments on real datasets, together with a sound theoretical analysis, demonstrate the superiority of AFRL.
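A hedged sketch of the adaptive part of this design: given attribute-specific embeddings and a per-user fairness requirement supplied at inference time, drop the sensitive attributes and add a debiased collaborative embedding. The names, shapes, and simple averaging are our own assumptions, not the AFRL model.

```python
import numpy as np

def compose_fair_embedding(attr_embs, collab_emb, sensitive):
    """attr_embs: dict attribute name -> (d,) attribute-specific embedding.
    collab_emb: (d,) debiased, attribute-independent collaborative signal.
    sensitive: the user's fairness requirement, applied with no retraining."""
    kept = [e for name, e in attr_embs.items() if name not in sensitive]
    return collab_emb + (np.mean(kept, axis=0) if kept else 0.0)

attr_embs = {'gender': np.ones(4), 'age': 2 * np.ones(4), 'job': 3 * np.ones(4)}
fair = compose_fair_embedding(attr_embs, np.zeros(4), sensitive={'gender'})
print(fair)   # only age/job information remains: [2.5 2.5 2.5 2.5]
```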
While preference-based recommendation algorithms effectively enhance user engagement by recommending personalized content, they often create "filter bubbles". These bubbles restrict the range of information users interact with, inadvertently reinforcing their existing viewpoints. Previous research has focused on modifying the underlying algorithms to tackle this issue, yet approaches that maintain the integrity of the original algorithms remain largely unexplored. This paper introduces an Agent-based Information Neutrality model grounded in Yin-Yang theory, named AbIN. This approach targets the imbalance in information perception within existing recommendation systems and is designed to integrate with preference-based systems, ensuring the delivery of recommendations with neutral information. Our empirical evaluation demonstrates the model's efficacy, showing its capacity to expand information diversity while respecting user preferences. Consequently, AbIN emerges as an instrumental tool for mitigating the negative impact of filter bubbles on information consumption.
How are emotions formed? Through extensive debate and the promulgation of diverse theories, the theory of constructed emotion has become prevalent in recent research on emotions. According to this theory, an emotion concept refers to a category formed by interoceptive and exteroceptive information associated with a specific emotion. An emotion concept stores past experiences as knowledge and can predict unobserved information from acquired information. In this study, we therefore attempted to model the formation of emotion concepts using a constructionist approach, from the perspective of constructed emotion theory. Specifically, we constructed a model using multilayered multimodal latent Dirichlet allocation, a probabilistic generative model. We then trained the model for each subject using vision, physiology, and word information obtained from multiple people who experienced different visual emotion-evoking stimuli. To evaluate the model, we verified whether the formed categories matched human subjectivity and whether unobserved information could be predicted via the categories. The verification results exceeded chance level, suggesting that emotion concept formation can be explained by the proposed model.
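The "predict unobserved information via categories" step can be pictured with a toy Bayesian sketch: infer a posterior over emotion categories from one observed modality, then read out the expected signal of an unobserved modality. The Gaussian observation model and category parameters are illustrative assumptions of ours, not the paper's multilayered multimodal latent Dirichlet allocation.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3                                  # assumed number of emotion categories
prior = np.full(K, 1.0 / K)
mu_vision = rng.normal(size=(K, 4))    # toy per-category vision means
mu_physio = rng.normal(size=(K, 2))    # toy per-category physiology means

def predict_physiology(vision_obs, sigma=1.0):
    """p(k|v) ~ p(k) N(v; mu_k, sigma^2 I); the expected physiology is
    the posterior-weighted mean sum_k p(k|v) mu_k."""
    log_lik = -((vision_obs - mu_vision) ** 2).sum(axis=1) / (2 * sigma**2)
    post = np.exp(log_lik - log_lik.max()) * prior
    post /= post.sum()
    return post @ mu_physio            # predicted unobserved modality

print(predict_physiology(mu_vision[1] + 0.1 * rng.normal(size=4)))
```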