Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jie Zhou

DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery

Apr 01, 2024

Yixuan Zhu, Ao Li, Yansong Tang, Wenliang Zhao, Jie Zhou, Jiwen Lu

Figure 1 for DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery

Figure 2 for DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery

Figure 3 for DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery

Figure 4 for DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery

Abstract:The recovery of occluded human meshes presents challenges for current methods due to the difficulty in extracting effective image features under severe occlusion. In this paper, we introduce DPMesh, an innovative framework for occluded human mesh recovery that capitalizes on the profound diffusion prior about object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model. Unlike previous methods reliant on conventional backbones for vanilla feature extraction, DPMesh seamlessly integrates the pre-trained denoising U-Net with potent knowledge as its image backbone and performs a single-step inference to provide occlusion-aware information. To enhance the perception capability for occluded poses, DPMesh incorporates well-designed guidance via condition injection, which produces effective controls from 2D observations for the denoising U-Net. Furthermore, we explore a dedicated noisy key-point reasoning approach to mitigate disturbances arising from occlusion and crowded scenarios. This strategy fully unleashes the perceptual capability of the diffusion prior, thereby enhancing accuracy. Extensive experiments affirm the efficacy of our framework, as we outperform state-of-the-art methods on both occlusion-specific and standard datasets. The persuasive results underscore its ability to achieve precise and robust 3D human mesh recovery, particularly in challenging scenarios involving occlusion and crowded scenes.

* Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

Via

Access Paper or Ask Questions

On Large Language Models' Hallucination with Regard to Known Facts

Mar 29, 2024

Che Jiang, Biqing Qi, Xiangyu Hong, Dayuan Fu, Yang Cheng, Fandong Meng, Mo Yu, Bowen Zhou, Jie Zhou

Figure 1 for On Large Language Models' Hallucination with Regard to Known Facts

Figure 2 for On Large Language Models' Hallucination with Regard to Known Facts

Figure 3 for On Large Language Models' Hallucination with Regard to Known Facts

Figure 4 for On Large Language Models' Hallucination with Regard to Known Facts

Abstract:Large language models are successful in answering factoid questions but are also prone to hallucination.We investigate the phenomenon of LLMs possessing correct answer knowledge yet still hallucinating from the perspective of inference dynamics, an area not previously covered in studies on hallucinations.We are able to conduct this analysis via two key ideas.First, we identify the factual questions that query the same triplet knowledge but result in different answers. The difference between the model behaviors on the correct and incorrect outputs hence suggests the patterns when hallucinations happen. Second, to measure the pattern, we utilize mappings from the residual streams to vocabulary space. We reveal the different dynamics of the output token probabilities along the depths of layers between the correct and hallucinated cases. In hallucinated cases, the output token's information rarely demonstrates abrupt increases and consistent superiority in the later stages of the model. Leveraging the dynamic curve as a feature, we build a classifier capable of accurately detecting hallucinatory predictions with an 88\% success rate. Our study shed light on understanding the reasons for LLMs' hallucinations on their known facts, and more importantly, on accurately predicting when they are hallucinating.

* Accepted by NAACL 2024 MainConference

Via

Access Paper or Ask Questions

Boosting Conversational Question Answering with Fine-Grained Retrieval-Augmentation and Self-Check

Mar 27, 2024

Linhao Ye, Zhikai Lei, Jianghao Yin, Qin Chen, Jie Zhou, Liang He

Abstract:Retrieval-Augmented Generation (RAG) aims to generate more reliable and accurate responses, by augmenting large language models (LLMs) with the external vast and dynamic knowledge. Most previous work focuses on using RAG for single-round question answering, while how to adapt RAG to the complex conversational setting wherein the question is interdependent on the preceding context is not well studied. In this paper, we propose a conversation-level RAG approach, which incorporates fine-grained retrieval augmentation and self-check for conversational question answering (CQA). In particular, our approach consists of three components, namely conversational question refiner, fine-grained retriever and self-check based response generator, which work collaboratively for question understanding and relevant information acquisition in conversational settings. Extensive experiments demonstrate the great advantages of our approach over the state-of-the-art baselines. Moreover, we also release a Chinese CQA dataset with new features including reformulated question, extracted keyword, retrieved paragraphs and their helpfulness, which facilitates further researches in RAG enhanced CQA.

Via

Access Paper or Ask Questions

Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

Mar 21, 2024

Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, Jiwen Lu

Figure 1 for Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

Figure 2 for Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

Figure 3 for Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

Figure 4 for Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

Abstract:In the realm of vision-language understanding, the proficiency of models in interpreting and reasoning over visual content has become a cornerstone for numerous applications. However, it is challenging for the visual encoder in Large Vision-Language Models (LVLMs) to extract useful features tailored to questions that aid the language model's response. Furthermore, a common practice among existing LVLMs is to utilize lower-resolution images, which restricts the ability for visual recognition. Our work introduces the Chain-of-Spot (CoS) method, which we describe as Interactive Reasoning, a novel approach that enhances feature extraction by focusing on key regions of interest (ROI) within the image, corresponding to the posed questions or instructions. This technique allows LVLMs to access more detailed visual information without altering the original image resolution, thereby offering multi-granularity image features. By integrating Chain-of-Spot with instruct-following LLaVA-1.5 models, the process of image reasoning consistently improves performance across a wide range of multimodal datasets and benchmarks without bells and whistles and achieves new state-of-the-art results. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content, paving the way for more sophisticated visual instruction-following applications. Code and models are available at https://github.com/dongyh20/Chain-of-Spot

* Project Page: https://sites.google.com/view/chain-of-spot/

Via

Access Paper or Ask Questions

Enhancing Event Causality Identification with Rationale and Structure-Aware Causal Question Answering

Mar 17, 2024

Baiyan Zhang, Qin Chen, Jie Zhou, Jian Jin, Liang He

Abstract:Document-level Event Causality Identification (DECI) aims to identify causal relations between two events in documents. Recent research tends to use pre-trained language models to generate the event causal relations. Whereas, these methods are prone to the errors of sequential generation due to multiple events in a document. Moreover, the potential structures such as event coreference and related causal chain are neglected. In this paper, we propose a multi-task learning framework to enhance event causality identification with rationale and structure-aware causal question answering. Specifically, the DECI task is transformed into multiple-choice question answering, and the causes and effects of the questioned event are generated with large language models. In addition, we generate the rationales to explain why these events have causal relations. Moreover, we construct an event structure graph, which models the multi-hop potential relations for causal reasoning of the current event. Experiments on two benchmark datasets show the great advantages of our proposed approach compared to the state-of-the-art methods. Moreover, we conduct both quantitative and qualitative analyses, which shed light on why each component of our approach can lead to great improvements.

Via

Access Paper or Ask Questions

RCdpia: A Renal Carcinoma Digital Pathology Image Annotation dataset based on pathologists

Mar 17, 2024

Qingrong Sun, Weixiang Zhong, Jie Zhou, Chong Lai, Xiaodong Teng, Maode Lai

Figure 1 for RCdpia: A Renal Carcinoma Digital Pathology Image Annotation dataset based on pathologists

Figure 2 for RCdpia: A Renal Carcinoma Digital Pathology Image Annotation dataset based on pathologists

Figure 3 for RCdpia: A Renal Carcinoma Digital Pathology Image Annotation dataset based on pathologists

Figure 4 for RCdpia: A Renal Carcinoma Digital Pathology Image Annotation dataset based on pathologists

Abstract:The annotation of digital pathological slide data for renal cell carcinoma is of paramount importance for correct diagnosis of artificial intelligence models due to the heterogeneous nature of the tumor. This process not only facilitates a deeper understanding of renal cell cancer heterogeneity but also aims to minimize noise in the data for more accurate studies. To enhance the applicability of the data, two pathologists were enlisted to meticulously curate, screen, and label a kidney cancer pathology image dataset from The Cancer Genome Atlas Program (TCGA) database. Subsequently, a Resnet model was developed to validate the annotated dataset against an additional dataset from the First Affiliated Hospital of Zhejiang University. Based on these results, we have meticulously compiled the TCGA digital pathological dataset with independent labeling of tumor regions and adjacent areas (RCdpia), which includes 109 cases of kidney chromophobe cell carcinoma, 486 cases of kidney clear cell carcinoma, and 292 cases of kidney papillary cell carcinoma. This dataset is now publicly accessible at http://39.171.241.18:8888/RCdpia/. Furthermore, model analysis has revealed significant discrepancies in predictive outcomes when applying the same model to datasets from different centers. Leveraging the RCdpia, we can now develop more precise digital pathology artificial intelligence models for tasks such as normalization, classification, and segmentation. These advancements underscore the potential for more nuanced and accurate AI applications in the field of digital pathology.

* 8 pages, 3 figures, 1 table

Via

Access Paper or Ask Questions

Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Mar 16, 2024

Zhiheng Li, Muheng Li, Jixuan Fan, Lei Chen, Yansong Tang, Jie Zhou, Jiwen Lu

Figure 1 for Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Figure 2 for Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Figure 3 for Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Figure 4 for Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Abstract:Scale arbitrary super-resolution based on implicit image function gains increasing popularity since it can better represent the visual world in a continuous manner. However, existing scale arbitrary works are trained and evaluated on simulated datasets, where low-resolution images are generated from their ground truths by the simplest bicubic downsampling. These models exhibit limited generalization to real-world scenarios due to the greater complexity of real-world degradations. To address this issue, we build a RealArbiSR dataset, a new real-world super-resolution benchmark with both integer and non-integer scaling factors for the training and evaluation of real-world scale arbitrary super-resolution. Moreover, we propose a Dual-level Deformable Implicit Representation (DDIR) to solve real-world scale arbitrary super-resolution. Specifically, we design the appearance embedding and deformation field to handle both image-level and pixel-level deformations caused by real-world degradations. The appearance embedding models the characteristics of low-resolution inputs to deal with photometric variations at different scales, and the pixel-based deformation field learns RGB differences which result from the deviations between the real-world and simulated degradations at arbitrary coordinates. Extensive experiments show our trained model achieves state-of-the-art performance on the RealArbiSR and RealSR benchmarks for real-world scale arbitrary super-resolution. Our dataset as well as source code will be publicly available.

Via

Access Paper or Ask Questions

Enhancing Depression-Diagnosis-Oriented Chat with Psychological State Tracking

Mar 12, 2024

Yiyang Gu, Yougen Zhou, Qin Chen, Ningning Zhou, Jie Zhou, Aimin Zhou, Liang He

Abstract:Depression-diagnosis-oriented chat aims to guide patients in self-expression to collect key symptoms for depression detection. Recent work focuses on combining task-oriented dialogue and chitchat to simulate the interview-based depression diagnosis. Whereas, these methods can not well capture the changing information, feelings, or symptoms of the patient during dialogues. Moreover, no explicit framework has been explored to guide the dialogue, which results in some useless communications that affect the experience. In this paper, we propose to integrate Psychological State Tracking (POST) within the large language model (LLM) to explicitly guide depression-diagnosis-oriented chat. Specifically, the state is adapted from a psychological theoretical model, which consists of four components, namely Stage, Information, Summary and Next. We fine-tune an LLM model to generate the dynamic psychological state, which is further used to assist response generation at each turn to simulate the psychiatrist. Experimental results on the existing benchmark show that our proposed method boosts the performance of all subtasks in depression-diagnosis-oriented chat.

Via

Access Paper or Ask Questions

Memory-based Adapters for Online 3D Scene Perception

Mar 11, 2024

Xiuwei Xu, Chong Xia, Ziwei Wang, Linqing Zhao, Yueqi Duan, Jie Zhou, Jiwen Lu

Abstract:In this paper, we propose a new framework for online 3D scene perception. Conventional 3D scene perception methods are offline, i.e., take an already reconstructed 3D scene geometry as input, which is not applicable in robotic applications where the input data is streaming RGB-D videos rather than a complete 3D scene reconstructed from pre-collected RGB-D videos. To deal with online 3D scene perception tasks where data collection and perception should be performed simultaneously, the model should be able to process 3D scenes frame by frame and make use of the temporal information. To this end, we propose an adapter-based plug-and-play module for the backbone of 3D scene perception model, which constructs memory to cache and aggregate the extracted RGB-D features to empower offline models with temporal learning ability. Specifically, we propose a queued memory mechanism to cache the supporting point cloud and image features. Then we devise aggregation modules which directly perform on the memory and pass temporal information to current frame. We further propose 3D-to-2D adapter to enhance image features with strong global context. Our adapters can be easily inserted into mainstream offline architectures of different tasks and significantly boost their performance on online tasks. Extensive experiments on ScanNet and SceneNN datasets demonstrate our approach achieves leading performance on three 3D scene perception tasks compared with state-of-the-art online methods by simply finetuning existing offline models, without any model and task-specific designs. \href{https://xuxw98.github.io/Online3D/}{Project page}.

* Accepted to CVPR24. Link: https://xuxw98.github.io/Online3D/

Via

Access Paper or Ask Questions

DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

Mar 01, 2024

Kedi Chen, Qin Chen, Jie Zhou, Yishen He, Liang He

Figure 1 for DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

Figure 2 for DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

Figure 3 for DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

Figure 4 for DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models

Abstract:Since large language models (LLMs) achieve significant success in recent years, the hallucination issue remains a challenge, numerous benchmarks are proposed to detect the hallucination. Nevertheless, some of these benchmarks are not naturally generated by LLMs but are intentionally induced. Also, many merely focus on the factuality hallucination while ignoring the faithfulness hallucination. Additionally, although dialogue pattern is more widely utilized in the era of LLMs, current benchmarks only concentrate on sentence-level and passage-level hallucination. In this study, we propose DiaHalu, the first dialogue-level hallucination evaluation benchmark to our knowledge. Initially, we integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5. Subsequently, we manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic human-machine interaction scenarios. Finally, professional scholars annotate all the samples in the dataset. DiaHalu covers four common multi-turn dialogue domains and five hallucination subtypes, extended from factuality and faithfulness hallucination. Experiments through some well-known LLMs and detection methods on the dataset show that DiaHalu is a challenging benchmark, holding significant value for further research.

Via

Access Paper or Ask Questions