Junyang Chen

SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors

Mar 20, 2025

Yang Chen, Hui Wang, Shiyao Wang, Junyang Chen, Jiabei He, Jiaming Zhou, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin

Abstract: While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super-aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions, exacerbates this issue. To address the critical scarcity of speech data from individuals aged 75 and above, we introduce SeniorTalk, a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants, ensuring a strategic balance across gender, region, and age. Through detailed annotation across multiple dimensions, it can support a wide range of speech tasks. We perform extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, offering crucial insights for the development of speech technologies targeting this age group.

HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

Jan 03, 2025

Heqing Zou, Tianze Luo, Guiyang Xie, Victor, Zhang, Fengmao Lv, Guangcong Wang, Junyang Chen, Zhuochen Wang, Hansheng Zhang (+1 more)

Abstract: Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, hour-long video understanding, spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analyses, 2) inefficient large-model approaches, and 3) lack of large-scale benchmark datasets. Among these, this paper focuses on building a large-scale hour-long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs with time-aware queries and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks. This includes promoting future long video understanding tasks at a granular level, such as deep understanding of long live videos, meeting recordings, and movies.

FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution

Nov 27, 2024

Junyang Chen, Jinshan Pan, Jiangxin Dong

Abstract: Faithful image super-resolution (SR) not only needs to recover images that appear realistic, similar to image generation tasks, but also requires that the restored images maintain fidelity and structural consistency with the input. To this end, we propose a simple and effective method, named FaithDiff, to fully harness the impressive power of latent diffusion models (LDMs) for faithful image SR. In contrast to existing diffusion-based SR methods that freeze the diffusion model pre-trained on high-quality images, we propose to unleash the diffusion prior to identify useful information and recover faithful structures. As there exists a significant gap between the features of degraded inputs and the noisy latent from the diffusion model, we then develop an effective alignment module to explore useful features from degraded inputs to align well with the diffusion process. Considering the indispensable roles and interplay of the encoder and diffusion model in LDMs, we jointly fine-tune them in a unified optimization framework, enabling the encoder to extract useful features that coincide with the diffusion process. Extensive experimental results demonstrate that FaithDiff outperforms state-of-the-art methods, providing high-quality and faithful SR results.

* Project page: https://jychen9811.github.io/FaithDiff_page/

Attributed Graph Clustering via Generalized Quaternion Representation Learning

Nov 22, 2024

Junyang Chen, Yiqun Zhang, Mengke Li, Yang Lu, Yiu-ming Cheung

Abstract: Clustering complex data in the form of attributed graphs has attracted increasing attention, where appropriate graph representation is a critical prerequisite for accurate cluster analysis. However, Graph Convolutional Networks homogenize the representations of graph nodes due to the well-known over-smoothing effect. This limits the network architecture to a shallow one, losing the ability to capture the critical global distribution information for clustering. Therefore, we propose a generalized graph auto-encoder network, which introduces quaternion operations to the encoders to achieve efficient structured feature representation learning without requiring a deeper network or a larger number of parameters. The generalization of our method lies in the following two aspects: 1) connecting the quaternion operation, which is naturally suited to four feature components, with graph data of arbitrary attribute dimensions, and 2) introducing a generalized graph clustering objective as a loss term to obtain clustering-friendly representations without requiring a pre-specified number of clusters $k$. It turns out that the representations of nodes learned by the proposed Graph Clustering based on Generalized Quaternion representation learning (GCGQ) are more discriminative, containing global distribution information, and are more general, suiting downstream clustering under different values of $k$. Extensive experiments, including significance tests, ablation studies, and qualitative results, illustrate the superiority of GCGQ. The source code is temporarily available at https://anonymous.4open.science/r/ICLR-25-No7181-codes.
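
The quaternion operations mentioned in this abstract rest on the Hamilton product, which multiplies two four-component values. The following is a minimal sketch of that product only, not the GCGQ encoder; the function name `hamilton_product` and the tuple layout `(w, x, y, z)` are assumptions made for illustration.

```python
def hamilton_product(p, q):
    """Multiply two quaternions given as (w, x, y, z) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


# Basis quaternions, used to check the defining identities.
i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
k = (0, 0, 0, 1)

ij = hamilton_product(i, j)  # equals k
ji = hamilton_product(j, i)  # equals -k, so the product is not commutative
```

Because the product mixes all four components in a fixed pattern, a quaternion layer can share weights across components, which is one common motivation for using it to keep parameter counts low.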

CUPID: Improving Battle Fairness and Position Satisfaction in Online MOBA Games with a Re-matchmaking System

Jun 28, 2024

Ge Fan, Chaoyun Zhang, Kai Wang, Yingjie Li, Junyang Chen, Zenglin Xu

Abstract: The multiplayer online battle arena (MOBA) genre has gained significant popularity and economic success, attracting considerable research interest within the Human-Computer Interaction community. Enhancing the gaming experience requires a deep understanding of player behavior, and a crucial aspect of MOBA games is matchmaking, which aims to assemble teams of comparable skill levels. However, existing matchmaking systems often neglect important factors such as players' position preferences and team assignment, resulting in imbalanced matches and reduced player satisfaction. To address these limitations, this paper proposes a novel framework called CUPID, which introduces a process called "re-matchmaking" to optimize team and position assignments and thereby improve both fairness and player satisfaction. CUPID incorporates a pre-filtering step to ensure a minimum level of matchmaking quality, followed by a pre-match win-rate prediction model that evaluates the fairness of potential assignments. By simultaneously considering players' position satisfaction and game fairness, CUPID aims to provide an enhanced matchmaking experience. Extensive experiments were conducted on two large-scale, real-world MOBA datasets to validate the effectiveness of CUPID. The results surpass all existing state-of-the-art baselines, with an average relative improvement of 7.18% in terms of win prediction accuracy. Furthermore, CUPID has been successfully deployed in a popular online mobile MOBA game. The deployment resulted in significant improvements in match fairness and player satisfaction, as evidenced by critical Human-Computer Interaction (HCI) metrics covering usability, accessibility, and engagement, observed through A/B testing. To the best of our knowledge, CUPID is the first re-matchmaking system designed specifically for large-scale MOBA games.

* 38 pages, accepted by CSCW 24

ROG$_{PL}$: Robust Open-Set Graph Learning via Region-Based Prototype Learning

Feb 29, 2024

Qin Zhang, Xiaowei Li, Jiexin Lu, Liping Qiu, Shirui Pan, Xiaojun Chen, Junyang Chen

Abstract: Open-set graph learning is a practical task that aims to classify the known class nodes and to identify unknown class samples as unknowns. Conventional node classification methods usually perform unsatisfactorily in open-set scenarios due to the complex data they encounter, such as out-of-distribution (OOD) data and in-distribution (IND) noise. OOD data are samples that do not belong to any known classes. They are outliers if they occur in training (OOD noise), and open-set samples if they occur in testing. IND noise consists of training samples that are assigned incorrect labels. IND noise and OOD noise are prevalent and usually cause the ambiguity problem, which includes the intra-class variety problem and the inter-class confusion problem. Thus, exploring robust open-set learning methods is necessary and difficult, and it becomes even more difficult for non-IID graph data. To this end, we propose a unified framework named ROG$_{PL}$ to achieve robust open-set learning on complex noisy graph data by introducing prototype learning. Specifically, ROG$_{PL}$ consists of two modules, i.e., denoising via label propagation and open-set prototype learning via regions. The first module corrects noisy labels through similarity-based label propagation and removes low-confidence samples, to solve the intra-class variety problem caused by noise. The second module learns open-set prototypes for each known class via non-overlapped regions and retains both interior and border prototypes to remedy the inter-class confusion problem. The two modules are iteratively updated under the constraints of classification loss and prototype diversity loss. To the best of our knowledge, the proposed ROG$_{PL}$ is the first robust open-set node classification method for graph data with complex noise.

* 9 pages, 5 figures
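
The first module described in this abstract, similarity-based label propagation followed by removal of low-confidence samples, can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not the ROG$_{PL}$ implementation: the function name `propagate_labels`, the single propagation round, the dense similarity matrix, and the `threshold` value are all hypothetical choices.

```python
def propagate_labels(similarity, labels, num_classes, threshold=0.55):
    """One propagation round; returns {node: label} for confident nodes only.

    similarity[i][j] is an assumed non-negative similarity between nodes i and j.
    """
    kept = {}
    for node, row in enumerate(similarity):
        # Accumulate similarity-weighted votes for each class.
        scores = [0.0] * num_classes
        for neighbor, weight in enumerate(row):
            scores[labels[neighbor]] += weight
        total = sum(scores)
        if total == 0:
            continue  # isolated node: no evidence, so drop it
        best = max(range(num_classes), key=lambda c: scores[c])
        # Keep the node only if the winning class holds a clear share of votes.
        if scores[best] / total >= threshold:
            kept[node] = best
    return kept


# Nodes 0-2 form a tight group; node 2 carries a noisy label (1 instead of 0).
# Node 3 sits between the classes and should be removed as low-confidence.
similarity = [
    [1.0, 0.9, 0.9, 0.5],
    [0.9, 1.0, 0.9, 0.5],
    [0.9, 0.9, 1.0, 0.0],
    [0.5, 0.5, 0.0, 1.0],
]
noisy_labels = [0, 0, 1, 1]
cleaned = propagate_labels(similarity, noisy_labels, num_classes=2)
```

In this toy graph the noisy label on node 2 is corrected to class 0 by its neighbors, while node 3 is dropped because its votes are split evenly.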

ParsNets: A Parsimonious Orthogonal and Low-Rank Linear Networks for Zero-Shot Learning

Dec 21, 2023

Jingcai Guo, Qihua Zhou, Ruibing Li, Xiaocheng Lu, Ziming Liu, Junyang Chen, Xin Xie, Jie Zhang

Abstract: This paper provides a novel parsimonious yet efficient design for zero-shot learning (ZSL), dubbed ParsNets, where we are interested in learning a composition of on-device friendly linear networks, each with orthogonality and low-rankness properties, to achieve equivalent or even better performance against existing deep models. Concretely, we first refactor the core module of ZSL, i.e., the visual-semantics mapping function, into several base linear networks that correspond to diverse components of the semantic space, where the complex nonlinearity can be collapsed into simple local linearities. Then, to facilitate the generalization of local linearities, we construct a maximal margin geometry on the learned features by enforcing low-rank constraints on intra-class samples and high-rank constraints on inter-class samples, resulting in orthogonal subspaces for different classes, with each subspace lying on a compact manifold. To enhance the model's adaptability and counterbalance over- and under-fitting in ZSL, a set of sample-wise indicators is employed to select a sparse subset from these base linear networks to form a composite semantic predictor for each sample. Notably, maximal margin geometry can guarantee the diversity of features, and meanwhile, local linearities guarantee efficiency. Thus, our ParsNets can generalize better to unseen classes and can be deployed flexibly on resource-constrained devices. Theoretical explanations and extensive experiments are provided to verify the effectiveness of the proposed method.

* 10 pages, 3 figures

Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval

Nov 15, 2023

Junyang Chen, Hanjiang Lai

Abstract: Zero-shot composed image retrieval (ZS-CIR), which aims to retrieve a target image based on textual modifications to a reference image without triplet labeling, has gained more and more attention. Current ZS-CIR research mainly relies on two unlabeled pre-trained models: the vision-language model, e.g., CLIP, and the Pic2Word/textual inversion model. However, the pre-trained models and CIR tasks have substantial discrepancies, where the pre-trained models learn the similarities between vision and language but CIR aims to learn the modifications of the image guided by text. In this paper, we introduce a novel unlabeled and pre-trained masked tuning approach to reduce the gap between the pre-trained model and the downstream CIR task. We first reformulate the pre-trained vision-language contrastive learning as the CIR task, where we randomly mask input image patches to generate a $\langle$masked image, text, image$\rangle$ triple from an image-text pair. Then, we propose masked tuning, which uses the text and the masked image to learn the modifications of the original image. With such a simple design, it can learn to capture fine-grained text-guided modifications. Extensive experimental results demonstrate the significant superiority of our approach over the baseline models on three ZS-CIR datasets, including FashionIQ, CIRR, and CIRCO.
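
The triple construction described in this abstract, randomly masking image patches of an image-text pair, can be sketched as follows. This is only an illustration of the data step under stated assumptions, not the authors' pipeline: the function name `make_masked_triple`, the use of `None` as a mask marker, and the `mask_ratio` parameter are hypothetical, and patches are represented as an abstract list rather than pixel arrays.

```python
import random


def make_masked_triple(patches, text, mask_ratio=0.5, seed=None):
    """Build a (masked image, text, image) triple from one image-text pair.

    `patches` is a list of image patches; masked positions become None.
    """
    rng = random.Random(seed)
    n_mask = int(len(patches) * mask_ratio)
    # Choose which patch positions to hide, uniformly at random.
    masked_positions = set(rng.sample(range(len(patches)), n_mask))
    masked_image = [
        None if idx in masked_positions else patch
        for idx, patch in enumerate(patches)
    ]
    # The unmasked original plays the role of the retrieval target.
    return masked_image, text, patches


# Toy usage: 16 placeholder patches standing in for a 4x4 patch grid.
toy_patches = list(range(16))
masked, caption, target = make_masked_triple(
    toy_patches, "a red dress", mask_ratio=0.5, seed=0
)
```

The masked image then acts as the reference, the text as the modification, and the original image as the target, mirroring the triplet form used at CIR inference time.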

Ranking-aware Uncertainty for Text-guided Image Retrieval

Aug 16, 2023

Junyang Chen, Hanjiang Lai

Abstract: Text-guided image retrieval incorporates conditional text to better capture users' intent. Traditionally, the existing methods focus on minimizing the embedding distances between the source inputs and the targeted image, using the provided triplets $\langle$source image, source text, target image$\rangle$. However, such triplet optimization may limit the ability of the learned retrieval model to capture more detailed ranking information, e.g., the triplets are one-to-one correspondences and they fail to account for many-to-many correspondences arising from semantic diversity in feedback languages and images. To capture more ranking information, we propose a novel ranking-aware uncertainty approach to model many-to-many correspondences by only using the provided triplets. We introduce uncertainty learning to learn the stochastic ranking list of features. Specifically, our approach mainly comprises three components: (1) In-sample uncertainty, which aims to capture semantic diversity using a Gaussian distribution derived from both combined and target features; (2) Cross-sample uncertainty, which further mines the ranking information from other samples' distributions; and (3) Distribution regularization, which aligns the distributional representations of source inputs and targeted image. Compared to the existing state-of-the-art methods, our proposed method achieves significant results on two public datasets for composed image retrieval.
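
The in-sample uncertainty component represents a feature as a Gaussian distribution rather than a single point. A common way to draw stochastic features from such a distribution is the reparameterization form, mean plus standard deviation times standard normal noise. The sketch below shows only that sampling step; how the paper derives the mean and variance from the combined and target features is not stated in the abstract, so the function name `sample_gaussian_feature` and the log-variance parameterization are assumptions.

```python
import math
import random


def sample_gaussian_feature(mean, log_var, rng):
    """Draw one stochastic feature vector from a diagonal Gaussian.

    `mean` and `log_var` are equal-length lists; log_var holds log(sigma^2).
    """
    return [
        # sigma = exp(0.5 * log_var); rng.gauss gives standard normal noise.
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


# Toy usage: several draws around the same mean give a set of plausible
# features for one query instead of a single fixed embedding.
rng = random.Random(0)
feature_mean = [0.5, -1.0, 2.0]
feature_log_var = [-2.0, -2.0, -2.0]
draws = [sample_gaussian_feature(feature_mean, feature_log_var, rng) for _ in range(4)]
```

Ranking targets against several such draws, rather than one point, is one way a model can express that multiple images may match the same query.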

MvCo-DoT: Multi-View Contrastive Domain Transfer Network for Medical Report Generation

Apr 15, 2023

Ruizhi Wang, Xiangtao Wang, Zhenghua Xu, Wenting Xu, Junyang Chen, Thomas Lukasiewicz

Abstract: In clinical scenarios, multiple medical images with different views are usually generated at the same time, and they have high semantic consistency. However, the existing medical report generation methods cannot exploit the rich multi-view mutual information of medical images. Therefore, in this work, we propose the first multi-view medical report generation model, called MvCo-DoT. Specifically, MvCo-DoT first proposes a multi-view contrastive learning (MvCo) strategy to help the deep reinforcement learning based model utilize the consistency of multi-view inputs for better model learning. Then, to close the performance gap between multi-view and single-view inputs, a domain transfer network is further proposed to ensure that MvCo-DoT achieves almost the same performance as with multi-view inputs while using only single-view inputs. Extensive experiments on the IU X-Ray public dataset show that MvCo-DoT outperforms the SOTA medical report generation baselines in all metrics.

* Received by ICASSP 2023
