Sherman
Abstract:The high-accuracy and resource-intensive deep neural networks (DNNs) have been widely adopted by live video analytics (VA), where camera videos are streamed over the network to resource-rich edge/cloud servers for DNN inference. Common video encoding configurations (e.g., resolution and frame rate) have been identified with significant impacts on striking the balance between bandwidth consumption and inference accuracy and therefore their adaption scheme has been a focus of optimization. However, previous profiling-based solutions suffer from high profiling cost, while existing deep reinforcement learning (DRL) based solutions may achieve poor performance due to the usage of fixed reward function for training the agent, which fails to craft the application goals in various scenarios. In this paper, we propose ILCAS, the first imitation learning (IL) based configuration-adaptive VA streaming system. Unlike DRL-based solutions, ILCAS trains the agent with demonstrations collected from the expert which is designed as an offline optimal policy that solves the configuration adaption problem through dynamic programming. To tackle the challenge of video content dynamics, ILCAS derives motion feature maps based on motion vectors which allow ILCAS to visually ``perceive'' video content changes. Moreover, ILCAS incorporates a cross-camera collaboration scheme to exploit the spatio-temporal correlations of cameras for more proper configuration selection. Extensive experiments confirm the superiority of ILCAS compared with state-of-the-art solutions, with 2-20.9% improvement of mean accuracy and 19.9-85.3% reduction of chunk upload lag.
Abstract:Generative Diffusion Models (GDMs) have emerged as a transformative force in the realm of Generative Artificial Intelligence (GAI), demonstrating their versatility and efficacy across a variety of applications. The ability to model complex data distributions and generate high-quality samples has made GDMs particularly effective in tasks such as image generation and reinforcement learning. Furthermore, their iterative nature, which involves a series of noise addition and denoising steps, is a powerful and unique approach to learning and generating data. This paper serves as a comprehensive tutorial on applying GDMs in network optimization tasks. We delve into the strengths of GDMs, emphasizing their wide applicability across various domains, such as vision, text, and audio generation.We detail how GDMs can be effectively harnessed to solve complex optimization problems inherent in networks. The paper first provides a basic background of GDMs and their applications in network optimization. This is followed by a series of case studies, showcasing the integration of GDMs with Deep Reinforcement Learning (DRL), incentive mechanism design, Semantic Communications (SemCom), Internet of Vehicles (IoV) networks, etc. These case studies underscore the practicality and efficacy of GDMs in real-world scenarios, offering insights into network design. We conclude with a discussion on potential future directions for GDM research and applications, providing major insights into how they can continue to shape the future of network optimization.




Abstract:Limited by expensive pixel-level labels, polyp segmentation models are plagued by data shortage and suffer from impaired generalization. In contrast, polyp bounding box annotations are much cheaper and more accessible. Thus, to reduce labeling cost, we propose to learn a weakly supervised polyp segmentation model (i.e., WeakPolyp) completely based on bounding box annotations. However, coarse bounding boxes contain too much noise. To avoid interference, we introduce the mask-to-box (M2B) transformation. By supervising the outer box mask of the prediction instead of the prediction itself, M2B greatly mitigates the mismatch between the coarse label and the precise prediction. But, M2B only provides sparse supervision, leading to non-unique predictions. Therefore, we further propose a scale consistency (SC) loss for dense supervision. By explicitly aligning predictions across the same image at different scales, the SC loss largely reduces the variation of predictions. Note that our WeakPolyp is a plug-and-play model, which can be easily ported to other appealing backbones. Besides, the proposed modules are only used during training, bringing no computation cost to inference. Extensive experiments demonstrate the effectiveness of our proposed WeakPolyp, which surprisingly achieves a comparable performance with a fully supervised model, requiring no mask annotations at all.
Abstract:Extremely large-scale multiple-input-multiple-output (XL-MIMO), which offers vast spatial degrees of freedom, has emerged as a potentially pivotal enabling technology for the sixth generation (6G) of wireless mobile networks. With its growing significance, both opportunities and challenges are concurrently manifesting. This paper presents a comprehensive survey of research on XL-MIMO wireless systems. In particular, we introduce four XL-MIMO hardware architectures: uniform linear array (ULA)-based XL-MIMO, uniform planar array (UPA)-based XL-MIMO utilizing either patch antennas or point antennas, and continuous aperture (CAP)-based XL-MIMO. We comprehensively analyze and discuss their characteristics and interrelationships. Following this, we examine exact and approximate near-field channel models for XL-MIMO. Given the distinct electromagnetic properties of near-field communications, we present a range of channel models to demonstrate the benefits of XL-MIMO. We further motivate and discuss low-complexity signal processing schemes to promote the practical implementation of XL-MIMO. Furthermore, we explore the interplay between XL-MIMO and other emergent 6G technologies. Finally, we outline several compelling research directions for future XL-MIMO wireless communication systems.




Abstract:This paper studies the over-the-air computation (AirComp) in an orthogonal frequency division multiplexing (OFDM) system with imperfect channel state information (CSI), in which multiple single-antenna wireless devices (WDs) simultaneously send uncoded signals to a multi-antenna access point (AP) for distributed functional computation over multiple subcarriers. In particular, we consider two scenarios with best-effort and error-constrained computation tasks, with the objectives of minimizing the average computation mean squared error (MSE) and the computation outage probability over the multiple subcarriers, respectively. Towards this end, we jointly optimize the transmit coefficients at the WDs and the receive beamforming vectors at the AP over subcarriers, subject to the maximum transmit power constraints at individual WDs. First, for the special case with a single receive antenna at the AP, we propose the semi-closed-form globally optimal solutions to the two problems using the Lagrange-duality method. It is shown that at each subcarrier, the WDs' optimized power control policy for average MSE minimization follows a regularized channel inversion structure, while that for computation outage probability minimization follows an on-off regularized channel inversion, with the regularization dependent on the transmit power budget and channel estimation error. Next, for the general case with multiple receive antennas at the AP, we present efficient algorithms based on alternating optimization and convex optimization to find converged solutions to both problems.
Abstract:In this paper, we introduce a realistic and challenging domain adaptation problem called Universal Semi-supervised Model Adaptation (USMA), which i) requires only a pre-trained source model, ii) allows the source and target domain to have different label sets, i.e., they share a common label set and hold their own private label set, and iii) requires only a few labeled samples in each class of the target domain. To address USMA, we propose a collaborative consistency training framework that regularizes the prediction consistency between two models, i.e., a pre-trained source model and its variant pre-trained with target data only, and combines their complementary strengths to learn a more powerful model. The rationale of our framework stems from the observation that the source model performs better on common categories than the target-only model, while on target-private categories, the target-only model performs better. We also propose a two-perspective, i.e., sample-wise and class-wise, consistency regularization to improve the training. Experimental results demonstrate the effectiveness of our method on several benchmark datasets.




Abstract:This paper studies a multi-intelligent-reflecting-surface-(IRS)-enabled integrated sensing and communications (ISAC) system, in which multiple IRSs are installed to help the base station (BS) provide ISAC services at separate line-of-sight (LoS) blocked areas. We focus on the scenario with semi-passive uniform linear array (ULA) IRSsfor sensing, in which each IRS is integrated with dedicated sensors for processing echo signals, and each IRS simultaneously serves one sensing target and one communication user (CU) in its coverage area. In particular, we suppose that the BS sends combined information and dedicated sensing signals for ISAC, and we consider two cases with point and extended targets, in which each IRS aims to estimate the direction-of-arrival (DoA) of the corresponding target and the complete target response matrix, respectively. Under this setup, we first derive the closed-form Cram{\'e}r-Rao bounds (CRBs) for parameters estimation under the two target models. For the point target case, the CRB for AoA estimation is shown to be inversely proportional to the cubic of the number of sensors at each IRS, while for the extended target case, the CRB for target response matrix estimation is proportional to the number of IRS sensors. Next, we consider two different types of CU receivers that can and cannot cancel the interference from dedicated sensing signals prior to information decoding. To achieve fair and optimized sensing performance, we minimize the maximum CRB at all IRSs for the two target cases, via jointly optimizing the transmit beamformers at the BS and the reflective beamformers at the multiple IRSs, subject to the minimum signal-to-interference-plus-noise ratio (SINR) constraints at individual CUs, the maximum transmit power constraint at the BS, and the unit-modulus constraints at the multiple IRSs.
Abstract:This paper investigates the energy efficiency of a multiple-input multiple-output (MIMO) integrated sensing and communications (ISAC) system, in which one multi-antenna base station (BS) transmits unified ISAC signals to a multi-antenna communication user (CU) and at the same time use the echo signals to estimate an extended target. We focus on one particular ISAC transmission block and take into account the practical on-off non-transmission power at the BS. Under this setup, we minimize the energy consumption at the BS while ensuring a minimum average data rate requirement for communication and a maximum Cram\'er-Rao bound (CRB) requirement for target estimation, by jointly optimizing the transmit covariance matrix and the ``on'' duration for active transmission. We obtain the optimal solution to the rate-and-CRB-constrained energy minimization problem in a semi-closed form. Interestingly, the obtained optimal solution is shown to unify the spectrum-efficient and energy-efficient communications and sensing designs. In particular, for the special MIMO sensing case with rate constraint inactive, the optimal solution follows the isotropic transmission with shortest ``on'' duration, in which the BS radiates the required sensing energy by using sufficiently high power over the shortest duration. For the general ISAC case, the optimal transmit covariance solution is of full rank and follows the eigenmode transmission based on the communication channel, while the optimal ``on'' duration is determined based on both the rate and CRB constraints. Numerical results show that the proposed ISAC design achieves significantly reduced energy consumption as compared to the benchmark schemes based on isotropic transmission, always-on transmission, and sensing or communications only designs, especially when the rate and CRB constraints become stringent.




Abstract:High-quality data is essential for conversational recommendation systems and serves as the cornerstone of the network architecture development and training strategy design. Existing works contribute heavy human efforts to manually labeling or designing and extending recommender dialogue templates. However, they suffer from (i) the limited number of human annotators results in that datasets can hardly capture rich and large-scale cases in the real world, (ii) the limited experience and knowledge of annotators account for the uninformative corpus and inappropriate recommendations. In this paper, we propose a novel automatic dataset synthesis approach that can generate both large-scale and high-quality recommendation dialogues through a data2text generation process, where unstructured recommendation conversations are generated from structured graphs based on user-item information from the real world. In doing so, we comprehensively exploit: (i) rich personalized user profiles from traditional recommendation datasets, (ii) rich external knowledge from knowledge graphs, and (iii) the conversation ability contained in human-to-human conversational recommendation datasets. Extensive experiments validate the benefit brought by the automatically synthesized data under low-resource scenarios and demonstrate the promising potential to facilitate the development of a more effective conversational recommendation system.




Abstract:Accurate polyp detection is essential for assisting clinical rectal cancer diagnoses. Colonoscopy videos contain richer information than still images, making them a valuable resource for deep learning methods. Great efforts have been made to conduct video polyp detection through multi-frame temporal/spatial aggregation. However, unlike common fixed-camera video, the camera-moving scene in colonoscopy videos can cause rapid video jitters, leading to unstable training for existing video detection models. Additionally, the concealed nature of some polyps and the complex background environment further hinder the performance of existing video detectors. In this paper, we propose the \textbf{YONA} (\textbf{Y}ou \textbf{O}nly \textbf{N}eed one \textbf{A}djacent Reference-frame) method, an efficient end-to-end training framework for video polyp detection. YONA fully exploits the information of one previous adjacent frame and conducts polyp detection on the current frame without multi-frame collaborations. Specifically, for the foreground, YONA adaptively aligns the current frame's channel activation patterns with its adjacent reference frames according to their foreground similarity. For the background, YONA conducts background dynamic alignment guided by inter-frame difference to eliminate the invalid features produced by drastic spatial jitters. Moreover, YONA applies cross-frame contrastive learning during training, leveraging the ground truth bounding box to improve the model's perception of polyp and background. Quantitative and qualitative experiments on three public challenging benchmarks demonstrate that our proposed YONA outperforms previous state-of-the-art competitors by a large margin in both accuracy and speed.