Data-driven methods have great advantages in modeling complicated human behavioral dynamics and in handling many human-robot interaction applications. However, collecting massive, annotated real-world human datasets is laborious, especially for highly interactive scenarios. On the other hand, algorithmic data generation methods are usually limited by their model capacities, making them unable to offer the realistic and diverse data needed by various application users. In this work, we study trajectory-level data generation for multi-human or human-robot interaction scenarios and propose a learning-based automatic trajectory generation model, which we call Multi-Agent TRajectory generation with dIverse conteXts (MATRIX). MATRIX is capable of generating interactive human behaviors in realistic, diverse contexts. We achieve this goal by modeling explicit and interpretable objectives so that MATRIX can generate human motions based on diverse destinations and heterogeneous behaviors. We carry out extensive comparison and ablation studies to illustrate the effectiveness of our approach across various metrics. We also present experiments that demonstrate the capability of MATRIX to serve as data augmentation for imitation-based motion planning.
In this work, we propose Mel-FullSubNet, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. Mel-FullSubNet takes as input the noisy and reverberant Mel-spectrogram and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can either be transformed into a speech waveform with a neural vocoder or be used directly for ASR. Mel-FullSubNet interleaves full-band and sub-band networks, which learn the full-band spectral pattern of signals and the sub-band/narrow-band properties of signals, respectively. Compared to linear-frequency domain or time-domain speech enhancement, the major advantage of Mel-spectrogram enhancement is that Mel-frequency presents speech in a more compact way and is thus easier to learn, which benefits both speech quality and ASR. Experimental results demonstrate a significant improvement in both speech quality and ASR performance achieved by the proposed model.
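The compactness argument can be made concrete with a small NumPy sketch of a triangular Mel filterbank. This is only an illustration of how a linear-frequency spectrum is folded into fewer Mel bands; it is not the Mel-FullSubNet front end. The HTK-style Mel formula and the settings (16 kHz audio, 512-point FFT, 80 Mel bands) are assumptions chosen for the example and do not come from the abstract.

```python
import numpy as np

def hz_to_mel(f_hz):
    # HTK-style Mel scale (an assumed convention; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return an (n_mels, n_fft // 2 + 1) matrix of triangular filters."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    fbank = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        # Triangle: rises from left to center, falls from center to right.
        fbank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

# Hypothetical settings, not taken from the abstract.
fbank = mel_filterbank(n_mels=80, n_fft=512, sample_rate=16000)
rng = np.random.default_rng(0)
power_spec = rng.random((257, 100))   # stand-in for |STFT|^2, shape (freq, frames)
mel_spec = fbank @ power_spec         # shape (80, 100): 257 bins folded into 80 bands
```

Under these assumed settings the network would have to predict 80 values per frame instead of 257, which is the sense in which the Mel representation is more compact.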
Early infancy is a rapid and dynamic neurodevelopmental period for behavior and neurocognition. Longitudinal magnetic resonance imaging (MRI) is an effective tool for investigating this crucial stage by capturing the developmental trajectories of brain structures. However, longitudinal MRI acquisition often suffers from severe missing data due to participant dropout and failed scans, making longitudinal infant brain atlas construction and developmental trajectory delineation quite challenging. Thanks to the development of AI-based generative models, neuroimage completion has become a powerful technique for retaining as much available data as possible. However, current image completion methods usually suffer from inconsistency within each individual subject along the time dimension, compromising the overall quality. To solve this problem, we propose a two-stage cascaded diffusion model, Cas-DiffCom, for dense and longitudinal 3D infant brain MRI completion and super-resolution. We apply the proposed method to the Baby Connectome Project (BCP) dataset. The experimental results show that Cas-DiffCom achieves both individual consistency and high fidelity in longitudinal infant brain image completion. We further apply the generated infant brain images to two downstream tasks, brain tissue segmentation and developmental trajectory delineation, to demonstrate its task-oriented potential in neuroscience.
By integrating the local voltage-controlled magnetic anisotropy (VCMA) effect, the Dzyaloshinskii-Moriya interaction (DMI) effect, and the spin-orbit torque (SOT) effect, we propose a novel device structure for a field-free magnetic tunnel junction (MTJ). Micromagnetic simulations show that the device utilizes the chiral symmetry breaking caused by the DMI effect to induce a non-collinear spin texture under the influence of the SOT current. This, combined with the perpendicular magnetic anisotropy (PMA) gradient generated by the local VCMA effect, enables deterministic switching of the MTJ state without an external field. The impact of variations in DMI strength and PMA gradient on the magnetization dynamics is analyzed.
Fourier ptychography (FP) imaging, drawing on the idea of synthetic aperture, has been demonstrated as a potential approach for remote sub-diffraction-limited imaging. Nevertheless, the farthest imaging distance is still limited to around 10 m, even though there has been significant improvement in macroscopic FP. The most severe obstacle to increasing the imaging distance is the field-of-view (FoV) limitation caused by the far-field condition for diffraction. Here, we propose to modify the Fourier far-field condition for rough reflective objects, aiming to overcome the small-FoV limitation by using a divergent beam to illuminate objects. A joint optimization of the pupil function and the target image is used to obtain an aberration-free image while simultaneously estimating the pupil function. Benefiting from the optimized reconstruction algorithm, which expands the camera's effective aperture, we experimentally implement several FP systems for imaging distances of 12 m, 90 m, and 170 m, with a maximum synthetic aperture of 200 mm. The maximum imaging distance and synthetic aperture are thus improved by more than one order of magnitude over state-of-the-art works, with a fourfold improvement in resolution. Our findings demonstrate significant potential for advancing the field of macroscopic FP, propelling it into a new stage of development.
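To give a feel for the scales involved, the following sketch evaluates two textbook estimates: the Fraunhofer far-field condition z >= 2 D_obj^2 / lambda, which bounds the object extent D_obj that can be treated as far-field at distance z, and the Rayleigh-type resolution 1.22 lambda z / D for an aperture of diameter D. The wavelength of 532 nm is an assumption (the abstract does not state one), constant factors differ between conventions, and the 200 mm aperture is applied to all three distances purely for illustration. The numbers are rough order-of-magnitude guides, not results from the paper.

```python
import math

def fraunhofer_max_extent(wavelength_m, distance_m):
    """Largest object extent D_obj satisfying z >= 2 * D_obj**2 / wavelength."""
    return math.sqrt(wavelength_m * distance_m / 2.0)

def rayleigh_resolution(wavelength_m, distance_m, aperture_m):
    """Rayleigh-type resolvable spacing at the object: 1.22 * wavelength * z / D."""
    return 1.22 * wavelength_m * distance_m / aperture_m

WAVELENGTH = 532e-9          # assumed illumination wavelength, not given in the text
SYNTHETIC_APERTURE = 0.200   # 200 mm maximum synthetic aperture, from the text

for z in (12.0, 90.0, 170.0):  # imaging distances from the text, in meters
    extent = fraunhofer_max_extent(WAVELENGTH, z)
    res = rayleigh_resolution(WAVELENGTH, z, SYNTHETIC_APERTURE)
    print(f"z = {z:5.0f} m: far-field extent ~ {extent * 1e3:.2f} mm, "
          f"resolution ~ {res * 1e6:.0f} um")
```

Under the assumed wavelength, the far-field extent stays in the millimeter range even at long distances, which illustrates why the standard far-field condition restricts the FoV so strongly.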
We tackle the network topology inference problem by utilizing Laplacian constrained Gaussian graphical models, which recast the task as estimating a precision matrix in the form of a graph Laplacian. Recent research \cite{ying2020nonconvex} has uncovered the limitations of the widely used $\ell_1$-norm in learning sparse graphs under this model: empirically, the number of nonzero entries in the solution grows with the regularization parameter of the $\ell_1$-norm; theoretically, a large regularization parameter leads to a fully connected (densest) graph. To overcome these challenges, we propose a graph Laplacian estimation method incorporating the $\ell_0$-norm constraint. An efficient gradient projection algorithm is developed to solve the resulting optimization problem, characterized by sparsity and Laplacian constraints. Through numerical experiments with synthetic and financial time-series datasets, we demonstrate the effectiveness of the proposed method in network topology inference.
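As an illustration of what a projection step under sparsity and Laplacian constraints might look like, the sketch below projects a vector of candidate edge weights onto the set {w >= 0, ||w||_0 <= k} and assembles the corresponding graph Laplacian L = D - W. This is a minimal sketch under stated assumptions and not the authors' algorithm: the function names, the edge ordering (upper-triangular, i < j), and the example values are hypothetical, and the likelihood gradient step that would precede the projection is omitted.

```python
import numpy as np

def project_sparse_nonneg(w, k):
    """Euclidean projection onto {w >= 0, ||w||_0 <= k}.

    Negative entries are clipped to zero, then only the k largest
    remaining entries are kept.
    """
    w = np.maximum(np.asarray(w, dtype=float), 0.0)  # new array, safe to modify
    if k < w.size:
        drop = np.argsort(w)[: w.size - k]  # indices of the smallest entries
        w[drop] = 0.0
    return w

def laplacian_from_weights(w, n_nodes):
    """Build L = D - W from weights of the upper-triangular edges (i < j)."""
    W = np.zeros((n_nodes, n_nodes))
    W[np.triu_indices(n_nodes, k=1)] = w
    W = W + W.T
    return np.diag(W.sum(axis=1)) - W

# Hypothetical example: 4 nodes, 6 candidate edges, at most 3 edges allowed.
w_raw = np.array([0.9, -0.2, 0.1, 0.5, 0.05, 0.7])
w_proj = project_sparse_nonneg(w_raw, k=3)   # keeps 0.9, 0.5 and 0.7
L = laplacian_from_weights(w_proj, n_nodes=4)
```

Unlike an $\ell_1$ penalty, the cardinality bound k fixes the maximum number of edges directly, so the sparsity level does not depend on tuning a regularization parameter.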
Federated learning (FL) inevitably confronts the challenge of system heterogeneity in practical scenarios. We propose a training scheme that extends the capabilities of most model-homogeneous FL methods to cope with this challenge. In this paper, we commence our study with a detailed exploration of homogeneous and heterogeneous FL settings and make three key observations: (1) client performance is positively correlated with layer similarities, (2) similarities are higher in the shallow layers than in the deep layers, and (3) smoother gradient distributions indicate higher layer similarities. Building upon these observations, we propose InCo Aggregation, which leverages internal cross-layer gradients, a mixture of gradients from shallow and deep layers within a server model, to augment the similarity in the deep layers without requiring additional communication between clients. Furthermore, our methods can be tailored to accommodate model-homogeneous FL methods such as FedAvg, FedProx, FedNova, Scaffold, and MOON, expanding their capabilities to handle system heterogeneity. Extensive experimental results validate the effectiveness of InCo Aggregation, spotlighting internal cross-layer gradients as a promising avenue for enhancing performance in heterogeneous FL.
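For context on why model-homogeneous methods struggle with system heterogeneity, the following sketch shows the standard FedAvg aggregation rule: the server averages client parameters layer by layer, weighted by local sample counts. It is not InCo Aggregation, whose details the abstract does not give. The dictionary layout and layer names are hypothetical.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Weighted average of client parameters, weights proportional to sample counts.

    Every client must provide the same layer names with the same shapes,
    which is exactly the model-homogeneity assumption.
    """
    total = float(sum(client_sizes))
    layer_names = client_params[0].keys()
    return {
        name: sum(p[name] * (n / total) for p, n in zip(client_params, client_sizes))
        for name in layer_names
    }

# Hypothetical example: two clients with 1 and 3 local samples.
clients = [
    {"layer1": np.ones((2, 2)), "layer2": np.ones(3)},
    {"layer1": 5.0 * np.ones((2, 2)), "layer2": 5.0 * np.ones(3)},
]
global_params = fedavg(clients, client_sizes=[1, 3])
# Each entry equals 0.25 * 1 + 0.75 * 5 = 4.0
```

Because the average is taken elementwise, it is undefined when clients train architectures of different depth or width, which is the gap that schemes for heterogeneous FL aim to close.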
Although lane detection methods have shown impressive performance in real-world scenarios, most methods require post-processing, which is not robust enough. Therefore, end-to-end detectors like DEtection TRansformer (DETR) have been introduced in lane detection. However, one-to-one label assignment in DETR can degrade the training efficiency due to label semantic conflicts. Besides, the positional query in DETR is unable to provide an explicit positional prior, making it difficult to optimize. In this paper, we present the One-to-Several Transformer (O2SFormer). We first propose the one-to-several label assignment, which combines one-to-many and one-to-one label assignment to solve label semantic conflicts while keeping end-to-end detection. To overcome the difficulty in optimizing one-to-one assignment, we further propose the layer-wise soft label, which dynamically adjusts the positive weight of positive lane anchors in different decoder layers. Finally, we design the dynamic anchor-based positional query to explore the positional prior by incorporating lane anchors into the positional query. Experimental results show that O2SFormer with a ResNet50 backbone achieves a 77.83% F1 score on the CULane dataset, outperforming existing Transformer-based and CNN-based detectors. Furthermore, O2SFormer converges 12.5x faster than DETR with the ResNet18 backbone.
Although lane detection methods have shown impressive performance in real-world scenarios, most methods require post-processing, which is not robust enough. Therefore, end-to-end detectors like DEtection TRansformer (DETR) have been introduced in lane detection. However, one-to-one label assignment in DETR can degrade the training efficiency due to label semantic conflicts. Besides, the positional query in DETR is unable to provide an explicit positional prior, making it difficult to optimize. In this paper, we present the One-to-Several Transformer (O2SFormer). We first propose the one-to-several label assignment, which combines one-to-one and one-to-many label assignments to improve the training efficiency while keeping end-to-end detection. To overcome the difficulty in optimizing one-to-one assignment, we further propose the layer-wise soft label, which adjusts the positive weight of positive lane anchors across different decoder layers. Finally, we design the dynamic anchor-based positional query to explore the positional prior by incorporating lane anchors into the positional query. Experimental results show that O2SFormer significantly speeds up the convergence of DETR and outperforms Transformer-based and CNN-based detectors on the CULane dataset. Code will be available at https://github.com/zkyseu/O2SFormer.
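The difference between the two assignment schemes that one-to-several assignment combines can be sketched on a toy cost matrix. The abstract does not specify how O2SFormer combines them, so the sketch below only shows the two ingredients. The cost values, function names, and the choice of k are hypothetical, and the brute-force search stands in for the Hungarian algorithm that a practical implementation would use (for example, SciPy's `linear_sum_assignment`).

```python
from itertools import permutations
import numpy as np

def one_to_one(cost):
    """Minimum-cost matching: each ground truth gets one distinct prediction.

    Brute force over permutations, so only suitable for tiny examples.
    """
    n_pred, n_gt = cost.shape
    best = min(
        permutations(range(n_pred), n_gt),
        key=lambda preds: sum(cost[p, g] for g, p in enumerate(preds)),
    )
    return {g: [p] for g, p in enumerate(best)}

def one_to_many(cost, k):
    """Each ground truth takes its k lowest-cost predictions (sets may overlap)."""
    return {
        g: sorted(np.argsort(cost[:, g])[:k].tolist())
        for g in range(cost.shape[1])
    }

# Hypothetical matching costs: rows are 4 predictions, columns are 2 ground-truth lanes.
cost = np.array([
    [0.10, 0.90],
    [0.20, 0.80],
    [0.90, 0.15],
    [0.70, 0.30],
])
print(one_to_one(cost))      # {0: [0], 1: [2]}
print(one_to_many(cost, 2))  # {0: [0, 1], 1: [2, 3]}
```

In the one-to-one case, prediction 1 is treated as a negative even though it is nearly as close to lane 0 as prediction 0, which illustrates the kind of label conflict the abstract refers to. One-to-many assignment gives more positive supervision but yields duplicate detections that normally require post-processing.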
Fusing a sequence of perfectly aligned images captured at various exposures has shown great potential for approaching High Dynamic Range (HDR) imaging with sensors of limited dynamic range. However, in the presence of large motion of scene objects or the camera, misalignment is almost inevitable and leads to the notorious ``ghost'' artifacts. Besides, factors such as noise in dark regions or color saturation in over-bright regions may also prevent local image details from being recovered in the HDR image. This paper proposes a novel multi-exposure fusion model based on the Swin Transformer. In particular, we design feature selection gates, which are integrated with the feature extraction layers to detect outliers and block them from HDR image synthesis. To reconstruct the missing local details from well-aligned and properly exposed regions, we exploit long-distance contextual dependencies in the exposure-space pyramid through the self-attention mechanism. Extensive numerical and visual evaluations have been conducted on a variety of benchmark datasets. The experiments show that our model achieves accuracy on par with current top-performing multi-exposure HDR imaging models while being more efficient.