We demonstrate ViDA-MAN, a digital-human agent for multi-modal interaction, which offers realtime audio-visual responses to instant speech inquiries. Compared to traditional text or voice-based system, ViDA-MAN offers human-like interactions (e.g, vivid voice, natural facial expression and body gestures). Given a speech request, the demonstration is able to response with high quality videos in sub-second latency. To deliver immersive user experience, ViDA-MAN seamlessly integrates multi-modal techniques including Acoustic Speech Recognition (ASR), multi-turn dialog, Text To Speech (TTS), talking heads video generation. Backed with large knowledge base, ViDA-MAN is able to chat with users on a number of topics including chit-chat, weather, device control, News recommendations, booking hotels, as well as answering questions via structured knowledge.
Gradient-based Bi-Level Optimization (BLO) methods have been widely applied to solve modern machine learning problems. However, most existing solution strategies are theoretically designed based on restrictive assumptions (e.g., convexity of the lower-level sub-problem), and computationally not applicable for high-dimensional tasks. Moreover, there are almost no gradient-based methods that can efficiently handle BLO in those challenging scenarios, such as BLO with functional constraints and pessimistic BLO. In this work, by reformulating BLO into an approximated single-level problem based on the value-function, we provide a new method, named Bi-level Value-Function-based Sequential Minimization (BVFSM), to partially address the above issues. To be specific, BVFSM constructs a series of value-function-based approximations, and thus successfully avoids the repeated calculations of recurrent gradient and Hessian inverse required by existing approaches, which are time-consuming (especially for high-dimensional tasks). We also extend BVFSM to address BLO with additional upper- and lower-level functional constraints. More importantly, we demonstrate that the algorithmic framework of BVFSM can also be used for the challenging pessimistic BLO, which has never been properly solved by existing gradient-based methods. On the theoretical side, we strictly prove the convergence of BVFSM on these types of BLO, in which the restrictive lower-level convexity assumption is completely discarded. To our best knowledge, this is the first gradient-based algorithm that can solve different kinds of BLO problems (e.g., optimistic, pessimistic and with constraints) all with solid convergence guarantees. Extensive experiments verify our theoretical investigations and demonstrate the superiority of BVFSM on various real-world applications.
In recent years, Bi-Level Optimization (BLO) techniques have received extensive attentions from both learning and vision communities. A variety of BLO models in complex and practical tasks are of non-convex follower structure in nature (a.k.a., without Lower-Level Convexity, LLC for short). However, this challenging class of BLOs is lack of developments on both efficient solution strategies and solid theoretical guarantees. In this work, we propose a new algorithmic framework, named Initialization Auxiliary and Pessimistic Trajectory Truncated Gradient Method (IAPTT-GM), to partially address the above issues. In particular, by introducing an auxiliary as initialization to guide the optimization dynamics and designing a pessimistic trajectory truncation operation, we construct a reliable approximate version of the original BLO in the absence of LLC hypothesis. Our theoretical investigations establish the convergence of solutions returned by IAPTT-GM towards those of the original BLO without LLC. As an additional bonus, we also theoretically justify the quality of our IAPTT-GM embedded with Nesterov's accelerated dynamics under LLC. The experimental results confirm both the convergence of our algorithm without LLC, and the theoretical findings under LLC.
Reconfigurable Intelligent Surface (RIS) is a revolutionizing approach to provide cost-effective yet energy-efficient communications. The transmit beamforming of the base station (BS) and discrete phase shifts of the RIS are jointly optimized to provide high quality of service. However, existing works ignore the high dependence between the large number of phase shifts and estimate them separately, consequently, easily getting trapped into local optima. To investigate the number and distribution of local optima, we conduct a fitness landscape analysis on the sum rate maximization problems. Two landscape features, the fitness distribution correlation and autocorrelation, are employed to investigate the ruggedness of landscape. The investigation results indicate that the landscape exhibits a rugged, multi-modal structure, i.e., has many local peaks, particularly in the cases with large-scale RISs. To handle the multi-modal landscape structure, we propose a novel niching genetic algorithm to solve the sum rate maximization problem. Particularly, a niching technique, nearest-better clustering, is incorporated to partition the population into several neighborhood species, thereby locating multiple local optima and enhance the global search ability. We also present a minimum species size to further improve the convergence speed. Simulation results demonstrate that our method achieves significant capacity gains compared to existing algorithms, particularly in the cases with large-scale RISs.
Empirical mode decomposition (EMD) has developed into a prominent tool for adaptive, scale-based signal analysis in various fields like robotics, security and biomedical engineering. Since the dramatic increase in amount of data puts forward higher requirements for the capability of real-time signal analysis, it is difficult for existing EMD and its variants to trade off the growth of data dimension and the speed of signal analysis. In order to decompose multi-dimensional signals at a faster speed, we present a novel signal-serialization method (serial-EMD), which concatenates multi-variate or multi-dimensional signals into a one-dimensional signal and uses various one-dimensional EMD algorithms to decompose it. To verify the effects of the proposed method, synthetic multi-variate time series, artificial 2D images with various textures and real-world facial images are tested. Compared with existing multi-EMD algorithms, the decomposition time becomes significantly reduced. In addition, the results of facial recognition with Intrinsic Mode Functions (IMFs) extracted using our method can achieve a higher accuracy than those obtained by existing multi-EMD algorithms, which demonstrates the superior performance of our method in terms of the quality of IMFs. Furthermore, this method can provide a new perspective to optimize the existing EMD algorithms, that is, transforming the structure of the input signal rather than being constrained by developing envelope computation techniques or signal decomposition methods. In summary, the study suggests that the serial-EMD technique is a highly competitive and fast alternative for multi-dimensional signal analysis.
Bi-level optimization model is able to capture a wide range of complex learning tasks with practical interest. Due to the witnessed efficiency in solving bi-level programs, gradient-based methods have gained popularity in the machine learning community. In this work, we propose a new gradient-based solution scheme, namely, the Bi-level Value-Function-based Interior-point Method (BVFIM). Following the main idea of the log-barrier interior-point scheme, we penalize the regularized value function of the lower level problem into the upper level objective. By further solving a sequence of differentiable unconstrained approximation problems, we consequently derive a sequential programming scheme. The numerical advantage of our scheme relies on the fact that, when gradient methods are applied to solve the approximation problem, we successfully avoid computing any expensive Hessian-vector or Jacobian-vector product. We prove the convergence without requiring any convexity assumption on either the upper level or the lower level objective. Experiments demonstrate the efficiency of the proposed BVFIM on non-convex bi-level problems.
Gridless methods show great superiority in line spectral estimation. These methods need to solve an atomic $l_0$ norm (i.e., the continuous analog of $l_0$ norm) minimization problem to estimate frequencies and model order. Since this problem is NP-hard to compute, relaxations of atomic $l_0$ norm, such as nuclear norm and reweighted atomic norm, have been employed for promoting sparsity. However, the relaxations give rise to a resolution limit, subsequently leading to biased model order and convergence error. To overcome the above shortcomings of relaxation, we propose a novel idea of simultaneously estimating the frequencies and model order by means of the atomic $l_0$ norm. To accomplish this idea, we build a multiobjective optimization model. The measurment error and the atomic $l_0$ norm are taken as the two optimization objectives. The proposed model directly exploits the model order via the atomic $l_0$ norm, thus breaking the resolution limit. We further design a variable-length evolutionary algorithm to solve the proposed model, which includes two innovations. One is a variable-length coding and search strategy. It flexibly codes and interactively searches diverse solutions with different model orders. These solutions act as steppingstones that help fully exploring the variable and open-ended frequency search space and provide extensive potentials towards the optima. Another innovation is a model order pruning mechanism, which heuristically prunes less contributive frequencies within the solutions, thus significantly enhancing convergence and diversity. Simulation results confirm the superiority of our approach in both frequency estimation and model order selection.
The source number identification is an essential step in direction-of-arrival (DOA) estimation. Existing methods may provide a wrong source number due to inferior statistical properties (in low SNR or limited snapshots) or modeling errors (caused by relaxing sparse penalties), especially in impulsive noise. To address this issue, we propose a novel idea of simultaneous source number identification and DOA estimation. We formulate a multiobjective off-grid DOA estimation model to realize this idea, by which the source number can be automatically identified together with DOA estimation. In particular, the source number is properly exploited by the $l_0$ norm of impinging signals without relaxations, guaranteeing accuracy. Furthermore, we design a multiobjective bilevel evolutionary algorithm to solve the proposed model. The source number identification and sparse recovery are simultaneously optimized at the on-grid (lower) level. A forward search strategy is developed to further refine the grid at the off-grid (upper) level. This strategy does not need linear approximations and can eliminate the off-grid gap with low computational complexity. Simulation results demonstrate the outperformance of our method in terms of source number and root mean square error.
Goal-conditioned hierarchical reinforcement learning (HRL) serves as a successful approach to solving complex and temporally extended tasks. Recently, its success has been extended to more general settings by concurrently learning hierarchical policies and subgoal representations. However, online subgoal representation learning exacerbates the non-stationary issue of HRL and introduces challenges for exploration in high-level policy learning. In this paper, we propose a state-specific regularization that stabilizes subgoal embeddings in well-explored areas while allowing representation updates in less explored state regions. Benefiting from this stable representation, we design measures of novelty and potential for subgoals, and develop an efficient hierarchical exploration strategy that actively seeks out new promising subgoals and states. Experimental results show that our method significantly outperforms state-of-the-art baselines in continuous control tasks with sparse rewards and further demonstrate the stability and efficiency of the subgoal representation learning of this work, which promotes superior policy learning.
The paper proposes a new text recognition network for scene-text images. Many state-of-the-art methods employ the attention mechanism either in the text encoder or decoder for the text alignment. Although the encoder-based attention yields promising results, these schemes inherit noticeable limitations. They perform the feature extraction (FE) and visual attention (VA) sequentially, which bounds the attention mechanism to rely only on the FE final single-scale output. Moreover, the utilization of the attention process is limited by only applying it directly to the single scale feature-maps. To address these issues, we propose a new multi-scale and encoder-based attention network for text recognition that performs the multi-scale FE and VA in parallel. The multi-scale channels also undergo regular fusion with each other to develop the coordinated knowledge together. Quantitative evaluation and robustness analysis on the standard benchmarks demonstrate that the proposed network outperforms the state-of-the-art in most cases.