We aim to solve, for the first time, the highly challenging task of generating continuous sign language videos solely from speech segments. Recent efforts in this space have focused on generating such videos from human-annotated text transcripts without considering other modalities. However, replacing speech with sign language is a practical way to communicate with people who have hearing loss. Therefore, we eliminate the need for text as input and design techniques that work for more natural, continuous, freely uttered speech covering an extensive vocabulary. Since the current datasets are inadequate for generating sign language directly from speech, we collect and release the first Indian sign language dataset comprising speech-level annotations, text transcripts, and the corresponding sign-language videos. Next, we propose a multi-tasking transformer network trained to generate a signer's poses from speech segments. With speech-to-text as an auxiliary task and an additional cross-modal discriminator, our model learns to generate continuous sign pose sequences in an end-to-end manner. Extensive experiments and comparisons with other baselines demonstrate the effectiveness of our approach. We also conduct additional ablation studies to analyze the effect of different modules of our network. A demo video containing several results is included in the supplementary material.
Graphs are a common model for complex relational data such as social networks and protein interactions, and such data can evolve over time (e.g., new friendships) and be noisy (e.g., unmeasured interactions). Link prediction aims to predict future edges or infer missing edges in the graph, and has diverse applications in recommender systems, experimental design, and complex systems. Even though link prediction algorithms strongly depend on the set of edges in the graph, existing approaches typically do not modify the graph topology to improve performance. Here, we demonstrate how simply adding a set of edges, which we call a \emph{proposal set}, to the graph as a pre-processing step can improve the performance of several link prediction algorithms. The underlying idea is that if the edges in the proposal set generally align with the structure of the graph, link prediction algorithms are further guided towards predicting the right edges; in other words, adding a proposal set of edges is a signal-boosting pre-processing step. We show how to use existing link prediction algorithms to generate effective proposal sets and evaluate this approach on various synthetic and empirical datasets. We find that proposal sets meaningfully improve the accuracy of link prediction algorithms based on both neighborhood heuristics and graph neural networks. Code is available at \url{https://github.com/CUAI/Edge-Proposal-Sets}.
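The pre-processing idea described above can be illustrated with a minimal sketch. It assumes a common-neighbors heuristic both for building the proposal set and for the final prediction; the function names and the toy graph are hypothetical and are not taken from the authors' repository.

```python
from itertools import combinations


def common_neighbors(adj, u, v):
    """Score a candidate pair by the number of neighbors it shares."""
    return len(adj[u] & adj[v])


def rank_non_edges(adj):
    """Rank all non-adjacent node pairs by common-neighbor score, best first."""
    pairs = [(u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]]
    return sorted(pairs, key=lambda p: (-common_neighbors(adj, *p), p))


def add_proposal_set(adj, proposal):
    """Return a copy of the graph with the proposal edges added."""
    new_adj = {u: set(nbrs) for u, nbrs in adj.items()}
    for u, v in proposal:
        new_adj[u].add(v)
        new_adj[v].add(u)
    return new_adj


# Toy undirected graph stored as an adjacency dict (hypothetical data).
adj = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {1, 2, 4},
    4: {3},
}

proposal = rank_non_edges(adj)[:1]           # step 1: build a small proposal set
augmented = add_proposal_set(adj, proposal)  # step 2: add it to the graph
predictions = rank_non_edges(augmented)      # step 3: predict on the augmented graph
```

In this toy graph the pair (0, 4) shares no neighbors before the proposal edge is added and shares one afterwards, which is the kind of signal boost the abstract describes.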
In frequency division duplex (FDD) multiple-input multiple-output (MIMO) wireless communications, limited channel state information (CSI) feedback is a central tool to support advanced single- and multi-user MIMO beamforming/precoding. To achieve a given CSI quality, the CSI quantization codebook size has to grow exponentially with the number of antennas, leading to quantization complexity as well as feedback overhead issues for larger MIMO systems. We have recently proposed a multi-stage recursive Grassmannian quantizer that enables a significant complexity reduction of CSI quantization. In this paper, we show that this recursive quantizer can effectively be combined with deep learning classification to further reduce the complexity, and that it can exploit temporal channel correlations to reduce the CSI feedback overhead.
Disseminating accurate travel time information to road users helps achieve traffic equilibrium and reduce traffic congestion. The deployment of Connected Vehicles technology will provide unique opportunities for the implementation of travel time prediction models. The aim of this study is twofold: (1) estimate travel times in the freeway network at five-minute intervals using Basic Safety Messages (BSMs); (2) develop an eXtreme Gradient Boosting (XGB) model for short-term travel time prediction on freeways. The XGB tree-based ensemble prediction model is evaluated against common tree-based ensemble algorithms, and the evaluations are performed at five-minute intervals over a 30-minute horizon. BSMs generated by the Safety Pilot Model Deployment conducted in Ann Arbor, Michigan, were used. Nearly two billion messages were processed to provide travel time estimates for the entire freeway network. A combination of grid search and five-fold cross-validation on the travel time estimates was used to develop the prediction models and tune their parameters. A freeway stretch of about 9.6 km was used to evaluate XGB together with the most common tree-based ensemble algorithms. The results show that XGB is superior to all other algorithms, followed by Gradient Boosting. XGB travel time predictions were accurate and consistent with variations during peak periods, with a mean absolute percentage error of about 5.9% and 7.8% for the 5-minute and 30-minute horizons, respectively. Additionally, when the developed models were applied to another 4.7 km stretch along the eastbound segment of M-14, XGB demonstrated considerable advantages in travel time prediction during both congested and uncongested conditions.
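For reference, the mean absolute percentage error reported above follows the standard definition, sketched here in a few lines. The travel times are hypothetical values chosen for illustration and are not data from the study.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical travel times in seconds for three five-minute intervals.
observed = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]

# The relative errors are 10%, 5%, and 0%, so the result is about 5 percent.
error_pct = mape(observed, forecast)
```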
We introduce a simple new method for visual imitation learning, which allows a novel robot manipulation task to be learned from a single human demonstration, without requiring any prior knowledge of the object being interacted with. Our method models imitation learning as a state estimation problem, with the state defined as the end-effector's pose at the point where object interaction begins, as observed from the demonstration. By then modelling a manipulation task as a coarse, approach trajectory followed by a fine, interaction trajectory, this state estimator can be trained in a self-supervised manner, by automatically moving the end-effector's camera around the object. At test time, the end-effector moves to the estimated state through a linear path, at which point the original demonstration's end-effector velocities are simply replayed. This enables convenient acquisition of a complex interaction trajectory, without actually needing to explicitly learn a policy. Real-world experiments on 8 everyday tasks show that our method can learn a diverse range of skills from a single human demonstration, whilst also yielding a stable and interpretable controller.
The quadratic computational and memory complexities of the Transformer's attention mechanism have limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function. Compared to a more traditional attention mechanism, Luna introduces an additional sequence with a fixed length as input and an additional corresponding output, which allows Luna to perform the attention operation linearly, while also storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety
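The pack-then-unpack structure described in this abstract can be sketched in plain Python. This is a simplified illustration only: it assumes single-head scaled dot-product attention and omits the learned projections, normalization, and residual connections a full layer would have. The names `attention` and `pack_unpack` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of equal-width vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out


def pack_unpack(x, p):
    """Pack n inputs into len(p) vectors, then unpack back to n outputs."""
    packed = attention(p, x, x)              # computes len(p) * n scores
    unpacked = attention(x, packed, packed)  # computes n * len(p) scores
    return unpacked, packed


# Hypothetical input of n = 6 vectors and an extra sequence of fixed length 2.
x = [[0.1 * i, 1.0 - 0.1 * i] for i in range(6)]
p = [[1.0, 0.0], [0.0, 1.0]]
y, new_p = pack_unpack(x, p)
```

Because the extra sequence has a fixed length, both attention calls compute a number of scores proportional to the input length n, rather than the n * n scores of full self-attention.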
By exploiting the superiority of non-orthogonal multiple access (NOMA), NOMA-aided mobile edge computing (MEC) can provide scalable and low-latency computing services for the Internet of Things. However, given the prevalent stochasticity of wireless networks and sophisticated signal processing of NOMA, it is critical but challenging to design an efficient task offloading algorithm for NOMA-aided MEC, especially under a large number of devices. This paper presents an online algorithm that jointly optimizes offloading decisions and resource allocation to maximize the long-term system utility (i.e., a measure of throughput and fairness). Since the optimization variables are temporally coupled, we first apply the Lyapunov technique to decouple the long-term stochastic optimization into a series of per-slot deterministic subproblems, which does not require any prior knowledge of network dynamics. Second, we propose to transform the non-convex per-slot subproblem of optimizing NOMA power allocation equivalently to a convex form by introducing a set of auxiliary variables, whereby the time complexity is reduced from exponential to $\mathcal{O}(M^{3/2})$. The proposed algorithm is proved to be asymptotically optimal, even under partial knowledge of the device states at the base station. Simulation results validate the superiority of the proposed algorithm in terms of system utility, stability improvement, and overhead reduction.
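To illustrate how a Lyapunov technique turns a long-term problem into per-slot decisions, the following is a generic drift-plus-penalty sketch for a single queue. It is not the algorithm proposed in the paper: the cost model, the discrete action set, and the names `choose_action`, `update_queue`, and `run` are all assumptions made for illustration.

```python
def choose_action(backlog, options, v):
    """Pick the (service, cost) pair minimizing v * cost - backlog * service.

    Only the current backlog is used, so no knowledge of future arrivals
    is required.
    """
    return min(options, key=lambda o: v * o[1] - backlog * o[0])


def update_queue(backlog, service, arrivals):
    """Serve up to `service` units, then admit the new arrivals."""
    return max(backlog - service, 0.0) + arrivals


def run(arrivals_per_slot, options, v):
    """Apply the per-slot rule and record (service, cost, backlog) per slot."""
    backlog = 0.0
    history = []
    for arrivals in arrivals_per_slot:
        service, cost = choose_action(backlog, options, v)
        backlog = update_queue(backlog, service, arrivals)
        history.append((service, cost, backlog))
    return history


# Hypothetical action set: (units served per slot, cost of serving them).
options = [(0.0, 0.0), (1.0, 1.0), (2.0, 3.0)]
history = run([2.0, 2.0, 2.0, 0.0], options, v=1.0)
```

The weight `v` trades cost against backlog: a larger `v` tolerates longer queues in exchange for lower average cost.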
An event happening in the world is often made of different activities and actions that can unfold simultaneously or sequentially within a few seconds. However, most large-scale datasets built to train models for action recognition provide a single label per video clip. Consequently, models can be incorrectly penalized for classifying actions that exist in the videos but are not explicitly labeled, and they do not learn the full spectrum of information needed to more completely comprehend different events and eventually learn causality between them. Towards this goal, we augmented the existing video dataset, Moments in Time (MiT), to include over two million action labels for over one million three-second videos. This multi-label dataset introduces novel challenges on how to train and analyze models for multi-action detection. Here, we present baseline results for multi-action recognition using loss functions adapted for long-tailed multi-label learning and provide improved methods for visualizing and interpreting models trained for multi-label action detection.
Open-domain neural dialogue models have achieved high performance in response ranking and evaluation tasks. These tasks are formulated as a binary classification of responses given in a dialogue context, and models generally learn to make predictions based on context-response content similarity. However, over-reliance on content similarity makes the models less sensitive to the presence of inconsistencies, incorrect time expressions and other factors important for response appropriateness and coherence. We propose approaches for automatically creating adversarial negative training data to help ranking and evaluation models learn features beyond content similarity. We propose mask-and-fill and keyword-guided approaches that generate negative examples for training more robust dialogue systems. These generated adversarial responses have high content similarity with the contexts but are either incoherent, inappropriate or not fluent. Our approaches are fully data-driven and can be easily incorporated in existing models and datasets. Experiments on classification, ranking and evaluation tasks across multiple datasets demonstrate that our approaches outperform strong baselines in providing informative negative examples for training dialogue systems.
Marginal structural models (MSMs) estimate the causal effect of a time-varying treatment in the presence of time-dependent confounding via weighted regression. The standard approach of using inverse probability of treatment weighting (IPTW) can lead to high-variance estimates due to extreme weights and be sensitive to model misspecification. Various methods have been proposed to partially address this, including truncation and stabilized-IPTW to temper extreme weights and covariate balancing propensity score (CBPS) to address treatment model misspecification. In this paper, we present Kernel Optimal Weighting (KOW), a convex-optimization-based approach that finds weights for fitting the MSM that optimally balance time-dependent confounders while simultaneously controlling for precision, directly addressing the above limitations. KOW directly minimizes the error in estimation due to time-dependent confounding via a new decomposition as a functional. We further extend KOW to control for informative censoring. We evaluate the performance of KOW in a simulation study, comparing it with IPTW, stabilized-IPTW, and CBPS. We demonstrate the use of KOW in studying the effect of treatment initiation on time-to-death among people living with HIV and the effect of negative advertising on elections in the United States.
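As background for the baselines named above, the sketch below shows how IPTW and stabilized-IPTW weights are formed for one subject from per-time-point treatment probabilities, and how truncation tempers extreme weights. It does not implement KOW. The probabilities are hypothetical inputs that would normally come from fitted treatment models, and the function names are illustrative.

```python
def iptw(denominators):
    """Inverse of the product of P(observed treatment | full history)."""
    weight = 1.0
    for p in denominators:
        weight /= p
    return weight


def stabilized_iptw(numerators, denominators):
    """Product over time of P(treatment | past treatment) / P(treatment | full history)."""
    weight = 1.0
    for num, den in zip(numerators, denominators):
        weight *= num / den
    return weight


def truncate(weights, lower, upper):
    """Clip weights into [lower, upper] to temper extreme values."""
    return [min(max(w, lower), upper) for w in weights]


# Hypothetical probabilities of the observed treatment at two time points.
denominators = [0.5, 0.25]  # conditional on the full covariate history
numerators = [0.5, 0.5]     # conditional on past treatment only

w_plain = iptw(denominators)                         # 8.0
w_stable = stabilized_iptw(numerators, denominators) # 2.0
clipped = truncate([0.1, 2.0, 50.0], 0.5, 10.0)      # [0.5, 2.0, 10.0]
```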