Noteworthy strides continue to be made in the development of full-duplex millimeter wave (mmWave) communication systems, but most of this progress has been built on theoretical models and validated through simulation. In this work, we conduct a long overdue real-world evaluation of full-duplex mmWave systems using off-the-shelf 60 GHz phased arrays. Using an experimental full-duplex base station, we collect over 200,000 measurements of self-interference by electronically sweeping its transmit and receive beams across a dense spatial profile, shedding light on the effects of the environment, array positioning, and beam steering direction. We then call attention to five key challenges faced by practical full-duplex mmWave systems and, with these in mind, propose a general framework for beamforming-based full-duplex solutions. Guided by this framework, we introduce a novel solution called STEER+, a more robust version of recent work called STEER, and experimentally evaluate both in a real-world setting with actual downlink and uplink users. Rather than purely minimize self-interference as with STEER, STEER+ makes use of additional measurements to maximize spectral efficiency, which proves to make it much less sensitive to one's choice of design parameters. We experimentally show that STEER+ can reliably reduce self-interference to near or below the noise floor while maintaining high SNR on the downlink and uplink, thus enabling full-duplex operation purely via beamforming.
This work proposes a novel face-swapping framework FlowFace++, utilizing explicit semantic flow supervision and end-to-end architecture to facilitate shape-aware face-swapping. Specifically, our work pretrains a facial shape discriminator to supervise the face swapping network. The discriminator is shape-aware and relies on a semantic flow-guided operation to explicitly calculate the shape discrepancies between the target and source faces, thus optimizing the face swapping network to generate highly realistic results. The face swapping network is a stack of a pre-trained face-masked autoencoder (MAE), a cross-attention fusion module, and a convolutional decoder. The MAE provides a fine-grained facial image representation space, which is unified for the target and source faces and thus facilitates final realistic results. The cross-attention fusion module carries out the source-to-target face swapping in a fine-grained latent space while preserving other attributes of the target image (e.g. expression, head pose, hair, background, illumination, etc). Lastly, the convolutional decoder further synthesizes the swapping results according to the face-swapping latent embedding from the cross-attention fusion module. Extensive quantitative and qualitative experiments on in-the-wild faces demonstrate that our FlowFace++ outperforms the state-of-the-art significantly, particularly while the source face is obstructed by uneven lighting or angle offset.
Instead of relying on human-annotated training samples to build a classifier, weakly supervised scientific paper classification aims to classify papers only using category descriptions (e.g., category names, category-indicative keywords). Existing studies on weakly supervised paper classification are less concerned with two challenges: (1) Papers should be classified into not only coarse-grained research topics but also fine-grained themes, and potentially into multiple themes, given a large and fine-grained label space; and (2) full text should be utilized to complement the paper title and abstract for classification. Moreover, instead of viewing the entire paper as a long linear sequence, one should exploit the structural information such as citation links across papers and the hierarchy of sections and paragraphs in each paper. To tackle these challenges, in this study, we propose FUTEX, a framework that uses the cross-paper network structure and the in-paper hierarchy structure to classify full-text scientific papers under weak supervision. A network-aware contrastive fine-tuning module and a hierarchy-aware aggregation module are designed to leverage the two types of structural signals, respectively. Experiments on two benchmark datasets demonstrate that FUTEX significantly outperforms competitive baselines and is on par with fully supervised classifiers that use 1,000 to 60,000 ground-truth training samples.
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
Robots are increasingly being deployed not only in workplaces but also in households. Effectively execute of manipulation tasks by robots relies on variable impedance control with contact forces. Furthermore, robots should possess adaptive capabilities to handle the considerable variations exhibited by different robotic tasks in dynamic environments, which can be obtained through human demonstrations. This paper presents a learning-from-demonstration framework that integrates force sensing and motion information to facilitate variable impedance control. The proposed approach involves the estimation of full stiffness matrices from human demonstrations, which are then combined with sensed forces and motion information to create a model using the non-parametric method. This model allows the robot to replicate the demonstrated task while also responding appropriately to new task conditions through the use of the state-dependent stiffness profile. Additionally, a novel tank based variable impedance control approach is proposed to ensure passivity by using the learned stiffness. The proposed approach was evaluated using two virtual variable stiffness systems. The first evaluation demonstrates that the stiffness estimated approach exhibits superior robustness compared to traditional methods when tested on manual datasets, and the second evaluation illustrates that the novel tank based approach is more easily implementable compared to traditional variable impedance control approaches.
Large pre-trained speech models are widely used as the de-facto paradigm, especially in scenarios when there is a limited amount of labeled data available. However, finetuning all parameters from the self-supervised learned model can be computationally expensive, and becomes infeasiable as the size of the model and the number of downstream tasks scales. In this paper, we propose a novel approach called Two Parallel Adapter (TPA) that is inserted into the conformer-based model pre-trained model instead. TPA is based on systematic studies of the residual adapter, a popular approach for finetuning a subset of parameters. We evaluate TPA on various public benchmarks and experiment results demonstrates its superior performance, which is close to the full finetuning on different datasets and speech tasks. These results show that TPA is an effective and efficient approach for serving large pre-trained speech models. Ablation studies show that TPA can also be pruned, especially for lower blocks.
Personalized dialogue agents (DAs) powered by large pre-trained language models (PLMs) often rely on explicit persona descriptions to maintain personality consistency. However, such descriptions may not always be available or may pose privacy concerns. To tackle this bottleneck, we introduce PersonaPKT, a lightweight transfer learning approach that can build persona-consistent dialogue models without explicit persona descriptions. By representing each persona as a continuous vector, PersonaPKT learns implicit persona-specific features directly from a small number of dialogue samples produced by the same persona, adding less than 0.1% trainable parameters for each persona on top of the PLM backbone. Empirical results demonstrate that PersonaPKT effectively builds personalized DAs with high storage efficiency, outperforming various baselines in terms of persona consistency while maintaining good response generation quality. In addition, it enhances privacy protection by avoiding explicit persona descriptions. Overall, PersonaPKT is an effective solution for creating personalized DAs that respect user privacy.
Learning and analysis of network robustness, including controllability robustness and connectivity robustness, is critical for various networked systems against attacks. Traditionally, network robustness is determined by attack simulations, which is very time-consuming and even incapable for large-scale networks. Network Robustness Learning, which is dedicated to learning network robustness with high precision and high speed, provides a powerful tool to analyze network robustness by replacing simulations. In this paper, a novel versatile and unified robustness learning approach via graph transformer (NRL-GT) is proposed, which accomplishes the task of controllability robustness learning and connectivity robustness learning from multiple aspects including robustness curve learning, overall robustness learning, and synthetic network classification. Numerous experiments show that: 1) NRL-GT is a unified learning framework for controllability robustness and connectivity robustness, demonstrating a strong generalization ability to ensure high precision when training and test sets are distributed differently; 2) Compared to the cutting-edge methods, NRL-GT can simultaneously perform network robustness learning from multiple aspects and obtains superior results in less time. NRL-GT is also able to deal with complex networks of different size with low learning error and high efficiency; 3) It is worth mentioning that the backbone of NRL-GT can serve as a transferable feature learning module for complex networks of different size and different downstream tasks.
Few-shot question answering (QA) aims at precisely discovering answers to a set of questions from context passages while only a few training samples are available. Although existing studies have made some progress and can usually achieve proper results, they suffer from understanding deep semantics for reasoning out the questions. In this paper, we develop Gotta, a Generative prOmpT-based daTa Augmentation framework to mitigate the challenge above. Inspired by the human reasoning process, we propose to integrate the cloze task to enhance few-shot QA learning. Following the recent success of prompt-tuning, we present the cloze task in the same format as the main QA task, allowing the model to learn both tasks seamlessly together to fully take advantage of the power of prompt-tuning. Extensive experiments on widely used benchmarks demonstrate that Gotta consistently outperforms competitive baselines, validating the effectiveness of our proposed prompt-tuning-based cloze task, which not only fine-tunes language models but also learns to guide reasoning in QA tasks. Further analysis shows that the prompt-based loss incorporates the auxiliary task better than the multi-task loss, highlighting the strength of prompt-tuning on the few-shot QA task.