In this work, we investigate the performance of mainstream neural generative models on the task of face swapping. We experiment with CVAE, CGAN, CVAE-GAN, and conditional diffusion models. Existing finely trained models can already produce fake faces (Facke) that are indistinguishable to the naked eye and achieve high objective metrics. We compare these models and analyze their pros and cons. Furthermore, we propose several promising techniques, although they turn out not to apply well to this task.
Previous deep video compression approaches use only a single-scale motion compensation strategy and rarely adopt the mode prediction technique of traditional standards such as H.264/H.265 for motion and residual compression. In this work, we first propose a coarse-to-fine (C2F) deep video compression framework for better motion compensation, in which we perform motion estimation, compression, and compensation twice, in a coarse-to-fine manner. Our C2F framework achieves better motion compensation results without significantly increasing bit costs. Observing that the hyperprior information (i.e., the mean and variance values) from the hyperprior networks contains discriminative statistical information about different patches, we also propose two efficient hyperprior-guided mode prediction methods. Specifically, taking the hyperprior information as input, we propose two mode prediction networks: one predicts the optimal block resolution for better motion coding, and the other decides whether to skip the residual information of each block for better residual coding, without introducing additional bit cost and with negligible extra computation. Comprehensive experimental results demonstrate that our proposed C2F video compression framework, equipped with the new hyperprior-guided mode prediction methods, achieves state-of-the-art performance on the HEVC, UVG, and MCL-JCV datasets.
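The skip-mode idea above can be illustrated with a toy sketch. The fixed variance threshold below is a deliberate simplification and purely an assumption: in the actual method a learned mode-prediction network makes this decision from the hyperprior statistics.

```python
# Toy sketch of hyperprior-guided residual skip prediction.
# Assumption (not from the paper): a block whose hyperprior variance is
# small carries little residual information, so its residual can be
# skipped. A learned mode-prediction network would replace this
# hand-set threshold.

def skip_decisions(hyperprior_vars, threshold=0.05):
    """Return a skip flag per block: True means skip residual coding."""
    return [v < threshold for v in hyperprior_vars]

def residual_bits(hyperprior_vars, bits_per_block, threshold=0.05):
    """Total residual bit cost after applying the skip decisions."""
    flags = skip_decisions(hyperprior_vars, threshold)
    return sum(b for b, skip in zip(bits_per_block, flags) if not skip)

variances = [0.01, 0.30, 0.02, 0.90]   # hypothetical per-block statistics
bits      = [120,  540,  130,  880]
print(skip_decisions(variances))       # blocks 0 and 2 are skipped
print(residual_bits(variances, bits))  # 540 + 880 = 1420
```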
In this paper, we address the problem of joint sensing, computation, and communication (SC$^{2}$) resource allocation for federated edge learning (FEEL) via a concrete case study of human motion recognition based on wireless sensing in ambient intelligence. First, by analyzing the wireless sensing process in human motion recognition, we find that there exists a threshold for the sensing transmit power, beyond which the sensed data samples have approximately the same satisfactory quality. Then, the joint SC$^{2}$ resource allocation problem is cast to maximize the convergence speed of FEEL under constraints on the training time, energy supply, and sensing quality of each edge device. Solving this problem entails solving two subproblems in order: the first reduces to determining the joint sensing and communication resource allocation that maximizes the total number of samples that can be sensed during the entire training process; the second concerns partitioning the attained total number of sensed samples over the communication rounds to determine the batch size at each round for convergence speed maximization. The first subproblem is converted into a single-variable optimization problem by exploiting the derived relation between the different control variables (resources), which allows an efficient solution via one-dimensional grid search. For the second subproblem, we find that the number of samples to be sensed (i.e., the batch size) at each round is a decreasing function of the loss value attained at that round. Based on this relationship, the approximately optimal batch size at each communication round is derived in closed form as a function of the round index. Finally, extensive simulation results validate the superiority of the proposed joint SC$^{2}$ resource allocation scheme.
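The one-dimensional grid search for the first subproblem can be sketched as follows. The objective `total_samples` is a hypothetical stand-in for the paper's derived single-variable objective: below a sensing-quality threshold no usable samples are produced, and above it the remaining energy budget bounds the sample count; all constants are illustrative assumptions.

```python
# Minimal sketch of the one-dimensional grid search for the first
# subproblem. total_samples() is a hypothetical stand-in for the
# derived objective, mapping the single remaining control variable
# (e.g., sensing transmit power p) to the total number of sensed samples.

def total_samples(p, p_min=0.2, energy=10.0, cost_per_sample=0.5):
    """Below the sensing-quality threshold p_min no usable samples are
    produced; above it, the energy left after sensing at power p bounds
    how many samples can be collected (all constants hypothetical)."""
    if p < p_min:
        return 0
    return int((energy - p) / cost_per_sample)

def grid_search(lo=0.0, hi=2.0, steps=2001):
    """Scan the interval [lo, hi] and keep the best point found."""
    best_p, best_val = lo, -1
    for i in range(steps):
        p = lo + (hi - lo) * i / (steps - 1)
        val = total_samples(p)
        if val > best_val:
            best_p, best_val = p, val
    return best_p, best_val

p_star, n_star = grid_search()
print(p_star, n_star)  # optimum sits at the quality threshold p_min
```

Because the objective is zero below the threshold and decreasing above it, the grid search lands on the threshold itself, mirroring the paper's observation that exceeding the sensing-power threshold yields no further quality gain.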
In this paper, we investigate an online prediction strategy named Discounted-Normal-Predictor (Kapralov and Panigrahy, 2010) for smoothed online convex optimization (SOCO), in which the learner needs to minimize not only the hitting cost but also the switching cost. In the setting of learning with expert advice, Daniely and Mansour (2019) demonstrated that Discounted-Normal-Predictor can be utilized to yield nearly optimal regret bounds over any interval, even in the presence of switching costs. Inspired by their results, we develop a simple algorithm for SOCO: combining online gradient descent (OGD) instances with different step sizes sequentially via Discounted-Normal-Predictor. Despite its simplicity, we prove that it minimizes the adaptive regret with switching cost, i.e., it attains nearly optimal regret with switching cost on every interval. By exploiting the theoretical guarantee of OGD for dynamic regret, we further show that the proposed algorithm can minimize the dynamic regret with switching cost on every interval.
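The two-level structure of the algorithm (base OGD experts with geometrically spaced step sizes, combined by a meta-learner) can be sketched as follows. For simplicity the combiner here is plain multiplicative weights (Hedge) standing in for Discounted-Normal-Predictor, and the losses are toy one-dimensional quadratics; both substitutions are assumptions for illustration only.

```python
import math

# Sketch of the meta-algorithm: run OGD instances with geometrically
# spaced step sizes and combine their predictions. The combiner here is
# multiplicative weights (Hedge), standing in for
# Discounted-Normal-Predictor; losses are toy quadratics
# f_t(x) = (x - 1)^2 with gradient 2(x - 1).

def grad(x):
    return 2.0 * (x - 1.0)

def loss(x):
    return (x - 1.0) ** 2

step_sizes = [0.4 / 2 ** k for k in range(4)]   # geometric grid of step sizes
experts = [0.0] * len(step_sizes)               # one OGD iterate per step size
weights = [1.0] * len(step_sizes)
eta = 0.5                                       # Hedge learning rate

for t in range(100):
    total = sum(weights)
    # combined play: weighted average of the experts' predictions
    x = sum(w * e for w, e in zip(weights, experts)) / total
    # penalize each expert by its own loss, then let each expert take
    # an OGD step with its own step size
    weights = [w * math.exp(-eta * loss(e)) for w, e in zip(weights, experts)]
    experts = [e - s * grad(e) for e, s in zip(experts, step_sizes)]

print(round(x, 3))  # the combined prediction approaches the minimizer 1.0
```

Gradually discounting the experts' weights is what lets such a combiner adapt on every interval rather than only globally, which is the property the paper needs for adaptive regret with switching cost.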
In order to achieve terabits-per-second (Tbps) data rates in the sixth-generation (6G) mobile system, wireless communications must exploit the abundant spectrum in the millimeter-wave (mmWave) and terahertz (THz) bands. However, high-frequency transmission heavily relies on high beamforming gain to compensate for severe propagation loss. A beam-based system faces a barrier during initial access, where a base station must broadcast synchronization signals and system information to all users within its coverage. Hence, this paper proposes a novel omnidirectional broadcasting scheme for mmWave and THz systems with hybrid beamforming. It provides instantaneously equal gain over all directions by forming complementary beams over sub-arrays. Numerical results verify that it achieves omnidirectional coverage and remarkably outperforms the previous scheme.
To meet the demand for terabits-per-second data rates, the next-generation mobile system needs to exploit the abundant spectrum in the millimeter-wave and terahertz bands. However, high-frequency transmission heavily relies on large-scale antenna arrays to reap high beamforming gain, which compensates for severe propagation loss. This raises the problem of omni-directional beamforming during initial access, where a base station is required to broadcast synchronization signals and system information to all users within its coverage. This paper proposes a novel initial beamforming scheme that provides instantaneous gain equally in all directions by forming a pair of complementary beams. Numerical results verify that it achieves omni-directional coverage with optimal performance, remarkably outperforming the previous scheme known as random beamforming. It is applicable to any form of large-scale array and to all three architectures, i.e., digital, analog, and hybrid beamforming.
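The defining property of a complementary beam pair is that the two power patterns sum to a constant in every direction. A classical way to obtain this is to drive the two beams with a Golay complementary pair; this construction is used below purely as an illustrative assumption (the papers' exact beam design may differ), and the sketch numerically checks the flatness of the combined gain.

```python
import cmath, math

# Complementary-beam idea: two beams whose power patterns sum to a
# constant over all directions. The Golay-pair construction below is an
# illustrative assumption; the exact design in the paper may differ.

a = [1, 1, 1, -1]    # length-4 Golay complementary pair
b = [1, 1, -1, 1]

def array_factor(weights, u):
    """Array factor at spatial frequency u = pi * sin(theta)
    for a uniform linear array with unit spacing phase."""
    return sum(w * cmath.exp(1j * u * n) for n, w in enumerate(weights))

# For a length-N Golay pair, |A(u)|^2 + |B(u)|^2 = 2N for every u,
# so the combined gain is 2 * 4 = 8 regardless of direction.
powers = []
for i in range(360):
    u = -math.pi + 2 * math.pi * i / 359
    total = abs(array_factor(a, u)) ** 2 + abs(array_factor(b, u)) ** 2
    powers.append(total)

print(max(powers) - min(powers))  # ~0: the combined pattern is flat
```

Each beam individually is directional (so per-beam gain is preserved), yet a receiver in any direction sees the same total broadcast power, which is exactly what initial access requires.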
Exploiting the degrees of freedom in the frequency domain and the near-far effect among different access points (APs), this paper proposes an opportunistic transmission scheme for cell-free massive MIMO-OFDM systems. The key idea is to assign subcarriers orthogonally among users so that there is only one user on each subcarrier. A user is then served only by its nearby APs through opportunistic selection, while the far APs are deactivated to avoid wasting power on channels with severe propagation loss. Moreover, the number of active APs per subcarrier becomes small due to the opportunistic selection, making the use of downlink pilots and coherent detection feasible. As corroborated by numerical results, the proposed scheme brings a significant performance boost in terms of both power efficiency and spectral efficiency.
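The opportunistic selection step can be sketched as follows: with one user per subcarrier, each user keeps only its strongest ("near") APs and the rest are deactivated. The gain matrix and the number of retained APs are hypothetical; the paper's actual selection rule may weigh other factors.

```python
# Toy sketch of the opportunistic AP selection: each subcarrier carries
# a single user, who is served only by its strongest ("near") APs; the
# remaining APs are deactivated. Gains and k are hypothetical.

def select_aps(gains, k=2):
    """gains[u][a]: large-scale channel gain from AP a to user u.
    Returns, per user, the indices of the k strongest APs."""
    selection = []
    for user_gains in gains:
        ranked = sorted(range(len(user_gains)),
                        key=lambda a: user_gains[a], reverse=True)
        selection.append(sorted(ranked[:k]))
    return selection

# 3 users (one per subcarrier), 4 APs
gains = [
    [0.9, 0.1, 0.05, 0.2],
    [0.1, 0.8, 0.6,  0.02],
    [0.3, 0.05, 0.1, 0.7],
]
print(select_aps(gains))  # [[0, 3], [1, 2], [0, 3]]
```

Keeping only k APs active per subcarrier is what makes downlink pilot overhead tractable: each user needs to estimate only k effective channels instead of one per AP.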
Intelligent reflecting surface (IRS) is a cost-efficient technique for improving power efficiency and spectral efficiency. However, IRS-aided multi-antenna transmission needs to jointly optimize the passive and active beamforming, imposing a high computational burden and high latency due to the iterative optimization process. Leveraging hybrid analog-digital beamforming in high-frequency transmission systems, a novel technique, coined dual-beam IRS, is proposed in this paper. The key idea is to form a pair of beams steered towards the IRS and the user, respectively. The optimization of passive and active beamforming can then be decoupled, resulting in a simplified system design. Simulation results corroborate that it achieves a good balance between cell-edge and cell-center performance: its gap to the performance bound is moderate, while it remarkably outperforms other sub-optimal schemes.
Visible-infrared person re-identification (VI-ReID) is a challenging and essential task that aims to retrieve a set of person images across visible and infrared camera views. To mitigate the impact of the large modality discrepancy between heterogeneous images, previous methods apply generative adversarial networks (GANs) to generate modality-consistent data. However, due to severe color variations between the visible and infrared domains, the generated fake cross-modality samples often fail to reach sufficient quality to fill the modality gap between the synthesized and target real scenarios, which leads to sub-optimal feature representations. In this work, we address the cross-modality matching problem with Aligned Grayscale Modality (AGM), a unified dark-line spectrum that reformulates visible-infrared dual-mode learning as a gray-gray single-mode learning problem. Specifically, we generate the grayscale modality from the homogeneous visible images. Then, we train a style transfer model to transfer infrared images into homogeneous grayscale images. In this way, the modality discrepancy is significantly reduced in the image space. To reduce the remaining appearance discrepancy, we further introduce a multi-granularity feature extraction network to conduct feature-level alignment. Rather than relying on global information alone, we propose to exploit local (head-shoulder) features to assist person Re-ID; the two complement each other to form a stronger feature descriptor. Comprehensive experiments on the mainstream evaluation datasets, including SYSU-MM01 and RegDB, indicate that our method significantly boosts cross-modality retrieval performance over state-of-the-art methods.
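The visible-side half of the AGM construction (generating the grayscale modality from RGB images) can be sketched with a standard luminance conversion. The ITU-R BT.601 weights below are an assumption, since the abstract does not specify the exact conversion; the infrared side requires the learned style-transfer model and is not shown.

```python
# Sketch of generating the grayscale modality from a visible RGB image.
# The luminance weights (ITU-R BT.601) are an assumption; the abstract
# does not specify the exact conversion. The infrared images are
# instead mapped to grayscale by a learned style-transfer model.

def rgb_to_grayscale(image):
    """image: H x W list of (r, g, b) tuples -> H x W grayscale values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in image]

img = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (255, 255, 255)]]
gray = rgb_to_grayscale(img)
print(gray)
```

Mapping both modalities into one shared grayscale space removes the color channel entirely, which is the source of the dominant visible-infrared discrepancy.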
Photorealistic rendering and reposing of humans are important for enabling augmented reality experiences. We propose a novel framework to reconstruct the human and the scene, which can then be rendered with novel human poses and views, from just a single in-the-wild video. Given a video captured by a moving camera, we train two NeRF models: a human NeRF model and a scene NeRF model. To train these models, we rely on existing methods to estimate the rough geometry of the human and the scene. These rough geometry estimates allow us to create a warping field from the observation space to the canonical pose-independent space, in which we train the human model. Our method is able to learn subject-specific details, including cloth wrinkles and accessories, from just a 10-second video clip, and to provide high-quality renderings of the human under novel poses and from novel views, together with the background.