Cardiac patterns are being used to obtain hard-to-forge biometric signatures and have led to high accuracy in state-of-the-art (SoA) identification applications. However, this performance is obtained under controlled scenarios where cardiac signals maintain a relatively uniform pattern, facilitating the identification process. In this work, we analyze cardiac signals collected in more realistic (uncontrolled) scenarios and show that their high signal variability (i.e., irregularity) makes it harder to obtain stable and distinct user features. Furthermore, SoA usually fails to identify specific groups of users, rendering existing identification methods futile in uncontrolled scenarios. To solve these problems, we propose a framework with three novel properties. First, we design an adaptive method that achieves stable and distinct features by tailoring the filtering spectrum to each user. Second, we show that users can have multiple cardiac morphologies, offering us a much bigger pool of cardiac signals and users compared to SoA. Third, we overcome other distortion effects present in authentication applications with a multi-cluster approach and the Mahalanobis distance. Our evaluation shows that the average balanced accuracy (BAC) of SoA drops from above 90% in controlled scenarios to 75% in uncontrolled ones, while our method maintains an average BAC above 90% in uncontrolled scenarios.
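The multi-cluster idea with the Mahalanobis distance can be illustrated with a minimal sketch. It assumes that a user's enrolled feature vectors have already been grouped into clusters (for instance, one per cardiac morphology) and that a sample is accepted when it lies close enough to any cluster. The function names, the covariance regularization, and the threshold value are illustrative assumptions, not details taken from the work summarized above.

```python
import numpy as np

def fit_clusters(features, labels):
    """Return a (mean, inverse covariance) pair for every cluster label."""
    clusters = []
    for k in np.unique(labels):
        pts = features[labels == k]
        mean = pts.mean(axis=0)
        # Small ridge term (assumed) keeps the covariance invertible.
        cov = np.cov(pts, rowvar=False) + 1e-6 * np.eye(pts.shape[1])
        clusters.append((mean, np.linalg.inv(cov)))
    return clusters

def min_mahalanobis(x, clusters):
    """Smallest Mahalanobis distance from x to any of the user's clusters."""
    dists = []
    for mean, inv_cov in clusters:
        d = x - mean
        dists.append(float(np.sqrt(d @ inv_cov @ d)))
    return min(dists)

def authenticate(x, clusters, threshold=3.0):
    """Accept x if it is within `threshold` (hypothetical value) of a cluster."""
    return min_mahalanobis(x, clusters) <= threshold
```

Taking the minimum over clusters lets a user with several distinct signal shapes be matched against whichever shape the new sample resembles most, rather than against a single averaged template.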
Federated learning (FL) is widely used in the Internet of Things (IoT), wireless networks, mobile devices, autonomous vehicles, and human activity due to its excellent potential in cybersecurity and privacy protection. Though FL can achieve privacy-safe and reliable collaborative training without collecting users' private data, it faces many challenges during both training and deployment. The main challenges in FL are the difficulty of co-training on non-i.i.d. data, caused by the statistical diversity of the data from various participants, and the difficulty of application deployment, caused by the excessive traffic volume and long communication delay between the central server and the clients. To address these problems, we propose a sparse FL scheme with hierarchical personalization models (sFedHP), which minimizes clients' loss functions including the properties of an approximated L1-norm and the hierarchical proximal mapping, to reduce the communication and computation loads required in the network while improving the performance on statistically diverse data. Convergence analysis shows that the sparse constraint in sFedHP only reduces the convergence speed to a small extent, while the communication cost is greatly reduced. Experimentally, we demonstrate the benefits of this sparse hierarchical personalization architecture compared with the client-edge-cloud hierarchical FedAvg and the state-of-the-art personalization methods.
Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, enabling a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to such a strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer requires only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatch), and visual localization (InLoc). Code will be made publicly available at https://github.com/jamycheung/MatchFormer.
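The interleaving of self-attention (extraction) and cross-attention (matching) inside one encoder stage can be sketched as follows. This is a deliberately simplified illustration using plain scaled dot-product attention without learned projections, multiple heads, or positional encoding; the function names and the residual connections are assumptions and do not reproduce the actual MatchFormer layers.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, context):
    """Scaled dot-product attention; context serves as both keys and values."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context

def extract_and_match_stage(feat_a, feat_b):
    """One simplified stage operating on two images' token features."""
    # Self-attention: each image attends to its own tokens (extraction).
    a = feat_a + attention(feat_a, feat_a)
    b = feat_b + attention(feat_b, feat_b)
    # Cross-attention: each image attends to the other image (matching).
    a_out = a + attention(a, b)
    b_out = b + attention(b, a)
    return a_out, b_out
```

Because the cross-attention step already mixes information between the two images inside the encoder, later stages receive features that are aware of the other view, which is the intuition behind shifting matching work away from the decoder.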
In this paper, we propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC). We employ pre-trained networks trained only on image data sets to extract video embeddings, whereas we train the audio embedding models from scratch. We explore different neural network architectures for joint modeling to effectively combine the video and audio modalities. Moreover, data augmentation strategies are investigated to increase the audio-visual training set size. For the video modality, the effectiveness of several operations in RandAugment is verified. An audio-video joint mixup scheme is proposed to further improve AVSC performance. Evaluated on the development set of TAU Urban Audio Visual Scenes 2021, our final system achieves the best accuracy of 94.2% among all single AVSC systems submitted to DCASE 2021 Task 1b.
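One plausible reading of an audio-video joint mixup is that the same mixing coefficient and the same sample pairing are applied to both modalities and to the labels, so that each mixed audio clip stays aligned with its mixed video clip. The sketch below follows that reading; the function name, the Beta parameter `alpha`, and the use of one-hot labels are assumptions rather than details given above.

```python
import numpy as np

def joint_mixup(audio, video, labels, alpha=0.2, rng=None):
    """Mix a batch with one shared coefficient and one shared permutation.

    audio, video: arrays whose first axis is the batch dimension.
    labels: one-hot array of shape (batch, num_classes).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)          # shared mixing coefficient
    perm = rng.permutation(len(labels))   # shared partner for every sample

    def mix(x):
        return lam * x + (1.0 - lam) * x[perm]

    return mix(audio), mix(video), mix(labels), lam
```

Sharing `lam` and `perm` across modalities is the design choice that makes the scheme "joint": mixing each modality independently would pair an audio blend with a video blend drawn from different scenes.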
In this paper, we introduce VCSL (Video Copy Segment Localization), a new comprehensive segment-level annotated video copy dataset. Compared with existing copy detection datasets, which are restricted by either video-level annotation or small scale, VCSL not only has two orders of magnitude more segment-level labelled data, with 160k realistic video copy pairs containing more than 280k localized copied segment pairs, but also covers a variety of video categories and a wide range of video durations. All the copied segments inside each collected video pair are manually extracted and accompanied by precisely annotated starting and ending timestamps. Alongside the dataset, we also propose a novel evaluation protocol that better measures the prediction accuracy of copy overlapping segments between a video pair and shows improved adaptability in different scenarios. By benchmarking several baseline and state-of-the-art segment-level video copy detection methods with the proposed dataset and evaluation metric, we provide a comprehensive analysis that uncovers the strengths and weaknesses of current approaches, hoping to open up promising directions for future work. The VCSL dataset, metric and benchmark codes are all publicly available at https://github.com/alipay/VCSL.
In this work, two problems associated with a downlink multi-user system aided by an intelligent reflecting surface (IRS) are considered: weighted sum-rate maximization and weighted minimal-rate maximization. For the first problem, a novel DOuble Manifold ALternating Optimization (DOMALO) algorithm is proposed by exploiting matrix manifold theory and representing the beamforming matrix and the reflection vector using the complex sphere manifold and the complex oblique manifold, respectively, which incorporate the inherent geometrical structure and the required constraints. A smooth double manifold alternating optimization (S-DOMALO) algorithm is then developed for the second problem, based on the Dinkelbach-type algorithm and a smooth exponential penalty function. Finally, the possible cooperative beamforming gain between IRSs and the IRS phase shift with limited resolution are studied, providing a reference for practical implementation. Numerical results show that our proposed algorithms can significantly outperform the benchmark schemes.
The problem of extreme multi-label text classification (XMTC) is to recall the most relevant labels for a text from an extremely large label set. Though methods based on deep pre-trained models have achieved significant results, the pre-trained models are still not fully utilized. Label semantics has not attracted much attention so far, and the latent space between texts and labels has not been effectively explored. This paper constructs a novel guide network (GUDN) to help fine-tune the pre-trained model and to instruct the later classification. Also, we use the raw label semantics to effectively explore the latent space between texts and labels, which can further improve prediction accuracy. Experimental results demonstrate that GUDN outperforms state-of-the-art methods on several popular datasets. Our source code is released at https://github.com/wq2581/GUDN.
Humans' ability to transfer knowledge through teaching is one of the essential aspects of human intelligence. A human teacher can track the knowledge of students to customize the teaching to students' needs. With the rise of online education platforms, there is a similar need for machines to track the knowledge of students and tailor their learning experience. This is known as the Knowledge Tracing (KT) problem in the literature. Effectively solving the KT problem would unlock the potential of computer-aided education applications such as intelligent tutoring systems, curriculum learning, and learning material recommendation. Moreover, from a more general viewpoint, a student may represent any kind of intelligent agent, including both human and artificial agents. Thus, the potential of KT can be extended to any machine teaching application scenario that seeks to customize the learning experience for a student agent (i.e., a machine learning model). In this paper, we provide a comprehensive and systematic review of the KT literature. We cover a broad range of methods, from the early attempts to the recent state-of-the-art methods using deep learning, while highlighting the theoretical aspects of models and the characteristics of benchmark datasets. Besides these, we shed light on key modelling differences between closely related methods and summarize them in an easy-to-understand format. Finally, we discuss current research gaps in the KT literature and possible future research and application directions.
Artificial intelligence (AI) based device identification improves the security of the Internet of Things (IoT) and accelerates the authentication process. However, existing approaches rely on the assumption that we can learn all the classes from the training set, namely, closed-set classification. To overcome the closed-set limitation, we propose a novel open-set RF device identification method to classify unseen classes in the testing set. First, we design a specific convolutional neural network (CNN) with a short-time Fourier transform (STFT) pre-processing module, which efficiently recognizes the differences between feature maps learned from various RF device signals. Then, to generate a representation of known class bounds, we estimate the probability map of the open set via the OpenMax function. We conduct experiments on sampled data and voice signal sets, considering various pre-processing schemes, network structures, distance metrics, tail sizes, and openness degrees. The simulation results show the superiority of the proposed method in terms of robustness and accuracy.
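An STFT pre-processing step of the kind mentioned above can be sketched in a few lines. This is a generic magnitude spectrogram for a complex baseband signal, assuming a Hann window and illustrative frame and hop sizes; it is not the specific module or parameter choice of the described method.

```python
import numpy as np

def stft_magnitude(signal, frame_len=64, hop=32):
    """Magnitude STFT of a 1-D (possibly complex) signal.

    Returns an array of shape (frame_len, n_frames): frequency bins by frames.
    frame_len and hop are assumed values chosen for illustration.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Full FFT (not rfft) because complex I/Q samples have no spectral symmetry.
    return np.abs(np.fft.fft(frames, axis=1)).T
```

The resulting two-dimensional time-frequency map can be fed to a CNN as a single-channel image, which is the usual reason for placing an STFT in front of a convolutional classifier.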
Image harmonization aims at adjusting the appearance of the foreground to make it more compatible with the background. Due to a lack of understanding of the background illumination direction, existing works are incapable of generating realistic foreground shading. In this paper, we decompose image harmonization into two sub-problems: 1) illumination estimation of background images and 2) rendering of foreground objects. Before solving these two sub-problems, we first learn a direction-aware illumination descriptor via a neural rendering framework, the key component of which is a Shading Module that decomposes the shading field into multiple shading components given depth information. Then we design a Background Illumination Estimation Module to extract the direction-aware illumination descriptor from the background. Finally, the illumination descriptor is used in conjunction with the neural rendering framework to generate the harmonized foreground image containing a novel harmonized shading. Moreover, we construct a photo-realistic synthetic image harmonization dataset that contains numerous shading variations by image-based lighting. Extensive experiments on this dataset demonstrate the effectiveness of the proposed method. Our dataset and code will be made publicly available.