"Time": models, code, and papers

Sub-word Level Lip Reading With Visual Attention

Oct 14, 2021
Prajwal K R, Triantafyllos Afouras, Andrew Zisserman

The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper we focus on the unique challenges encountered in lip reading and propose tailored solutions. To that end we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a training pipeline that balances the lip reading performance with other key factors such as data and compute efficiency. Following the above, we obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets by using an order of magnitude less data. Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models, significantly reducing the performance gap between lip reading and automatic speech recognition.
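
The 22.6% figure above is a word error rate (WER). As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the number of reference words), and not the authors' evaluation code, the function name and example sentences below are illustrative assumptions:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.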

Human-Computer Interaction Glow Up: Examining Operational Trust and Intention Towards Mars Autonomous Systems

Oct 28, 2021
Thomas Chan, Jeremy Argueta, Jazlyn Armendariz, Allison Graham, Sarah Hwang, Basak Ramaswamy, So Young Kim, Scott Davidoff

Tactful coordination on Earth between hundreds of operators from diverse disciplines and backgrounds is needed to ensure that Martian rovers have a high likelihood of achieving their science goals while enduring the harsh environment of the red planet. The operations team includes many individuals, each with independent and overlapping objectives, working to decide what to execute on the Mars surface during the next planning period. The team must work together to understand each other's objectives and constraints within a fixed time period, often requiring frequent revision. This study examines the challenges faced during Mars surface operations, from high-level science objectives to formulating a valid, safe, and optimal activity plan that is ready to be radiated to the rover. Through this examination, we aim to illuminate how planning intent can be formulated and effectively communicated to future spacecraft that will become more and more autonomous. Our findings reveal the intricate nature of human-to-human interactions that require a large array of soft skills and core competencies to communicate concurrently with science and engineering teams during plan formulation. Additionally, our findings expose significant challenges in eliciting planning intent from operators, which will intensify in the future, as operators on the ground asynchronously co-operate the rover with the onboard autonomy. Building a marvellous robot and landing it on the Mars surface are remarkable feats; however, ensuring that scientists can get the best out of the mission is an ongoing challenge and will not cease to be a difficult task with increased autonomy.

* 9 pages, 1 figure, to appear in Proceedings of the 2021 American Institute of Aeronautics and Astronautics ASCEND Conference (AIAA ASCEND 2021)

Patch Based Transformation for Minimum Variance Beamformer Image Approximation Using Delay and Sum Pipeline

Oct 19, 2021
Sairoop Bodepudi, A N Madhavanunni, Mahesh Raveendranatha Panicker

In the recent past, there have been several efforts to accelerate computationally heavy beamforming algorithms such as minimum variance distortionless response (MVDR) beamforming to achieve real-time performance comparable to the popular delay and sum (DAS) beamforming. This has been achieved using a variety of neural network architectures, ranging from fully connected neural networks (FCNNs) and convolutional neural networks (CNNs) to generative adversarial networks (GANs). However, most of these approaches optimize image-level losses and hence require a significant amount of data to ensure that the process of beamforming is learned. In this work, a patch-level U-Net based neural network is proposed, where the delay-compensated radio frequency (RF) patch for a fixed region in space (e.g. 32x32) is transformed through a U-Net architecture, multiplied with DAS apodization weights, and optimized for similarity with the MVDR image of the patch. Instead of framing the beamforming problem as a regression problem to estimate the apodization weights, the proposed approach treats it as a non-linear transformation of the RF data space, so that the data-driven weight adaptation done by the MVDR approach is accounted for in the parameters of the network. In this way, it is also observed that by restricting the input to a patch, the model learns the beamforming pipeline as a non-linear image transformation problem.

* 6 pages, 3 figures
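
For readers unfamiliar with the DAS baseline mentioned in the abstract, the following is a minimal sketch of delay-and-sum with fixed apodization weights. It is not the authors' pipeline: the function name, the integer per-channel sample delays, and the toy three-channel data are assumptions chosen for illustration.

```python
from typing import List, Sequence


def delay_and_sum(
    rf: Sequence[Sequence[float]],
    delays: Sequence[int],
    weights: Sequence[float],
) -> List[float]:
    """Align each channel by its sample delay, weight it, and sum across channels.

    rf[c][n] is sample n of channel c; delays[c] is a non-negative integer
    delay in samples (an assumption; real systems interpolate fractional delays).
    """
    if not (len(rf) == len(delays) == len(weights)):
        raise ValueError("rf, delays and weights need one entry per channel")
    # Only produce output samples for which every delayed channel has data.
    n_out = min(len(channel) - d for channel, d in zip(rf, delays))
    return [
        sum(w * channel[n + d] for channel, d, w in zip(rf, delays, weights))
        for n in range(n_out)
    ]


# Toy data: one echo reaches channels 0, 1, 2 at samples 1, 2, 3 respectively.
rf_data = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
beamformed = delay_and_sum(rf_data, delays=[0, 1, 2], weights=[0.25, 0.5, 0.25])
```

With these delays the three echoes line up at output sample 1, so their weighted contributions add coherently there. MVDR differs in that the weights are adapted to the data rather than fixed.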

3N-GAN: Semi-Supervised Classification of X-Ray Images with a 3-Player Adversarial Framework

Sep 22, 2021
Shafin Haque, Ayaan Haque

The success of deep learning for medical imaging tasks, such as classification, is heavily reliant on the availability of large-scale datasets. However, acquiring datasets with large quantities of labeled data is challenging, as labeling is expensive and time-consuming. Semi-supervised learning (SSL) is a growing alternative to fully-supervised learning, but requires unlabeled samples for training. In medical imaging, many datasets lack unlabeled data entirely, so SSL can't be conventionally utilized. We propose 3N-GAN, or 3 Network Generative Adversarial Networks, to perform semi-supervised classification of medical images in fully-supervised settings. We incorporate a classifier into the adversarial relationship such that the generator trains adversarially against both the classifier and discriminator. Our preliminary results show improved classification performance and GAN generations over various algorithms. Our work can seamlessly integrate with numerous other medical imaging model architectures and SSL methods for greater performance.

* 5 pages, 2 figures; authors contributed equally

Efficient Explanations for Knowledge Compilation Languages

Jul 04, 2021
Xuanxiang Huang, Yacine Izza, Alexey Ignatiev, Martin C. Cooper, Nicholas Asher, Joao Marques-Silva

Knowledge compilation (KC) languages find a growing number of practical uses, including in Constraint Programming (CP) and in Machine Learning (ML). In most applications, one natural question is how to explain the decisions made by models represented by a KC language. This paper shows that for many of the best known KC languages, well-known classes of explanations can be computed in polynomial time. These languages include deterministic decomposable negation normal form (d-DNNF), and therefore any KC language that is strictly less succinct than d-DNNF. Furthermore, the paper also investigates the conditions under which polynomial time computation of explanations can be extended to KC languages more succinct than d-DNNF.

Dynamic Gesture Recognition

Sep 22, 2021
Jonas Bokstaller, Costanza Maria Improta

The Human-Machine Interaction (HMI) research field is an important topic in machine learning that has been deeply investigated thanks to the rise of computing power in recent years. For the first time, it is possible to use machine learning to classify images and/or videos instead of traditional computer vision algorithms. The aim of this project is to build a symbiosis between a convolutional neural network (CNN) [1] and a recurrent neural network (RNN) [2] to recognize cultural/anthropological Italian sign language gestures from videos. The CNN extracts important features that are later used by the RNN. With RNNs we are able to store temporal information inside the model to provide contextual information from previous frames and enhance the prediction accuracy. Our novel approach uses different data augmentation techniques and regularization methods on RGB frames only to avoid overfitting and provide a small generalization error.

* 3 pages, 5 figures

A Dynamic Keypoints Selection Network for 6DoF Pose Estimation

Oct 24, 2021
Haowen Sun, Taiyong Wang

The 6DoF pose estimation problem aims to estimate the rotation and translation parameters between two coordinate systems, such as the object world coordinate system and the camera world coordinate system. Although some advances have been made with the help of deep learning, how to make full use of scene information is still a problem. Prior works tackle the problem by pixel-wise feature fusion but need to randomly select numerous points from images, which cannot satisfy the demands of fast inference and accurate pose estimation simultaneously. In this work, we present a novel deep neural network based on dynamic keypoints selection designed for 6DoF pose estimation from a single RGBD image. Our network includes three parts: instance semantic segmentation, edge points detection and 6DoF pose estimation. Given an RGBD image, our network is trained to predict the pixel category and the translation to edge points and center points. Then, least-squares fitting is applied to estimate the 6DoF pose parameters. Specifically, we propose a dynamic keypoints selection algorithm to choose keypoints from the foreground feature map. It allows us to leverage geometric and appearance information. During 6DoF pose estimation, we utilize the instance semantic segmentation result to filter out background points and only use foreground points to finish edge points detection and 6DoF pose estimation. Experiments on two commonly used 6DoF estimation benchmark datasets, YCB-Video and LineMoD, demonstrate that our method outperforms the state-of-the-art methods and achieves significant improvements in time efficiency over other methods in the same category.

* 13 pages, 4 figures

A New Backbone for Hyperspectral Image Reconstruction

Aug 17, 2021
Jiamian Wang, Yulun Zhang, Xin Yuan, Yun Fu, Zhiqiang Tao

The study of 3D hyperspectral image (HSI) reconstruction refers to the inverse process of snapshot compressive imaging, during which the optical system, e.g., the coded aperture snapshot spectral imaging (CASSI) system, captures the 3D spatial-spectral signal and encodes it to a 2D measurement. While numerous sophisticated neural networks have been elaborated for end-to-end reconstruction, trade-offs still need to be made among performance, efficiency (training and inference time), and feasibility (the ability to restore high-resolution HSI on limited GPU memory). This raises the challenge of designing a new baseline that jointly meets the above requirements. In this paper, we fill this gap by proposing a Spatial/Spectral Invariant Residual U-Net, namely SSI-ResU-Net. It differs from U-Net in three respects: 1) scale/spectral-invariant learning, 2) nested residual learning, and 3) computational efficiency. Benefiting from these three modules, the proposed SSI-ResU-Net outperforms the current state-of-the-art method TSA-Net by over 3 dB in PSNR and 0.036 in SSIM while using only 2.82% of the trainable parameters. To the greatest extent, SSI-ResU-Net achieves competitive performance with over 77.3% reduction in floating-point operations (FLOPs), which, for the first time, makes high-resolution HSI reconstruction feasible under practical application scenarios. Code and pre-trained models are made available at https://github.com/Jiamian-Wang/HSI_baseline.
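
To put the "over 3 dB in PSNR" claim in context, here is a small sketch of the standard PSNR definition, 10 * log10(peak^2 / MSE). It is not taken from the linked repository; the function names, the flat-list image representation, and the peak value of 1.0 are assumptions for illustration.

```python
import math
from typing import Sequence


def mean_squared_error(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Average squared difference between two equally sized signals."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference: Sequence[float], estimate: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for a perfect reconstruction."""
    mse = mean_squared_error(reference, estimate)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical pixel values in [0, 1]; every estimate is off by 0.1.
truth = [0.2, 0.4, 0.6, 0.8]
coarse = [value + 0.1 for value in truth]
# Shrinking the error by sqrt(2) halves the MSE.
finer = [value + 0.1 / math.sqrt(2) for value in truth]
gain_db = psnr(truth, finer) - psnr(truth, coarse)
```

Since 10 * log10(2) is about 3.01, a 3 dB PSNR gain corresponds to roughly halving the mean squared error.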

An Internal Clock Based Space-time Neural Network for Motion Speed Recognition

Jan 28, 2020
Junwen Luo, Jiaoyan Chen

In this work we present a novel internal clock based space-time neural network for motion speed recognition. The developed system has a spike train encoder, a Spiking Neural Network (SNN) with internal clocking behaviors, a pattern transformation block and a Network Dynamic Dependent Plasticity (NDDP) learning block. The core principle is that the developed SNN will automatically tune its network pattern frequency (internal clock frequency) to recognize human motions in a speed domain. We employed both cartoons and real-world videos as training benchmarks. Results demonstrate that our system can recognize not only motions with considerable speed differences (e.g. run, walk, jump, wonder (think) and standstill), but also motions with subtle speed gaps such as run and fast walk. The inference accuracy can be up to 83.3% (cartoon videos) and 75% (real-world videos). Meanwhile, the system requires only six video datasets in the learning stage, with up to 42 training trials. Hardware performance estimation indicates that the training time is 0.84-4.35s and power consumption is 33.26-201mW (based on an ARM Cortex M4 processor). Therefore, our system offers the unique advantages of a small dataset requirement, quick learning and low power consumption, which shows great potential for edge or scalable AI-based applications.

* To appear in Neuro-inspired Computational Elements Workshop (NICE 20), March 26-28, 2020, Heidelberg, Germany, 8 pages

Neural Scene Flow Prior

Nov 01, 2021
Xueqian Li, Jhony Kaesemodel Pontes, Simon Lucey

Before the deep learning revolution, many perception algorithms were based on runtime optimization in conjunction with a strong prior/regularization penalty. A prime example of this in computer vision is optical and scene flow. Supervised learning has largely displaced the need for explicit regularization. Instead, supervised methods rely on large amounts of labeled data to capture prior statistics, which are not always readily available for many problems. Although optimization is employed to learn the neural network, the weights of this network are frozen at runtime. As a result, these learning solutions are domain-specific and do not generalize well to other statistically different scenarios. This paper revisits the scene flow problem that relies predominantly on runtime optimization and strong regularization. A central innovation here is the inclusion of a neural scene flow prior, which uses the architecture of neural networks as a new type of implicit regularizer. Unlike learning-based scene flow methods, optimization occurs at runtime, and our approach needs no offline datasets -- making it ideal for deployment in new environments such as autonomous driving. We show that an architecture based exclusively on multilayer perceptrons (MLPs) can be used as a scene flow prior. Our method attains competitive -- if not better -- results on scene flow benchmarks. Also, our neural prior's implicit and continuous scene flow representation allows us to estimate dense long-term correspondences across a sequence of point clouds. The dense motion information is represented by scene flow fields where points can be propagated through time by integrating motion vectors. We demonstrate such a capability by accumulating a sequence of lidar point clouds.

* accepted by NeurIPS 2021 as "spotlight"
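
The abstract's idea of propagating points through time by integrating motion vectors can be sketched as a simple forward (Euler) accumulation. This is an illustration under stated assumptions, not the authors' code: in the paper the flow comes from an MLP optimized at runtime, whereas the two hand-written flow functions below are hypothetical stand-ins, and all names are invented for this example.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
FlowField = Callable[[Point], Point]  # maps a 3D position to its motion vector


def propagate(points: Sequence[Point], flow_fields: Sequence[FlowField]) -> List[List[Point]]:
    """Move every point through successive flow fields, one field per time step.

    Returns the point positions at every step, starting with the input.
    """
    current = list(points)
    trajectory = [current]
    for flow in flow_fields:
        moved = []
        for p in current:
            dx, dy, dz = flow(p)
            moved.append((p[0] + dx, p[1] + dy, p[2] + dz))
        current = moved
        trajectory.append(current)
    return trajectory


# Hypothetical flow fields standing in for a learned continuous representation.
def translate(p: Point) -> Point:
    return (1.0, 0.0, 0.0)  # every point moves one unit along x


def shear(p: Point) -> Point:
    return (0.0, 0.5 * p[0], 0.0)  # motion along y grows with the x position


steps = propagate([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [translate, shear])
final_positions = steps[-1]
```

Because each flow field is a function of position rather than a table indexed by point, it can be evaluated at any location, which is what allows correspondences to be chained across many frames.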
