Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"autonomous cars": models, code, and papers

Query Efficient Decision Based Sparse Attacks Against Black-Box Deep Learning Models

Jan 31, 2022
Viet Quoc Vo, Ehsan Abbasnejad, Damith C. Ranasinghe

Despite our best efforts, deep learning models remain highly vulnerable to even tiny adversarial perturbations applied to the inputs. The ability to extract information from solely the output of a machine learning model to craft adversarial perturbations to black-box models is a practical threat against real-world systems, such as autonomous cars or machine learning models exposed as a service (MLaaS). Of particular interest are sparse attacks. The realization of sparse attacks in black-box models demonstrates that machine learning models are more vulnerable than we believe. Because these attacks aim to minimize the number of perturbed pixels measured by l_0 norm-required to mislead a model by solely observing the decision (the predicted label) returned to a model query; the so-called decision-based attack setting. But, such an attack leads to an NP-hard optimization problem. We develop an evolution-based algorithm-SparseEvo-for the problem and evaluate against both convolutional deep neural networks and vision transformers. Notably, vision transformers are yet to be investigated under a decision-based attack setting. SparseEvo requires significantly fewer model queries than the state-of-the-art sparse attack Pointwise for both untargeted and targeted attacks. The attack algorithm, although conceptually simple, is also competitive with only a limited query budget against the state-of-the-art gradient-based whitebox attacks in standard computer vision tasks such as ImageNet. Importantly, the query efficient SparseEvo, along with decision-based attacks, in general, raise new questions regarding the safety of deployed systems and poses new directions to study and understand the robustness of machine learning models.

* Published as a conference paper at the International Conference on Learning Representations (ICLR 2022) 

EffCNet: An Efficient CondenseNet for Image Classification on NXP BlueBox

Nov 28, 2021
Priyank Kalgaonkar, Mohamed El-Sharkawy

Intelligent edge devices with built-in processors vary widely in terms of capability and physical form to perform advanced Computer Vision (CV) tasks such as image classification and object detection, for example. With constant advances in the field of autonomous cars and UAVs, embedded systems and mobile devices, there has been an ever-growing demand for extremely efficient Artificial Neural Networks (ANN) for real-time inference on these smart edge devices with constrained computational resources. With unreliable network connections in remote regions and an added complexity of data transmission, it is of an utmost importance to capture and process data locally instead of sending the data to cloud servers for remote processing. Edge devices on the other hand, offer limited processing power due to their inexpensive hardware, and limited cooling and computational resources. In this paper, we propose a novel deep convolutional neural network architecture called EffCNet which is an improved and an efficient version of CondenseNet Convolutional Neural Network (CNN) for edge devices utilizing self-querying data augmentation and depthwise separable convolutional strategies to improve real-time inference performance as well as reduce the final trained model size, trainable parameters, and Floating-Point Operations (FLOPs) of EffCNet CNN. Furthermore, extensive supervised image classification analyses are conducted on two benchmarking datasets: CIFAR-10 and CIFAR-100, to verify real-time inference performance of our proposed CNN. Finally, we deploy these trained weights on NXP BlueBox which is an intelligent edge development platform designed for self-driving vehicles and UAVs, and conclusions will be extrapolated accordingly.

* Vol. 5, No. 2, 2021, pp. 77-87 
* 11 pages, 10 figures, published in American Journal of Electrical and Computer Engineering 

Self-Supervised Learning of Depth and Camera Motion from 360° Videos

Nov 13, 2018
Fu-En Wang, Hou-Ning Hu, Hsien-Tzu Cheng, Juan-Ting Lin, Shang-Ta Yang, Meng-Li Shih, Hung-Kuo Chu, Min Sun

As 360{\deg} cameras become prevalent in many autonomous systems (e.g., self-driving cars and drones), efficient 360{\deg} perception becomes more and more important. We propose a novel self-supervised learning approach for predicting the omnidirectional depth and camera motion from a 360{\deg} video. In particular, starting from the SfMLearner, which is designed for cameras with normal field-of-view, we introduce three key features to process 360{\deg} images efficiently. Firstly, we convert each image from equirectangular projection to cubic projection in order to avoid image distortion. In each network layer, we use Cube Padding (CP), which pads intermediate features from adjacent faces, to avoid image boundaries. Secondly, we propose a novel "spherical" photometric consistency constraint on the whole viewing sphere. In this way, no pixel will be projected outside the image boundary which typically happens in images with normal field-of-view. Finally, rather than naively estimating six independent camera motions (i.e., naively applying SfM-Learner to each face on a cube), we propose a novel camera pose consistency loss to ensure the estimated camera motions reaching consensus. To train and evaluate our approach, we collect a new PanoSUNCG dataset containing a large amount of 360{\deg} videos with groundtruth depth and camera motion. Our approach achieves state-of-the-art depth prediction and camera motion estimation on PanoSUNCG with faster inference speed comparing to equirectangular. In real-world indoor videos, our approach can also achieve qualitatively reasonable depth prediction by acquiring model pre-trained on PanoSUNCG.

* ACCV 2018 Oral 

Learning Invariant World State Representations with Predictive Coding

Jul 06, 2022
Avi Ziskind, Sujeong Kim, Giedrius T. Burachas

Self-supervised learning methods overcome the key bottleneck for building more capable AI: limited availability of labeled data. However, one of the drawbacks of self-supervised architectures is that the representations that they learn are implicit and it is hard to extract meaningful information about the encoded world states, such as 3D structure of the visual scene encoded in a depth map. Moreover, in the visual domain such representations only rarely undergo evaluations that may be critical for downstream tasks, such as vision for autonomous cars. Herein, we propose a framework for evaluating visual representations for illumination invariance in the context of depth perception. We develop a new predictive coding-based architecture and a hybrid fully-supervised/self-supervised learning method. We propose a novel architecture that extends the predictive coding approach: PRedictive Lateral bottom-Up and top-Down Encoder-decoder Network (PreludeNet), which explicitly learns to infer and predict depth from video frames. In PreludeNet, the encoder's stack of predictive coding layers is trained in a self-supervised manner, while the predictive decoder is trained in a supervised manner to infer or predict the depth. We evaluate the robustness of our model on a new synthetic dataset, in which lighting conditions (such as overall illumination, and effect of shadows) can be be parametrically adjusted while keeping all other aspects of the world constant. PreludeNet achieves both competitive depth inference performance and next frame prediction accuracy. We also show how this new network architecture, coupled with the hybrid fully-supervised/self-supervised learning method, achieves balance between the said performance and invariance to changes in lighting. The proposed framework for evaluating visual representations can be extended to diverse task domains and invariance tests.

* 11 pages, 5 figures, submitted 

PixSelect: Less but Reliable Pixels for Accurate and Efficient Localization

Jun 08, 2022
Mohammad Altillawi

Accurate camera pose estimation is a fundamental requirement for numerous applications, such as autonomous driving, mobile robotics, and augmented reality. In this work, we address the problem of estimating the global 6 DoF camera pose from a single RGB image in a given environment. Previous works consider every part of the image valuable for localization. However, many image regions such as the sky, occlusions, and repetitive non-distinguishable patterns cannot be utilized for localization. In addition to adding unnecessary computation efforts, extracting and matching features from such regions produce many wrong matches which in turn degrades the localization accuracy and efficiency. Our work addresses this particular issue and shows by exploiting an interesting concept of sparse 3D models that we can exploit discriminatory environment parts and avoid useless image regions for the sake of a single image localization. Interestingly, through avoiding selecting keypoints from non-reliable image regions such as trees, bushes, cars, pedestrians, and occlusions, our work acts naturally as an outlier filter. This makes our system highly efficient in that minimal set of correspondences is needed and highly accurate as the number of outliers is low. Our work exceeds state-ofthe-art methods on outdoor Cambridge Landmarks dataset. With only relying on single image at inference, it outweighs in terms of accuracy methods that exploit pose priors and/or reference 3D models while being much faster. By choosing as little as 100 correspondences, it surpasses similar methods that localize from thousands of correspondences, while being more efficient. In particular, it achieves, compared to these methods, an improvement of localization by 33% on OldHospital scene. Furthermore, It outstands direct pose regressors even those that learn from sequence of images

* IEEE International Conference on Robotics and Automation (ICRA), May 23-27, 2022. Philadelphia, PA, USA, p 4156-4162 

Embodied AI-Driven Operation of Smart Cities: A Concise Review

Aug 22, 2021
Farzan Shenavarmasouleh, Farid Ghareh Mohammadi, M. Hadi Amini, Hamid R. Arabnia

A smart city can be seen as a framework, comprised of Information and Communication Technologies (ICT). An intelligent network of connected devices that collect data with their sensors and transmit them using cloud technologies in order to communicate with other assets in the ecosystem plays a pivotal role in this framework. Maximizing the quality of life of citizens, making better use of resources, cutting costs, and improving sustainability are the ultimate goals that a smart city is after. Hence, data collected from connected devices will continuously get thoroughly analyzed to gain better insights into the services that are being offered across the city; with this goal in mind that they can be used to make the whole system more efficient. Robots and physical machines are inseparable parts of a smart city. Embodied AI is the field of study that takes a deeper look into these and explores how they can fit into real-world environments. It focuses on learning through interaction with the surrounding environment, as opposed to Internet AI which tries to learn from static datasets. Embodied AI aims to train an agent that can See (Computer Vision), Talk (NLP), Navigate and Interact with its environment (Reinforcement Learning), and Reason (General Intelligence), all at the same time. Autonomous driving cars and personal companions are some of the examples that benefit from Embodied AI nowadays. In this paper, we attempt to do a concise review of this field. We will go through its definitions, its characteristics, and its current achievements along with different algorithms, approaches, and solutions that are being used in different components of it (e.g. Vision, NLP, RL). We will then explore all the available simulators and 3D interactable databases that will make the research in this area feasible. Finally, we will address its challenges and identify its potentials for future research.

* Cyberphysical Smart Cities Infrastructures: Optimal Operation and Intelligent Decision Making 2021 

Reliable counting of weakly labeled concepts by a single spiking neuron model

May 19, 2018
Hannes Rapp, Martin Paul Nawrot, Merav Stern

Making an informed, correct and quick decision can be life-saving. It's crucial for animals during an escape behaviour or for autonomous cars during driving. The decision can be complex and may involve an assessment of the amount of threats present and the nature of each threat. Thus, we should expect early sensory processing to supply classification information fast and accurately, even before relying the information to higher brain areas or more complex system components downstream. Today, advanced convolution artificial neural networks can successfully solve such tasks and are commonly used to build complex decision making systems. However, in order to achieve excellent performance on these tasks they require increasingly complex, "very deep" model structure, which is costly in inference run-time, energy consumption and training samples, only trainable on cloud-computing clusters. A single spiking neuron has been shown to be able to solve many of these required tasks for homogeneous Poisson input statistics, a commonly used model for spiking activity in the neocortex; when modeled as leaky integrate and fire with gradient decent learning algorithm it was shown to posses a wide variety of complex computational capabilities. Here we improve its learning algorithm. We also account for more natural stimulus generated inputs that deviate from this homogeneous Poisson spiking. The improved gradient-based local learning rule allows for significantly better and stable generalization and more efficient performance. We finally apply our model to a problem of multiple instance learning with counting where labels are only available for collections of concepts. In this counting MNIST task the neuron exploits the improved algorithm and succeeds while out performing the previously introduced single neuron learning algorithm as well as conventional ConvNet architecture under similar conditions.


Multi-Attention Network for Compressed Video Referring Object Segmentation

Jul 26, 2022
Weidong Chen, Dexiang Hong, Yuankai Qi, Zhenjun Han, Shuhui Wang, Laiyun Qing, Qingming Huang, Guorong Li

Referring video object segmentation aims to segment the object referred by a given language expression. Existing works typically require compressed video bitstream to be decoded to RGB frames before being segmented, which increases computation and storage requirements and ultimately slows the inference down. This may hamper its application in real-world computing resource limited scenarios, such as autonomous cars and drones. To alleviate this problem, in this paper, we explore the referring object segmentation task on compressed videos, namely on the original video data flow. Besides the inherent difficulty of the video referring object segmentation task itself, obtaining discriminative representation from compressed video is also rather challenging. To address this problem, we propose a multi-attention network which consists of dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract effective representation from compressed data in three modalities, i.e., I-frame, Motion Vector and Residual. The query-based cross-modal Transformer firstly models the correlation between linguistic and visual modalities, and then the fused multi-modality features are used to guide object queries to generate a content-aware dynamic kernel and to predict final segmentation masks. Different from previous works, we propose to learn just one kernel, which thus removes the complicated post mask-matching procedure of existing methods. Extensive promising experimental results on three challenging datasets show the effectiveness of our method compared against several state-of-the-art methods which are proposed for processing RGB data. Source code is available at:

* Accepted by ACM MM 2022 

FPS-Net: A Convolutional Fusion Network for Large-Scale LiDAR Point Cloud Segmentation

Mar 01, 2021
Aoran Xiao, Xiaofei Yang, Shijian Lu, Dayan Guan, Jiaxing Huang

Scene understanding based on LiDAR point cloud is an essential task for autonomous cars to drive safely, which often employs spherical projection to map 3D point cloud into multi-channel 2D images for semantic segmentation. Most existing methods simply stack different point attributes/modalities (e.g. coordinates, intensity, depth, etc.) as image channels to increase information capacity, but ignore distinct characteristics of point attributes in different image channels. We design FPS-Net, a convolutional fusion network that exploits the uniqueness and discrepancy among the projected image channels for optimal point cloud segmentation. FPS-Net adopts an encoder-decoder structure. Instead of simply stacking multiple channel images as a single input, we group them into different modalities to first learn modality-specific features separately and then map the learned features into a common high-dimensional feature space for pixel-level fusion and learning. Specifically, we design a residual dense block with multiple receptive fields as a building block in the encoder which preserves detailed information in each modality and learns hierarchical modality-specific and fused features effectively. In the FPS-Net decoder, we use a recurrent convolution block likewise to hierarchically decode fused features into output space for pixel-level classification. Extensive experiments conducted on two widely adopted point cloud datasets show that FPS-Net achieves superior semantic segmentation as compared with state-of-the-art projection-based methods. In addition, the proposed modality fusion idea is compatible with typical projection-based methods and can be incorporated into them with consistent performance improvements.

* 20 pages, 7 figures 

Cooperating with Machines

Feb 21, 2018
Jacob W. Crandall, Mayada Oudah, Tennom, Fatimah Ishowo-Oloko, Sherief Abdallah, Jean-François Bonnefon, Manuel Cebrian, Azim Shariff, Michael A. Goodrich, Iyad Rahwan

Since Alan Turing envisioned Artificial Intelligence (AI) [1], a major driving force behind technical progress has been competition with human cognition. Historical milestones have been frequently associated with computers matching or outperforming humans in difficult cognitive tasks (e.g. face recognition [2], personality classification [3], driving cars [4], or playing video games [5]), or defeating humans in strategic zero-sum encounters (e.g. Chess [6], Checkers [7], Jeopardy! [8], Poker [9], or Go [10]). In contrast, less attention has been given to developing autonomous machines that establish mutually cooperative relationships with people who may not share the machine's preferences. A main challenge has been that human cooperation does not require sheer computational power, but rather relies on intuition [11], cultural norms [12], emotions and signals [13, 14, 15, 16], and pre-evolved dispositions toward cooperation [17], common-sense mechanisms that are difficult to encode in machines for arbitrary contexts. Here, we combine a state-of-the-art machine-learning algorithm with novel mechanisms for generating and acting on signals to produce a new learning algorithm that cooperates with people and other machines at levels that rival human cooperation in a variety of two-player repeated stochastic games. This is the first general-purpose algorithm that is capable, given a description of a previously unseen game environment, of learning to cooperate with people within short timescales in scenarios previously unanticipated by algorithm designers. This is achieved without complex opponent modeling or higher-order theories of mind, thus showing that flexible, fast, and general human-machine cooperation is computationally achievable using a non-trivial, but ultimately simple, set of algorithmic mechanisms.

* Nature Communications, Vol. 9, Article No. 233, 2018 
* An updated version of this paper was published in Nature Communications