Adapting visual object detectors to operational target domains is a challenging task, commonly addressed with unsupervised domain adaptation (UDA) methods. When the labeled data come from multiple source domains, treating them as separate domains and performing multi-source domain adaptation (MSDA) improves accuracy and robustness over blending the source domains and performing UDA, as observed in recent MSDA studies. Existing MSDA methods learn domain-invariant parameters together with domain-specific parameters (one set per source domain) for the adaptation. However, unlike single-source UDA methods, learning domain-specific parameters makes the model grow significantly in proportion to the number of source domains. This paper proposes a novel MSDA method called Prototype-based Mean-Teacher (PMT), which uses class prototypes instead of domain-specific subnets to preserve domain-specific information. These prototypes are learned with a contrastive loss that aligns the same categories across domains and pushes different categories far apart. Because it relies on prototypes, the parameter count of our method does not increase significantly with the number of source domains, reducing memory issues and the risk of overfitting. Empirical studies show that PMT outperforms state-of-the-art MSDA methods on several challenging object detection datasets.
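To make the prototype idea concrete, here is a minimal sketch of a prototype-based contrastive loss: one learnable prototype per class shared across all source domains, with instance embeddings pulled toward their class prototype and pushed away from the others. All names, shapes, and the temperature are illustrative assumptions, not the authors' implementation, which additionally involves a mean-teacher framework.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(features, labels, prototypes, tau=0.1):
    """InfoNCE-style loss over class prototypes (illustrative sketch).

    features:   (N, D) instance embeddings pooled from detections
    labels:     (N,)   class indices of those instances
    prototypes: (C, D) one learnable prototype per class, shared by all domains
    tau:        softmax temperature (assumed value)
    """
    feats = F.normalize(features, dim=1)
    protos = F.normalize(prototypes, dim=1)
    logits = feats @ protos.t() / tau        # (N, C) cosine similarities
    # cross-entropy attracts each instance to its own class prototype
    # and repels it from every other class, across all source domains
    return F.cross_entropy(logits, labels)
```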
Starting with small, simple concepts and gradually introducing complex, difficult ones is the natural process of human learning. Spiking Neural Networks (SNNs) aim to mimic the way humans process information, but current SNN models treat all samples equally, which neither aligns with the principles of human learning nor respects the biological plausibility of SNNs. To address this, we propose a CL-SNN model that introduces Curriculum Learning (CL) into SNNs, making SNNs learn more like humans and offering higher biological interpretability. CL is a training strategy that presents easier data to models before gradually introducing more challenging data, mimicking the human learning process. We use a confidence-aware loss to measure and handle samples of different difficulty levels. By learning the confidence of individual samples, the model automatically reduces the contribution of difficult samples to parameter optimization. We conducted experiments on the static image datasets MNIST, Fashion-MNIST, and CIFAR10, and on the neuromorphic datasets N-MNIST, CIFAR10-DVS, and DVS-Gesture. The results are promising. To the best of our knowledge, this is the first proposal to enhance the biological plausibility of SNNs by introducing CL.
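One common way to realize such a confidence-aware loss (a sketch under assumptions, not necessarily the paper's exact formulation) is to learn a per-sample confidence term that automatically down-weights hard samples, with a regularizer that keeps the confidences from collapsing:

```python
import torch
import torch.nn.functional as F

def confidence_aware_loss(logits, targets, log_sigma):
    """Per-sample confidence weighting (illustrative sketch).

    logits:    (N, C) network outputs (e.g., rate-decoded SNN readout)
    targets:   (N,)   class labels
    log_sigma: (N,)   learned per-sample log-confidence; larger values mark
                      harder samples and shrink their loss contribution
    """
    ce = F.cross_entropy(logits, targets, reduction="none")   # (N,)
    # exp(-log_sigma) down-weights hard samples; the +log_sigma penalty
    # discourages declaring every sample hard, so easy samples dominate early
    return (torch.exp(-log_sigma) * ce + log_sigma).mean()
```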
Stroke is the second leading cause of mortality worldwide. Immediate attention and diagnosis play a crucial role in patient prognosis, and the key to diagnosis is localizing and delineating brain lesions. Standard stroke examination protocols include an initial evaluation from a non-contrast CT (NCCT) scan to discriminate between hemorrhage and ischemia. However, NCCT may lack sensitivity for detecting subtle ischemic changes in the acute phase. As a result, complementary diffusion-weighted MRI (DWI) studies are captured to provide valuable insights that allow stroke lesions to be recovered and quantified. This work introduces APIS, the first paired public dataset with NCCT and ADC studies of acute ischemic stroke patients. APIS was presented as a challenge at the 20th IEEE International Symposium on Biomedical Imaging 2023, where researchers were invited to propose new computational strategies that leverage paired data for lesion segmentation over CT sequences. Although all teams employed specialized deep learning tools, the results suggest that ischemic stroke segmentation from NCCT remains challenging. The annotated dataset remains publicly accessible upon registration, inviting the scientific community to tackle stroke characterization from NCCT guided by paired DWI information.
Textual prompt tuning has demonstrated significant performance improvements in adapting natural language processing models to a variety of downstream tasks by treating hand-engineered prompts as trainable parameters. Inspired by this success, several studies have investigated the efficacy of visual prompt tuning. In this work, we present Visual Prompt Adaptation (VPA), the first framework that generalizes visual prompting with test-time adaptation. VPA introduces a small number of learnable tokens, enabling fully test-time and storage-efficient adaptation without requiring source-domain information. We examine our VPA design under diverse adaptation settings, encompassing single-image, batched-image, and pseudo-label adaptation, and evaluate VPA on multiple tasks, including out-of-distribution (OOD) generalization, corruption robustness, and domain adaptation. Experimental results reveal that VPA effectively enhances OOD generalization by 3.3% across various models, surpassing previous test-time approaches. Furthermore, VPA improves corruption robustness by 6.5% over strong baselines and boosts domain adaptation performance by a relative 5.2%. VPA also exhibits marked effectiveness in improving the robustness of zero-shot recognition for vision-language models.
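As a rough sketch of the mechanism (with hypothetical module names such as `patch_embed`, `vit_trunk`, and `head`, and entropy minimization assumed as the batched-image objective), VPA-style adaptation prepends a few learnable tokens to a frozen backbone's token sequence and updates only those tokens at test time:

```python
import torch

class VisualPrompt(torch.nn.Module):
    """A handful of learnable tokens prepended to a frozen ViT's sequence."""
    def __init__(self, num_tokens=8, dim=768):
        super().__init__()
        self.prompts = torch.nn.Parameter(torch.zeros(1, num_tokens, dim))

    def forward(self, patch_tokens):             # patch_tokens: (B, L, D)
        b = patch_tokens.shape[0]
        return torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)

def adapt_step(patch_embed, vit_trunk, head, prompt, images, optimizer):
    """One fully test-time step: minimize prediction entropy w.r.t. the
    prompt tokens only; the backbone and head stay frozen."""
    logits = head(vit_trunk(prompt(patch_embed(images))))
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()                  # optimizer holds only prompt.parameters()
    return logits.detach()
```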
Semi-supervised learning (SSL) has proven beneficial for mitigating the issue of limited labeled data, especially for volumetric medical image segmentation. Unlike previous SSL methods, which focus on extracting highly confident pseudo-labels or designing consistency regularization schemes, our empirical findings suggest that inconsistent decoder features emerge naturally when two decoders strive to generate consistent predictions. Based on this observation, we first analyze the value of this discrepancy in learning toward consistency, under both pseudo-labeling and consistency regularization settings, and then propose a novel SSL method called LeFeD, which learns from the feature-level discrepancy between two decoders by feeding the discrepancy to the encoder as a feedback signal. The core design of LeFeD is to enlarge the difference by training differentiated decoders and then learn from the inconsistent information iteratively. We evaluate LeFeD against eight state-of-the-art (SOTA) methods on three public datasets. Experiments show that LeFeD surpasses competitors without any bells and whistles, such as uncertainty estimation or strong constraints, and sets a new state of the art for semi-supervised medical image segmentation. Code is available at \textcolor{cyan}{https://github.com/maxwell0027/LeFeD}
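A minimal sketch of the feedback idea (hypothetical names; the decoders' interfaces and differentiation strategy are assumptions, not LeFeD's actual code): measure the feature-level discrepancy between the two decoders at matching stages and backpropagate it through the encoder.

```python
import torch

def feature_discrepancy(feats_a, feats_b):
    """L1 discrepancy accumulated over matching decoder stages (illustrative)."""
    return sum((fa - fb).abs().mean() for fa, fb in zip(feats_a, feats_b))

def unsupervised_step(encoder, decoder_a, decoder_b, volume, optimizer):
    """One step on an unlabeled volume: the two decoders are differentiated
    (e.g., different up-sampling or dropout), and their feature-level
    discrepancy is fed back to the encoder as a learning signal."""
    z = encoder(volume)
    pred_a, feats_a = decoder_a(z)   # each decoder returns (prediction, features)
    pred_b, feats_b = decoder_b(z)
    loss = feature_discrepancy(feats_a, feats_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return pred_a, pred_b
```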
In most factories today, robotic cells are deployed on firmly anchored bases to shield production accuracy from external disturbances. In contrast, we evaluate a futuristic concept in which the whole robotic cell operates on a moving platform: imagine a truck trailer moving along the motorway while exposed to heavy physical disturbances due to maneuvering. The key questions are how the robotic cell behaves and how productivity is affected. We propose a system architecture (FATHER) and present solutions, including network-related information and artificial intelligence, that make the proposed futuristic concept feasible to implement.
Modern computational natural philosophy conceptualizes the universe in terms of information and computation, establishing a framework for the study of cognition and intelligence. Despite some critiques, this computational perspective has significantly influenced our understanding of the natural world and has led to the development of AI systems such as ChatGPT, based on deep neural networks. Advancements in this domain have been facilitated by interdisciplinary research that integrates knowledge from multiple fields to simulate complex systems. Large Language Models (LLMs) such as ChatGPT exemplify this approach's capabilities, utilizing reinforcement learning from human feedback (RLHF). Current research initiatives aim to integrate neural networks with symbolic computing, introducing a new generation of hybrid computational models.
Recently, intelligent reflecting surface (IRS)-assisted communication has gained considerable attention due to its ability to extend coverage and compensate for path loss with low-cost passive metasurfaces. This paper considers uplink channel estimation for IRS-aided multiuser massive MISO communications with one-bit ADCs at the base station (BS). The use of one-bit ADCs is motivated by the low-cost and power-efficient implementation of massive antenna arrays. However, the passiveness of the IRS and the loss of signal-level information after one-bit quantization make IRS channel estimation challenging. To tackle this problem, we exploit the structured sparsity of the user-IRS-BS cascaded channels and develop three channel estimators, each utilizing the structured sparsity at a different level. Specifically, the first estimator exploits the elementwise sparsity of the cascaded channel and employs sparse Bayesian learning (SBL) to infer the channel responses via type-II maximum likelihood (ML) estimation. Because of the one-bit quantization, the type-II ML problem is in general intractable, so a variational expectation-maximization (EM) algorithm is custom-derived to iteratively compute an ML solution. The second estimator utilizes the common row-structured sparsity induced by the IRS-to-BS channel shared among the users, and develops another type-II ML solution via block SBL (BSBL) and variational EM. To further improve on BSBL, a third, two-stage estimator is proposed that exploits both the common row-structured sparsity and the column-structured sparsity arising from the limited scattering around the users. Simulation results show that the more structured sparsity is exploited, the better the estimation performance, and that the proposed estimators are superior to state-of-the-art one-bit estimators.
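To see why one-bit quantization destroys signal-level information, here is a toy simulation of the measurement model (all sizes, the sparsity level, and the noise scale are assumed for illustration): the BS observes only the signs of the in-phase and quadrature components of a noisy linear measurement of a sparse cascaded channel.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_bit_quantize(y):
    """One-bit ADCs keep only the sign of the I and Q components,
    discarding all amplitude information."""
    return np.sign(y.real) + 1j * np.sign(y.imag)

M, N = 64, 256                           # BS antennas x dictionary atoms (toy sizes)
x = np.zeros(N, dtype=complex)           # sparse cascaded-channel representation
support = rng.choice(N, size=5, replace=False)
x[support] = rng.standard_normal(5) + 1j * rng.standard_normal(5)
A = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2 * N)
noise = 0.05 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
r = one_bit_quantize(A @ x + noise)      # what the SBL/EM estimators observe
```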
Tensor decomposition methods are popular tools for analyzing multi-way datasets from social media, healthcare, spatio-temporal domains, and beyond. Widely adopted models such as Tucker and canonical polyadic decomposition (CPD) follow a data-driven philosophy: they decompose a tensor into factors that approximate the observed data well. In some cases, side information is available about the tensor modes. For example, in a temporal user-item purchases tensor, a user influence graph, an item similarity graph, and knowledge about seasonality or trends in the temporal mode may be available. Such side information may enable more succinct and interpretable tensor decomposition models and improved quality in downstream tasks. We propose a framework for Multi-Dictionary Tensor Decomposition (MDTD), which takes advantage of prior structural information about tensor modes in the form of coding dictionaries to obtain sparsely encoded tensor factors. We derive a general optimization algorithm for MDTD that handles both complete input and input with missing values, and our framework scales to the large sparse tensors typical of many real-world application domains. We demonstrate MDTD's utility via experiments with both synthetic and real-world datasets. It learns more concise models than dictionary-free counterparts and improves (i) reconstruction quality ($60\%$ fewer non-zero coefficients coupled with smaller error); (ii) missing-value imputation quality (two-fold MSE reduction with up to orders-of-magnitude time savings); and (iii) estimation of the tensor rank. MDTD's quality improvements do not come with a running-time premium: it can decompose $19$GB datasets in less than a minute, and it imputes missing values in sparse billion-entry tensors more accurately and scalably than state-of-the-art competitors.
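The modeling idea can be sketched for a 3-way tensor in a few lines (the function name and the CP form are assumptions for illustration): each mode factor is expressed as a known dictionary times a sparse code, U_n = D_n B_n, and the tensor is reconstructed from the coded factors.

```python
import numpy as np

def cp_from_coded_factors(dictionaries, codes, weights=None):
    """CP reconstruction with dictionary-coded factors (illustrative).

    dictionaries: list of D_n with shape (I_n, K_n) -- known coding dictionaries
    codes:        list of B_n with shape (K_n, R)   -- sparse codes (learned)
    weights:      optional (R,) component weights
    """
    factors = [D @ B for D, B in zip(dictionaries, codes)]   # U_n = D_n B_n
    rank = factors[0].shape[1]
    w = np.ones(rank) if weights is None else weights
    # sum of R rank-one terms: w_r * (u1_r outer u2_r outer u3_r)
    return np.einsum('r,ir,jr,kr->ijk', w, *factors)
```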
Purpose of Review: In the field of humanoid robotics, perception plays a fundamental role in enabling robots to interact seamlessly with humans and their surroundings, leading to improved safety, efficiency, and user experience. This study investigates the perception modalities and techniques employed in humanoid robots, including visual, auditory, and tactile sensing, and explores recent state-of-the-art approaches for perceiving and understanding the internal state, the environment, objects, and human activities. Recent Findings: Internal state estimation makes extensive use of Bayesian filtering methods and of optimization techniques based on maximum a-posteriori formulations, both utilizing proprioceptive sensing. In external environment understanding, with an emphasis on robustness and adaptability to dynamic, unforeseen environmental changes, the recent research discussed in this study has focused largely on multi-sensor fusion and machine learning, in contrast to hand-crafted, rule-based systems. Human-robot interaction methods have established the importance of contextual information representation and memory for understanding human intentions. Summary: This review summarizes recent developments and trends in perception for humanoid robots. Three main areas of application are identified: internal state estimation, external environment estimation, and human-robot interaction. The applications of diverse sensor modalities in each of these areas are considered, and recent significant works are discussed.
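As a pointer to the kind of Bayesian filtering these internal-state estimators build on, here is a generic Kalman filter predict/update step (a textbook sketch, not tied to any specific system in the reviewed works):

```python
import numpy as np

def kalman_step(x, P, u, z, F, B, H, Q, R):
    """One predict/update cycle of a linear Kalman filter.

    x, P: prior state mean and covariance
    u, z: control input and proprioceptive measurement (e.g., encoders, IMU)
    F, B: state-transition and control matrices; H: measurement matrix
    Q, R: process and measurement noise covariances
    """
    x_pred = F @ x + B @ u                     # predict with the motion model
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R                   # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)      # correct with the measurement
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```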