We study the algorithmic problem of sparse mean estimation in the presence of adversarial outliers. Specifically, the algorithm observes a \emph{corrupted} set of samples from $\mathcal{N}(\mu,\mathbf{I}_d)$, where the unknown mean $\mu \in \mathbb{R}^d$ is constrained to be $k$-sparse. A series of prior works has developed efficient algorithms for robust sparse mean estimation with sample complexity $\mathrm{poly}(k,\log d, 1/\epsilon)$ and runtime $d^2 \mathrm{poly}(k,\log d,1/\epsilon)$, where $\epsilon$ is the fraction of contamination. In particular, the fastest runtime of existing algorithms is quadratic ($\Omega(d^2)$), which can be prohibitive in high dimensions. This quadratic barrier in the runtime stems from the reliance of these algorithms on the sample covariance matrix, which is of size $d^2$. Our main contribution is an algorithm for robust sparse mean estimation which runs in \emph{subquadratic} time using $\mathrm{poly}(k,\log d,1/\epsilon)$ samples. We also provide analogous results for robust sparse PCA. Our results build on algorithmic advances in detecting weak correlations, a generalized version of the light-bulb problem by Valiant.
In visual recognition, scale bias is a key challenge due to the imbalance of object and image size distribution inherent in real scene datasets. Conventional solutions involve injecting scale invariance priors, oversampling the dataset at different scales during training, or adjusting scale at inference. While these strategies mitigate scale bias to some extent, their ability to adapt across diverse datasets is limited. Besides, they increase computational load during training and latency during inference. In this work, we use adaptive attentional processing -- oversampling salient object regions by warping images in-place during training. Discovering that shifting the source scale distribution improves backbone features, we developed a instance-level warping guidance aimed at object region sampling to mitigate source scale bias in domain adaptation. Our approach improves adaptation across geographies, lighting and weather conditions, is agnostic to the task, domain adaptation algorithm, saliency guidance, and underlying model architecture. Highlights include +6.1 mAP50 for BDD100K Clear $\rightarrow$ DENSE Foggy, +3.7 mAP50 for BDD100K Day $\rightarrow$ Night, +3.0 mAP50 for BDD100K Clear $\rightarrow$ Rainy, and +6.3 mIoU for Cityscapes $\rightarrow$ ACDC. Our approach adds minimal memory during training and has no additional latency at inference time. Please see Appendix for more results and analysis.
This paper introduces Spatial Diagrammatic Instructions (SDIs), an approach for human operators to specify objectives and constraints that are related to spatial regions in the working environment. Human operators are enabled to sketch out regions directly on camera images that correspond to the objectives and constraints. These sketches are projected to 3D spatial coordinates, and continuous Spatial Instruction Maps (SIMs) are learned upon them. These maps can then be integrated into optimization problems for tasks of robots. In particular, we demonstrate how Spatial Diagrammatic Instructions can be applied to solve the Base Placement Problem of mobile manipulators, which concerns the best place to put the manipulator to facilitate a certain task. Human operators can specify, via sketch, spatial regions of interest for a manipulation task and permissible regions for the mobile manipulator to be at. Then, an optimization problem that maximizes the manipulator's reachability, or coverage, over the designated regions of interest while remaining in the permissible regions is solved. We provide extensive empirical evaluations, and show that our formulation of Spatial Instruction Maps provides accurate representations of user-specified diagrammatic instructions. Furthermore, we demonstrate that our diagrammatic approach to the Mobile Base Placement Problem enables higher quality solutions and faster run-time.
Contrastive Learning (CL) has emerged as one of the most successful paradigms for unsupervised visual representation learning, yet it often depends on intensive manual data augmentations. With the rise of generative models, especially diffusion models, the ability to generate realistic images close to the real data distribution has been well recognized. These generated high-equality images have been successfully applied to enhance contrastive representation learning, a technique termed ``data inflation''. However, we find that the generated data (even from a good diffusion model like DDPM) may sometimes even harm contrastive learning. We investigate the causes behind this failure from the perspective of both data inflation and data augmentation. For the first time, we reveal the complementary roles that stronger data inflation should be accompanied by weaker augmentations, and vice versa. We also provide rigorous theoretical explanations for these phenomena via deriving its generalization bounds under data inflation. Drawing from these insights, we propose Adaptive Inflation (AdaInf), a purely data-centric strategy without introducing any extra computation cost. On benchmark datasets, AdaInf can bring significant improvements for various contrastive learning methods. Notably, without using external data, AdaInf obtains 94.70% linear accuracy on CIFAR-10 with SimCLR, setting a new record that surpasses many sophisticated methods. Code is available at https://github.com/PKU-ML/adainf.
Lithic Use-Wear Analysis (LUWA) using microscopic images is an underexplored vision-for-science research area. It seeks to distinguish the worked material, which is critical for understanding archaeological artifacts, material interactions, tool functionalities, and dental records. However, this challenging task goes beyond the well-studied image classification problem for common objects. It is affected by many confounders owing to the complex wear mechanism and microscopic imaging, which makes it difficult even for human experts to identify the worked material successfully. In this paper, we investigate the following three questions on this unique vision task for the first time:(i) How well can state-of-the-art pre-trained models (like DINOv2) generalize to the rarely seen domain? (ii) How can few-shot learning be exploited for scarce microscopic images? (iii) How do the ambiguous magnification and sensing modality influence the classification accuracy? To study these, we collaborated with archaeologists and built the first open-source and the largest LUWA dataset containing 23,130 microscopic images with different magnifications and sensing modalities. Extensive experiments show that existing pre-trained models notably outperform human experts but still leave a large gap for improvements. Most importantly, the LUWA dataset provides an underexplored opportunity for vision and learning communities and complements existing image classification problems on common objects.
Soldiers in the field often need to cross negative obstacles, such as rivers or canyons, to reach goals or safety. Military gap crossing involves on-site temporary bridges construction. However, this procedure is conducted with dangerous, time and labor intensive operations, and specialized machinery. We envision a scalable robotic solution inspired by advancements in force-controlled and Cable Driven Parallel Robots (CDPRs); this solution can address the challenges inherent in this transportation problem, achieving fast, efficient, and safe deployment and field operations. We introduce the embodied vision in Co3MaNDR, a solution to the military gap crossing problem, a distributed robot consisting of several modules simultaneously pulling on a central payload, controlling the cables' tensions to achieve complex objectives, such as precise trajectory tracking or force amplification. Hardware experiments demonstrate teleoperation of a payload, trajectory following, and the sensing and amplification of operators' applied physical forces during slow operations. An operator was shown to manipulate a 27.2 kg (60 lb) payload with an average force utilization of 14.5\% of its weight. Results indicate that the system can be scaled up to heavier payloads without compromising performance or introducing superfluous complexity. This research lays a foundation to expand CDPR technology to uncoordinated and unstable mobile platforms in unknown environments.
Vision-language models (VLMs) are achieving increasingly strong performance on multimodal tasks. However, reasoning capabilities remain limited particularly for smaller VLMs, while those of large-language models (LLMs) have seen numerous improvements. We propose a technique to transfer capabilities from LLMs to VLMs. On the recently introduced ChartQA, our method obtains state-of-the-art performance when applied on the PaLI3-5B VLM by \citet{chen2023pali3}, while also enabling much better performance on PlotQA and FigureQA. We first improve the chart representation by continuing the pre-training stage using an improved version of the chart-to-table translation task by \citet{liu2023deplot}. We then propose constructing a 20x larger dataset than the original training set. To improve general reasoning capabilities and improve numerical operations, we synthesize reasoning traces using the table representation of charts. Lastly, our model is fine-tuned using the multitask loss introduced by \citet{hsieh2023distilling}. Our variant ChartPaLI-5B outperforms even 10x larger models such as PaLIX-55B without using an upstream OCR system, while keeping inference time constant compared to the PaLI3-5B baseline. When rationales are further refined with a simple program-of-thought prompt \cite{chen2023program}, our model outperforms the recently introduced Gemini Ultra and GPT-4V.
Time-frequency representation (TFR) is often used for non-stationary signal analysis. The most intuitive and interpretable TFR is the spectrogram. Recently, a concept of non-negative matrix factorization (NMF) has been successfully applied to local damage detection in rolling elements of bearings via spectrogram factorization. NMF applied to the spectrogram allows one to find an informative frequency band, which could be further used as a filter characteristic. However, the obtained filter characteristics mostly detect the informative frequency band, which also encompasses a lot of noise. In the case where noise is more problematic, as is the case for acoustic signals from industrial machines, the NMF hardly detects the damage. To solve this problem and obtain more selective filters, which are more robust to noise, we propose the non-negative matrix under-approximation (NMU) as an informative frequency band selector. Due to the more sparse parts-based representation of the NMU compared to NMF, NMU provides more selective filter characteristics, which neglect the non-informative frequency bands related to the noise. In practice, it means that NMU gives a better signal-to-noise ratio for the filtered signal. The efficiency of the proposed approach has been validated on the vibration signal from the test rig and the acoustic signal from an idler.
Events describe the state changes of entities. In a document, multiple events are connected by various relations (e.g., Coreference, Temporal, Causal, and Subevent). Therefore, obtaining the connections between events through Event-Event Relation Extraction (ERE) is critical to understand natural language. There are two main problems in the current ERE works: a. Only embeddings of the event triggers are used for event feature representation, ignoring event arguments (e.g., time, place, person, etc.) and their structure within the event. b. The interconnection between relations (e.g., temporal and causal relations usually interact with each other ) is ignored. To solve the above problems, this paper proposes a jointly multiple ERE framework called GraphERE based on Graph-enhanced Event Embeddings. First, we enrich the event embeddings with event argument and structure features by using static AMR graphs and IE graphs; Then, to jointly extract multiple event relations, we use Node Transformer and construct Task-specific Dynamic Event Graphs for each type of relation. Finally, we used a multi-task learning strategy to train the whole framework. Experimental results on the latest MAVEN-ERE dataset validate that GraphERE significantly outperforms existing methods. Further analyses indicate the effectiveness of the graph-enhanced event embeddings and the joint extraction strategy.
Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in either 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without uncompromising visual quality. With an inference time of only 2-5 minutes, this framework achieves better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.