We propose Polyffusion, a diffusion model that generates polyphonic music scores by regarding music as image-like piano roll representations. The model is capable of controllable music generation with two paradigms: internal control and external control. Internal control refers to the process in which users pre-define a part of the music and then let the model infill the rest, similar to the task of masked music generation (or music inpainting). External control conditions the model with external yet related information, such as chord, texture, or other features, via the cross-attention mechanism. We show that by using internal and external controls, Polyffusion unifies a wide range of music creation tasks, including melody generation given accompaniment, accompaniment generation given melody, arbitrary music segment inpainting, and music arrangement given chords or textures. Experimental results show that our model significantly outperforms existing Transformer and sampling-based baselines, and using pre-trained disentangled representations as external conditions yields more effective controls.
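The internal-control idea described above can be illustrated with a toy mask over a piano roll. This is a minimal sketch, not the authors' sampling procedure: the `roll[time_step][pitch]` layout, the `infill` helper, and the hard-coded `model_output` are all hypothetical stand-ins, and the abstract does not say how the fixed part is enforced during diffusion sampling.

```python
from typing import List

Roll = List[List[int]]  # hypothetical layout: roll[time_step][pitch]


def infill(model_output: Roll, user_roll: Roll, mask: Roll) -> Roll:
    """Keep user-specified cells (mask == 1); take the model's cells elsewhere."""
    return [
        [u if m else g for g, u, m in zip(g_row, u_row, m_row)]
        for g_row, u_row, m_row in zip(model_output, user_roll, mask)
    ]


# Toy example: 2 time steps x 4 pitches. The user fixes the two lowest
# pitches (say, an accompaniment) and leaves the upper two to the model.
user_roll = [[1, 0, 0, 0], [0, 1, 0, 0]]
mask = [[1, 1, 0, 0], [1, 1, 0, 0]]
model_output = [[0, 0, 1, 0], [1, 1, 0, 1]]  # stand-in for a generated sample

combined = infill(model_output, user_roll, mask)
```

Changing only the mask switches between the tasks listed in the abstract: masking the lower pitches corresponds to melody generation given accompaniment, masking the upper pitches to the reverse, and masking a time range to segment inpainting.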
Graph learning has a wide range of applications, many of which demand strong data privacy. Federated learning is an emerging distributed machine learning approach that leverages data from individual devices or data centers to improve the accuracy and generalization of the model, while also protecting the privacy of user data. Graph-federated learning is mainly based on the classical federated learning framework, i.e., the Client-Server framework. However, the Client-Server framework faces problems such as a single point of failure at the central server and poor scalability of the network topology. First, we introduce a decentralized framework to graph-federated learning. Second, we determine the confidence among nodes based on the similarity of their data, and then aggregate the gradient information by linear weighting based on this confidence. Finally, the proposed method is compared with FedAvg, FedProx, GCFL, and GCFL+ to verify its effectiveness. Experiments demonstrate that the proposed method outperforms these methods.
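The confidence-weighted aggregation step can be sketched as follows. The abstract does not specify the similarity measure or what data summary the nodes exchange, so the cosine similarity, the per-node summary vectors, and all function names here are assumptions made purely for illustration.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def confidence_weights(
    own_summary: List[float], neighbour_summaries: Dict[str, List[float]]
) -> Dict[str, float]:
    """Turn (assumed) data-similarity scores into weights that sum to 1."""
    sims = {k: max(cosine(own_summary, s), 0.0) for k, s in neighbour_summaries.items()}
    total = sum(sims.values())
    if total == 0.0:  # no similar neighbour: fall back to uniform weights
        return {k: 1.0 / len(sims) for k in sims}
    return {k: v / total for k, v in sims.items()}


def aggregate(gradients: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Linear weighting of neighbour gradients by confidence."""
    dim = len(next(iter(gradients.values())))
    return [sum(weights[k] * gradients[k][i] for k in gradients) for i in range(dim)]


# Toy usage: neighbour "a" has data like ours, neighbour "b" does not.
weights = confidence_weights([1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0]})
update = aggregate({"a": [2.0, 4.0], "b": [10.0, 10.0]}, weights)
```

In a decentralized setting each node would run this locally over its own neighbours, so no central server is involved in the aggregation.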
With social media usage growing exponentially in the past few years, fake news has also become extremely prevalent. The detrimental impact of fake news emphasizes the need for research focused on automating the detection of false information and verifying its accuracy. In this work, we present the outcome of the Factify 2 shared task, which provides a multi-modal fact verification and satire news dataset, as part of the DeFactify 2 workshop at AAAI'23. The data calls for a comparison-based approach to the task: social media claims are paired with supporting documents, each containing both text and image, and the pairs are divided into 5 classes based on multi-modal relations. In the second iteration of this task we had over 60 participants and 9 final test-set submissions. The best performances came from the use of DeBERTa for text and Swinv2 and CLIP for image. The highest F1 score averaged over all five classes was 81.82%.
Large offline learning-based models have enabled robots to successfully interact with objects for a wide variety of tasks. However, these models rely on fairly consistent structured environments. For more unstructured environments, an online learning component is necessary to gather and estimate information about objects in the environment in order to successfully interact with them. Unfortunately, online learning methods like Bayesian non-parametric models struggle with changes in the environment, which is often the desired outcome of interaction-based tasks. We propose using an object-centric representation for interactive online learning. This representation is generated by transforming the robot's actions into the object's coordinate frame. We demonstrate how switching to this task-relevant space improves our ability to reason with the training data collected online, enabling scalable online learning of robot-object interactions. We showcase our method by successfully navigating a manipulator arm through an environment with multiple unknown objects without violating interaction-based constraints.
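The change of coordinates behind the object-centric representation can be sketched in 2D. This is only an illustration under stated assumptions: the planar pose `(x, y, theta)` and the function name `to_object_frame` are hypothetical, and the abstract does not describe the actual dimensionality or parameterisation used.

```python
import math
from typing import Tuple


def to_object_frame(
    point: Tuple[float, float], object_pose: Tuple[float, float, float]
) -> Tuple[float, float]:
    """Express a world-frame point in the frame of an object at (x, y, theta)."""
    px, py = point
    ox, oy, theta = object_pose
    dx, dy = px - ox, py - oy
    c, s = math.cos(theta), math.sin(theta)
    # Apply the inverse rotation R(-theta) to the translated point.
    return (c * dx + s * dy, -s * dx + c * dy)


# The same relative contact, before and after the object has been moved:
before = to_object_frame((1.0, 2.0), (1.0, 1.0, math.pi / 2))
after = to_object_frame((5.0, 4.0), (5.0, 5.0, -math.pi / 2))
```

Both calls describe a point one unit along the object's own x-axis, so they map to (approximately) the same coordinates even though the object has moved and rotated in the world. That invariance is what lets data collected before an interaction remain usable afterwards.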
Extracting workflow nets from textual descriptions can be used to simplify guidelines or formalize textual descriptions of formal processes like business processes and algorithms. The task of manually extracting processes, however, requires domain expertise and effort. While automatic process model extraction is desirable, annotating texts with formalized process models is expensive. Therefore, there are only a few machine-learning-based extraction approaches. Rule-based approaches, in turn, require domain specificity to work well and can rarely distinguish relevant from irrelevant information in textual descriptions. In this paper, we present GUIDO, a hybrid approach to the process model extraction task that first classifies sentences regarding their relevance to the process model, using a BERT-based sentence classifier, and then extracts a process model from the sentences classified as relevant, using dependency parsing. The presented approach achieves significantly better results than a pure rule-based approach. GUIDO achieves an average behavioral similarity score of $0.93$. Still, in comparison to purely machine-learning-based approaches, the annotation costs stay low.
Item Response Theory (IRT) is a well-known method for assessing responses from humans in education and psychology. In education, IRT is used to infer student abilities and characteristics of test items from student responses. Interactions with students are expensive, calling for methods that efficiently gather information for inferring student abilities. Methods based on Optimal Experimental Design (OED) are computationally costly, making them inapplicable for interactive applications. In response, we propose incorporating amortised experimental design into IRT. Here, the computational cost is shifted to a precomputing phase by training a Deep Reinforcement Learning (DRL) agent with synthetic data. The agent is trained to select optimally informative test items for the distribution of students, and to conduct amortised inference conditioned on the experiment outcomes. During deployment, the agent estimates parameters from data and suggests the next test item for the student in close to real time, taking into account the history of experiments and outcomes.
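For readers unfamiliar with IRT, the kind of item selection that the DRL agent is meant to amortise can be sketched with a classical baseline. Everything here is an assumption for illustration: the abstract does not name the IRT model, so the two-parameter logistic (2PL) model, the Fisher-information criterion, and the function names are stand-ins rather than the proposed method.

```python
import math
from typing import List, Tuple

Item = Tuple[float, float]  # (discrimination a, difficulty b), assumed 2PL parameters


def p_correct(theta: float, item: Item) -> float:
    """2PL probability that a student with ability theta answers correctly."""
    a, b = item
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta: float, item: Item) -> float:
    """Fisher information of a 2PL item about theta: a^2 * P * (1 - P)."""
    a, _ = item
    p = p_correct(theta, item)
    return a * a * p * (1.0 - p)


def most_informative(theta_estimate: float, items: List[Item]) -> int:
    """Index of the item with the highest information at the current estimate."""
    return max(range(len(items)), key=lambda i: fisher_information(theta_estimate, items[i]))


items = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]
chosen = most_informative(0.1, items)
```

With equal discriminations, the item whose difficulty is closest to the current ability estimate is the most informative. A greedy criterion like this is cheap, but full OED over sequences of items is not, which is the cost the abstract proposes to move into a training phase.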
Autonomous driving requires an accurate understanding of the local scene. To this end, autonomous agents deploy object detection and online BEV lane graph extraction methods as a part of their perception stack. In this work, we propose an architecture and loss formulation to improve the accuracy of local lane graph estimates by using 3D object detection outputs. The proposed method learns to assign the objects to centerlines by considering the centerlines as cluster centers and the objects as data points to be assigned a probability distribution over the cluster centers. This training scheme ensures direct supervision on the relationship between lanes and objects, thus leading to better performance. The proposed method improves lane graph estimation substantially over state-of-the-art methods. The extensive ablations show that our method can achieve significant performance improvements by using the outputs of existing 3D object detection methods. Since our method uses the detection outputs rather than the intermediate representations of a detection method, a single model of our method can use any detection method at test time.
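The phrase "a probability distribution over the cluster centers" can be made concrete with a hand-written soft assignment. In the abstract this assignment is learned; the distance-based softmax below, the `temperature` parameter, and the vertex-only distance are hypothetical simplifications used only to show the shape of the output.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def assignment_probs(
    obj: Point, centerlines: List[List[Point]], temperature: float = 1.0
) -> List[float]:
    """Softmax over negative distances from an object to each centerline.

    Distance is taken to the nearest polyline vertex, a simplification.
    """
    distances = [min(math.dist(obj, v) for v in line) for line in centerlines]
    logits = [-d / temperature for d in distances]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two parallel BEV centerlines and one detected object close to the first.
lanes = [
    [(0.0, 0.0), (0.0, 5.0), (0.0, 10.0)],
    [(3.5, 0.0), (3.5, 5.0), (3.5, 10.0)],
]
probs = assignment_probs((0.4, 5.0), lanes)
```

The result is one probability per centerline, summing to 1, with most of the mass on the nearer lane. Because only detected object positions are needed as input, a sketch like this is agnostic to which detector produced them, matching the test-time property the abstract describes.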
Event-based cameras have become increasingly popular for tracking fast-moving objects due to their high temporal resolution, low latency, and high dynamic range. In this paper, we propose a novel algorithm for tracking event blobs using raw events asynchronously in real time. We introduce the concept of an event blob as a spatio-temporal likelihood of event occurrence where the conditional spatial likelihood is blob-like. Many real-world objects generate event blob data, for example, flickering LEDs such as car headlights or any small foreground object moving against a static or slowly varying background. The proposed algorithm uses a nearest neighbour classifier with a dynamic threshold criterion for data association, coupled with a Kalman filter to track the event blob state. Our algorithm achieves highly accurate tracking and event blob shape estimation even under challenging lighting conditions and high-speed motions. The microsecond time resolution achieved means that the filter output can be used to derive secondary information such as time-to-contact or range estimation, which will enable applications to real-world problems such as collision avoidance in autonomous driving.
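The per-event pipeline of nearest-neighbour association followed by a Kalman update can be sketched in a heavily simplified form. This is not the authors' filter: the position-only state, the isotropic scalar variance, the random-walk motion model, and the parameters `q`, `r`, and `gate` are all assumptions, and the actual state (which the abstract says includes blob shape) is richer.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Blob:
    x: float
    y: float
    var: float  # isotropic position variance (assumed simplification)


def process_event(
    blobs: List[Blob], ex: float, ey: float,
    q: float = 0.01, r: float = 1.0, gate: float = 3.0,
) -> Optional[int]:
    """Associate one event with its nearest blob and update that blob.

    Returns the index of the updated blob, or None if the event is rejected.
    """
    if not blobs:
        return None
    # Nearest-neighbour association.
    idx = min(range(len(blobs)), key=lambda i: math.hypot(ex - blobs[i].x, ey - blobs[i].y))
    blob = blobs[idx]
    var_pred = blob.var + q  # predict step of a random-walk model
    # Dynamic threshold: the gate widens as the blob's uncertainty grows.
    if math.hypot(ex - blob.x, ey - blob.y) > gate * math.sqrt(var_pred + r):
        return None
    k = var_pred / (var_pred + r)  # scalar Kalman gain
    blob.x += k * (ex - blob.x)
    blob.y += k * (ey - blob.y)
    blob.var = (1.0 - k) * var_pred
    return idx


blobs = [Blob(0.0, 0.0, 1.0), Blob(50.0, 50.0, 1.0)]
hit = process_event(blobs, 1.0, 0.0)    # close to the first blob
miss = process_event(blobs, 25.0, 25.0)  # far from both blobs
```

Because each event is processed on arrival, the state estimate is refreshed at the event rate, which is the property the abstract relies on for deriving quantities such as time-to-contact.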
Colonoscopic Polyp Re-Identification aims to match a specific polyp in a large gallery captured with different cameras and views, and plays a key role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, traditional methods mainly focus on visual representation learning and neglect the potential of semantic features during training, which can easily lead to poor generalization when the pretrained model is adapted to new scenarios. To address this problem, we propose a simple but effective training method named VT-ReID, which can remarkably enrich the representation of polyp videos through the interchange of high-level semantic information. Moreover, we design a novel clustering mechanism to introduce prior knowledge from textual data, which leverages contrastive learning to promote better separation using abundant unlabeled text data. To the best of our knowledge, this is the first attempt to employ visual-text features with a clustering mechanism for colonoscopic polyp re-identification. Empirical results show that our method outperforms current state-of-the-art methods by a clear margin.
In frequency division duplexing (FDD) cell-free massive MIMO, the acquisition of the channel state information (CSI) is very challenging because of the large overhead required for the training and feedback of the downlink channels of multiple cooperating base stations (BSs). In this paper, for systems with partial uplink-downlink channel reciprocity, and a general spatial domain channel model with variations in the average port power and correlation among port coefficients, we propose a joint-port-selection-based CSI acquisition and feedback scheme for the downlink transmission with zero-forcing precoding. The scheme uses an eigenvalue-decomposition-based transformation to reduce the feedback overhead by exploiting the port correlation. We derive the sum-rate of the system for any port selection. Based on the sum-rate result, we propose a low-complexity greedy-search-based joint port selection (GS-JPS) algorithm. Moreover, to adapt to fast time-varying scenarios, a supervised deep learning-enhanced joint port selection (DL-JPS) algorithm is proposed. Simulations verify the effectiveness of our proposed schemes and their advantage over existing port-selection channel acquisition schemes.