In recent years, multitudes of researches have applied deep learning to automatic sleep stage classification. Whereas actually, these works have paid less attention to the issue of cross-subject in sleep staging. At the same time, emerging neuroscience theories on inter-subject correlations can provide new insights for cross-subject analysis. This paper presents the MViTime model that have been used in sleep staging study. And we implement the inter-subject correlation theory through contrastive learning, providing a feasible solution to address the cross-subject problem in sleep stage classification. Finally, experimental results and conclusions are presented, demonstrating that the developed method has achieved state-of-the-art performance on sleep staging. The results of the ablation experiment also demonstrate the effectiveness of the cross-subject approach based on contrastive learning.
Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly +60%, making it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state-of-the-art on NQ-open. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistent with human judgments, although still suffering from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle in detecting hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.
Instance segmentation of point clouds is a crucial task in 3D field with numerous applications that involve localizing and segmenting objects in a scene. However, achieving satisfactory results requires a large number of manual annotations, which is a time-consuming and expensive process. To alleviate dependency on annotations, we propose a method, called FreePoint, for underexplored unsupervised class-agnostic instance segmentation on point clouds. In detail, we represent the point features by combining coordinates, colors, normals, and self-supervised deep features. Based on the point features, we perform a multicut algorithm to segment point clouds into coarse instance masks as pseudo labels, which are used to train a point cloud instance segmentation model. To alleviate the inaccuracy of coarse masks during training, we propose a weakly-supervised training strategy and corresponding loss. Our work can also serve as an unsupervised pre-training pretext for supervised semantic instance segmentation with limited annotations. For class-agnostic instance segmentation on point clouds, FreePoint largely fills the gap with its fully-supervised counterpart based on the state-of-the-art instance segmentation model Mask3D and even surpasses some previous fully-supervised methods. When serving as a pretext task and fine-tuning on S3DIS, FreePoint outperforms training from scratch by 5.8% AP with only 10% mask annotations.
Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts the inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which achieve speech synthesis through a single diffusion sampling step while achieving high audio quality. The consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performances in the distilled CoMoSpeech. Our experiments show that by generating audio recordings by a single sampling step, the CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step sampling based CoMoSpeech achieves the best inference speed with better or comparable audio quality to other conventional multi-step diffusion model baselines. Audio samples are available at https://comospeech.github.io/.
Vibration-based condition monitoring systems are receiving increasing attention due to their ability to accurately identify different conditions by capturing dynamic features over a broad frequency range. However, there is little research on clustering approaches in vibration data and the resulting solutions are often optimized for a single data set. In this work, we present an extensive comparison of the clustering algorithms K-means clustering, OPTICS, and Gaussian mixture model clustering (GMM) applied to statistical features extracted from the time and frequency domains of vibration data sets. Furthermore, we investigate the influence of feature combinations, feature selection using principal component analysis (PCA), and the specified number of clusters on the performance of the clustering algorithms. We conducted this comparison in terms of a grid search using three different benchmark data sets. Our work showed that averaging (Mean, Median) and variance-based features (Standard Deviation, Interquartile Range) performed significantly better than shape-based features (Skewness, Kurtosis). In addition, K-means outperformed GMM slightly for these data sets, whereas OPTICS performed significantly worse. We were also able to show that feature combinations as well as PCA feature selection did not result in any significant performance improvements. With an increase in the specified number of clusters, clustering algorithms performed better, although there were some specific algorithmic restrictions.
Federated learning (FL), which addresses data privacy issues by training models on resource-constrained mobile devices in a distributed manner, has attracted significant research attention. However, the problem of optimizing FL client selection in mobile federated learning networks (MFLNs), where devices move in and out of each others' coverage and no FL server knows all the data owners, remains open. To bridge this gap, we propose a first-of-its-kind \underline{Soc}ially-aware \underline{Fed}erated \underline{C}lient \underline{S}election (SocFedCS) approach to minimize costs and train high-quality FL models. SocFedCS enriches the candidate FL client pool by enabling data owners to propagate FL task information through their local networks of trust, even as devices are moving into and out of each others' coverage. Based on Lyapunov optimization, we first transform this time-coupled problem into a step-by-step optimization problem. Then, we design a method based on alternating minimization and self-adaptive global best harmony search to solve this mixed-integer optimization problem. Extensive experiments comparing SocFedCS against five state-of-the-art approaches based on four real-world multimedia datasets demonstrate that it achieves 2.06\% higher test accuracy and 12.24\% lower cost on average than the best-performing baseline.
Socially assistive robots are increasingly being explored to improve the engagement of older adults and people with disability in health and well-being-related exercises. However, even if people have various physical conditions, most prior work on social robot exercise coaching systems has utilized generic, predefined feedback. The deployment of these systems still remains a challenge. In this paper, we present our work of iteratively engaging therapists and post-stroke survivors to design, develop, and evaluate a social robot exercise coaching system for personalized rehabilitation. Through interviews with therapists, we designed how this system interacts with the user and then developed an interactive social robot exercise coaching system. This system integrates a neural network model with a rule-based model to automatically monitor and assess patients' rehabilitation exercises and can be tuned with individual patient's data to generate real-time, personalized corrective feedback for improvement. With the dataset of rehabilitation exercises from 15 post-stroke survivors, we demonstrated our system significantly improves its performance to assess patients' exercises while tuning with held-out patient's data. In addition, our real-world evaluation study showed that our system can adapt to new participants and achieved 0.81 average performance to assess their exercises, which is comparable to the experts' agreement level. We further discuss the potential benefits and limitations of our system in practice.
Nowadays, people use electricity in all aspects of their lives so that electricity consumption increases gradually. There can be wastage of electricity due to various reasons, such as human negligence, daylighting, etc. Hence, conservation of energy is the need of the day. This paper deals with the fabrication of an "Automated Power Conservation System (APCS)" that has multiple benefits like saving on power consumption there by saving on electricity bills of the organization, eliminating human involvement and manpower which is often required to manually toggle the lights and electrical devices on/off, and last but most importantly conserve the precious natural resources by reducing electrical energy consumption. Two IR sensors are used in this project and these two sensors are used for detecting the presence of a person in the classroom. When the existence of the person is detected by the APCS it automatically turns on the fans and lights in that classroom and during the absence they will be automatically turned off, thus paving the easiest way to conserve power. This hardware is integrated with the Android app, where the user can get data on his smartphone regarding the number of fans and lights that are turned on at a particular instance of time. The user can also switch on/off the fans and lights from anywhere in the world by using the Android App.
Accurate, real-time measurements of price index changes using electronic records are essential for tracking inflation and productivity in today's economic environment. We develop empirical hedonic models that can process large amounts of unstructured product data (text, images, prices, quantities) and output accurate hedonic price estimates and derived indices. To accomplish this, we generate abstract product attributes, or ``features,'' from text descriptions and images using deep neural networks, and then use these attributes to estimate the hedonic price function. Specifically, we convert textual information about the product to numeric features using large language models based on transformers, trained or fine-tuned using product descriptions, and convert the product image to numeric features using a residual network model. To produce the estimated hedonic price function, we again use a multi-task neural network trained to predict a product's price in all time periods simultaneously. To demonstrate the performance of this approach, we apply the models to Amazon's data for first-party apparel sales and estimate hedonic prices. The resulting models have high predictive accuracy, with $R^2$ ranging from $80\%$ to $90\%$. Finally, we construct the AI-based hedonic Fisher price index, chained at the year-over-year frequency. We contrast the index with the CPI and other electronic indices.
In order to advance underwater computer vision and robotics from lab environments and clear water scenarios to the deep dark ocean or murky coastal waters, representative benchmarks and realistic datasets with ground truth information are required. In particular, determining the camera pose is essential for many underwater robotic or photogrammetric applications and known ground truth is mandatory to evaluate the performance of e.g., simultaneous localization and mapping approaches in such extreme environments. This paper presents the conception, calibration and implementation of an external reference system for determining the underwater camera pose in real-time. The approach, based on an HTC Vive tracking system in air, calculates the underwater camera pose by fusing the poses of two controllers tracked above the water surface of a tank. It is shown that the mean deviation of this approach to an optical marker based reference in air is less than 3 mm and 0.3{\deg}. Finally, the usability of the system for underwater applications is demonstrated.