In this work, we present MotionMixer, an efficient 3D human body pose forecasting model based solely on multi-layer perceptrons (MLPs). MotionMixer learns the spatial-temporal 3D body pose dependencies by sequentially mixing both modalities. Given a stacked sequence of 3D body poses, a spatial-MLP extracts fine grained spatial dependencies of the body joints. The interaction of the body joints over time is then modelled by a temporal MLP. The spatial-temporal mixed features are finally aggregated and decoded to obtain the future motion. To calibrate the influence of each time step in the pose sequence, we make use of squeeze-and-excitation (SE) blocks. We evaluate our approach on Human3.6M, AMASS, and 3DPW datasets using the standard evaluation protocols. For all evaluations, we demonstrate state-of-the-art performance, while having a model with a smaller number of parameters. Our code is available at: https://github.com/MotionMLP/MotionMixer
Designing recommendation systems that serve content aligned with time varying preferences requires proper accounting of the feedback effects of recommendations on human behavior and psychological condition. We argue that modeling the influence of recommendations on people's preferences must be grounded in psychologically plausible models. We contribute a methodology for developing grounded dynamic preference models. We demonstrate this method with models that capture three classic effects from the psychology literature: Mere-Exposure, Operant Conditioning, and Hedonic Adaptation. We conduct simulation-based studies to show that the psychological models manifest distinct behaviors that can inform system design. Our study has two direct implications for dynamic user modeling in recommendation systems. First, the methodology we outline is broadly applicable for psychologically grounding dynamic preference models. It allows us to critique recent contributions based on their limited discussion of psychological foundation and their implausible predictions. Second, we discuss implications of dynamic preference models for recommendation systems evaluation and design. In an example, we show that engagement and diversity metrics may be unable to capture desirable recommendation system performance.
Real-world time series data often present recurrent or repetitive patterns and it is often generated in real time, such as transportation passenger volume, network traffic, system resource consumption, energy usage, and human gait. Detecting anomalous events based on machine learning approaches in such time series data has been an active research topic in many different areas. However, most machine learning approaches require labeled datasets, offline training, and may suffer from high computation complexity, consequently hindering their applicability. Providing a lightweight self-adaptive approach that does not need offline training in advance and meanwhile is able to detect anomalies in real time could be highly beneficial. Such an approach could be immediately applied and deployed on any commodity machine to provide timely anomaly alerts. To facilitate such an approach, this paper introduces SALAD, which is a Self-Adaptive Lightweight Anomaly Detection approach based on a special type of recurrent neural networks called Long Short-Term Memory (LSTM). Instead of using offline training, SALAD converts a target time series into a series of average absolute relative error (AARE) values on the fly and predicts an AARE value for every upcoming data point based on short-term historical AARE values. If the difference between a calculated AARE value and its corresponding forecast AARE value is higher than a self-adaptive detection threshold, the corresponding data point is considered anomalous. Otherwise, the data point is considered normal. Experiments based on two real-world open-source time series datasets demonstrate that SALAD outperforms five other state-of-the-art anomaly detection approaches in terms of detection accuracy. In addition, the results also show that SALAD is lightweight and can be deployed on a commodity machine.
The conversation scenario is one of the most important and most challenging scenarios for speech processing technologies because people in conversation respond to each other in a casual style. Detecting the speech activities of each person in a conversation is vital to downstream tasks, like natural language processing, machine translation, etc. People refer to the detection technology of "who speak when" as speaker diarization (SD). Traditionally, diarization error rate (DER) has been used as the standard evaluation metric of SD systems for a long time. However, DER fails to give enough importance to short conversational phrases, which are short but important on the semantic level. Also, a carefully and accurately manually-annotated testing dataset suitable for evaluating the conversational SD technologies is still unavailable in the speech community. In this paper, we design and describe the Conversational Short-phrases Speaker Diarization (CSSD) task, which consists of training and testing datasets, evaluation metric and baselines. In the dataset aspect, despite the previously open-sourced 180-hour conversational MagicData-RAMC dataset, we prepare an individual 20-hour conversational speech test dataset with carefully and artificially verified speakers timestamps annotations for the CSSD task. In the metric aspect, we design the new conversational DER (CDER) evaluation metric, which calculates the SD accuracy at the utterance level. In the baseline aspect, we adopt a commonly used method: Variational Bayes HMM x-vector system, as the baseline of the CSSD task. Our evaluation metric is publicly available at https://github.com/SpeechClub/CDER_Metric.
We consider the problem of reforming an envy-free matching when each agent is assigned a single item. Given an envy-free matching, we consider an operation to exchange the item of an agent with an unassigned item preferred by the agent that results in another envy-free matching. We repeat this operation as long as we can. We prove that the resulting envy-free matching is uniquely determined up to the choice of an initial envy-free matching, and can be found in polynomial time. We call the resulting matching a reformist envy-free matching, and then we study a shortest sequence to obtain the reformist envy-free matching from an initial envy-free matching. We prove that a shortest sequence is computationally hard to obtain even when each agent accepts at most four items and each item is accepted by at most three agents. On the other hand, we give polynomial-time algorithms when each agent accepts at most three items or each item is accepted by at most two agents. Inapproximability and fixed-parameter (in)tractability are also discussed.
Besides improving communication performance, intelligent reflecting surfaces (IRSs) are also promising enablers for achieving larger sensing coverage and enhanced sensing quality. Nevertheless, in the absence of a direct path between the base station (BS) and the targets, multi-target sensing is generally very difficult, since IRSs are incapable of proactively transmitting sensing beams or analyzing target information. Moreover, the echoes of different targets reflected via the IRS-established virtual links share the same directionality at the BS. In this paper, we study a wireless system comprising a multi-antenna BS and an IRS for multi-target sensing, where the beamforming vector and the IRS phase shifts are jointly optimized to improve the sensing performance. To meet the different sensing requirements, such as a minimum received power and a minimum sensing frequency, we propose three novel IRS-assisted sensing schemes: Time division (TD) sensing, signature sequence (SS) sensing, and hybrid TD-SS sensing. First, for TD sensing, the sensing tasks are performed in sequence over time. Subsequently, a novel signature sequence (SS) sensing scheme is proposed to improve sensing efficiency by establishing a relationship between directions and SSs. To strike a flexible balance between the beam pattern gain and sensing efficiency, we also propose a general hybrid TD-SS sensing scheme with target grouping, where targets belonging to the same group are sensed simultaneously via SS sensing, while the targets in different groups are assigned to orthogonal time slots. By controlling the number of groups, the hybrid TD-SS sensing scheme can provide a more flexible balance between beam pattern gain and sensing frequency. Moreover, ...
Time series forecasting is essential for a wide range of real-world applications. Recent studies have shown the superiority of Transformer in dealing with such problems, especially long sequence time series input(LSTI) and long sequence time series forecasting(LSTF) problems. To improve the efficiency and enhance the locality of Transformer, these studies combine Transformer with CNN in varying degrees. However, their combinations are loosely-coupled and do not make full use of CNN. To address this issue, we propose the concept of tightly-coupled convolutional Transformer(TCCT) and three TCCT architectures which apply transformed CNN architectures into Transformer: (1) CSPAttention: through fusing CSPNet with self-attention mechanism, the computation cost of self-attention mechanism is reduced by 30% and the memory usage is reduced by 50% while achieving equivalent or beyond prediction accuracy. (2) Dilated causal convolution: this method is to modify the distilling operation proposed by Informer through replacing canonical convolutional layers with dilated causal convolutional layers to gain exponentially receptive field growth. (3) Passthrough mechanism: the application of passthrough mechanism to stack of self-attention blocks helps Transformer-like models get more fine-grained information with negligible extra computation costs. Our experiments on real-world datasets show that our TCCT architectures could greatly improve the performance of existing state-of-art Transformer models on time series forecasting with much lower computation and memory costs, including canonical Transformer, LogTrans and Informer.
Terahertz (THz) wireless networks are expected to catalyze the beyond fifth generation (B5G) era. However, due to the directional nature and the line-of-sight demand of THz links, as well as the ultra-dense deployment of THz networks, a number of challenges that the medium access control (MAC) layer needs to face are created. In more detail, the need of rethinking user association and resource allocation strategies by incorporating artificial intelligence (AI) capable of providing "real-time" solutions in complex and frequently changing environments becomes evident. Moreover, to satisfy the ultra-reliability and low-latency demands of several B5G applications, novel mobility management approaches are required. Motivated by this, this article presents a holistic MAC layer approach that enables intelligent user association and resource allocation, as well as flexible and adaptive mobility management, while maximizing systems' reliability through blockage minimization. In more detail, a fast and centralized joint user association, radio resource allocation, and blockage avoidance by means of a novel metaheuristic-machine learning framework is documented, that maximizes the THz networks performance, while minimizing the association latency by approximately three orders of magnitude. To support, within the access point (AP) coverage area, mobility management and blockage avoidance, a deep reinforcement learning (DRL) approach for beam-selection is discussed. Finally, to support user mobility between coverage areas of neighbor APs, a proactive hand-over mechanism based on AI-assisted fast channel prediction is~reported.
Spatial reasoning poses a particular challenge for intelligent agents and is at the same time a prerequisite for their successful interaction and communication in the physical world. One such reasoning task is to describe the position of a target object with respect to the intrinsic orientation of some reference object via relative directions. In this paper, we introduce GRiD-A-3D, a novel diagnostic visual question-answering (VQA) dataset based on abstract objects. Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions. At the same time, model training requires considerably fewer computational resources compared with existing datasets, yet yields a comparable or even higher performance. Along with the new dataset, we provide a thorough evaluation based on two widely known end-to-end VQA architectures trained on GRiD-A-3D. We demonstrate that within a few epochs, the subtasks required to reason over relative directions, such as recognizing and locating objects in a scene and estimating their intrinsic orientations, are learned in the order in which relative directions are intuitively processed.
The widespread adoption of Electric Vehicles (EVs) is limited by their reliance on batteries with presently low energy and power densities compared to liquid fuels and are subject to aging and performance deterioration over time. For this reason, monitoring the battery State Of Charge (SOC) and State Of Health (SOH) during the EV lifetime is a very relevant problem. This work proposes a battery digital twin structure designed to accurately reflect battery dynamics at the run time. To ensure a high degree of correctness concerning non-linear phenomena, the digital twin relies on data-driven models trained on traces of battery evolution over time: a SOH model, repeatedly executed to estimate the degradation of maximum battery capacity, and a SOC model, retrained periodically to reflect the impact of aging. The proposed digital twin structure will be exemplified on a public dataset to motivate its adoption and prove its effectiveness, with high accuracy and inference and retraining times compatible with onboard execution.