Abstract:When constructing portfolios, a key problem is that many financial time series are sparse, which makes machine learning methods difficult to apply. PolyModel theory can address this issue and demonstrates advantages in portfolio construction from several perspectives. To implement PolyModel theory for constructing a hedge fund portfolio, we begin by identifying an asset pool of over 10,000 hedge funds with data covering the past 29 years. PolyModel theory also involves choosing a wide-ranging set of risk factors, which includes various financial indices, currencies, and commodity prices. This comprehensive selection mirrors the complexities of the real-world environment. Leveraging PolyModel theory, we create quantitative measures such as Long-term Alpha, Long-term Ratio, and SVaR. We also use more classical measures such as the Sharpe ratio and Morningstar's MRAR. To enhance the performance of the constructed portfolio, we also employ a recent deep learning technique (iTransformer) to capture the upward trend while efficiently controlling the downside, using all the features. The iTransformer model is specifically designed to address the challenges of high-dimensional time series forecasting and can substantially improve our strategies. More precisely, our strategies achieve a better Sharpe ratio and annualized return. This process enables us to create multiple portfolio strategies that aim for high returns and low risk when compared with various benchmarks.
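As a minimal sketch of the two classical evaluation measures named in the abstract above (Sharpe ratio and annualized return), the following assumes simple monthly fund returns and a constant monthly risk-free rate. The function names and the monthly frequency are illustrative assumptions, not details taken from the paper.

```python
import math
import statistics


def annualized_return(monthly_returns):
    """Geometric annualized return from simple monthly returns (assumed monthly data)."""
    if not monthly_returns:
        raise ValueError("need at least one return")
    growth = 1.0
    for r in monthly_returns:
        growth *= 1.0 + r  # compound each month's return
    years = len(monthly_returns) / 12.0
    return growth ** (1.0 / years) - 1.0


def sharpe_ratio(monthly_returns, monthly_risk_free=0.0):
    """Annualized Sharpe ratio: mean excess return over its sample std, scaled by sqrt(12)."""
    excess = [r - monthly_risk_free for r in monthly_returns]
    std = statistics.stdev(excess)  # sample standard deviation, needs >= 2 points
    if std == 0.0:
        raise ValueError("zero volatility: Sharpe ratio is undefined")
    return statistics.mean(excess) / std * math.sqrt(12.0)
```

Calling `sharpe_ratio(returns)` and `annualized_return(returns)` on a strategy's return series and on each benchmark's series gives the kind of side-by-side comparison the abstract refers to.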
Abstract:Reconstructing 3D hand-face interactions with deformations from a single image is a challenging yet crucial task with broad applications in AR, VR, and gaming. The challenges stem from self-occlusions during single-view hand-face interactions, diverse spatial relationships between hands and face, complex deformations, and the ambiguity of the single-view setting. The first and only method for hand-face interaction recovery, Decaf, introduces a global fitting optimization guided by contact and deformation estimation networks trained on studio-collected data with 3D annotations. However, Decaf suffers from a time-consuming optimization process and limited generalization capability due to its reliance on 3D annotations of hand-face interaction data. To address these issues, we present DICE, the first end-to-end method for Deformation-aware hand-face Interaction reCovEry from a single image. DICE estimates the poses of hands and faces, contacts, and deformations simultaneously using a Transformer-based architecture. It disentangles the regression of local deformation fields and global mesh vertex locations into two network branches, enhancing deformation and contact estimation for precise and robust hand-face mesh recovery. To improve generalizability, we propose a weakly-supervised training approach that augments the training set using in-the-wild images without 3D ground-truth annotations, employing the depths of 2D keypoints estimated by off-the-shelf models and adversarial priors of poses for supervision. Our experiments demonstrate that DICE achieves state-of-the-art performance on a standard benchmark and in-the-wild data in terms of accuracy and physical plausibility. Additionally, our method operates at an interactive rate (20 fps) on an Nvidia 4090 GPU, whereas Decaf requires more than 15 seconds for a single image. Our code will be publicly available upon publication.
Abstract:In this work, we propose a novel method named \textbf{Auto}mated Process Labeling via \textbf{C}onfidence \textbf{V}ariation (\textbf{\textsc{AutoCV}}) to enhance the reasoning capabilities of large language models (LLMs) by automatically annotating the reasoning steps. Our approach begins by training a verification model on the correctness of final answers, enabling it to generate automatic process annotations. This verification model assigns a confidence score to each reasoning step, indicating the probability of arriving at the correct final answer from that point onward. We detect relative changes in the verification model's confidence scores across reasoning steps to automatically annotate the reasoning process. This alleviates the need for numerous manual annotations and avoids the high computational costs associated with model-induced annotation approaches. We experimentally validate that the confidence variations learned by the verification model trained on final-answer correctness can effectively identify errors in the reasoning steps. Subsequently, we demonstrate that the process annotations generated by \textsc{AutoCV} can improve the accuracy of the verification model in selecting the correct answer from multiple outputs generated by LLMs. Notably, we achieve substantial improvements across five datasets in mathematics and commonsense reasoning. The source code of \textsc{AutoCV} is available at \url{https://github.com/rookie-joe/AUTOCV}.
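The labeling idea in the abstract above (detecting relative changes in the verifier's confidence across reasoning steps) can be sketched as follows. This is only an illustration under stated assumptions: the relative-change formula, the threshold value of -0.5, and the function name `label_steps` are hypothetical and are not taken from the paper or its repository.

```python
def label_steps(confidences, threshold=-0.5, eps=1e-8):
    """Label each reasoning step from a sequence of verifier confidence scores.

    confidences[0] is the score before any step is taken (question only);
    confidences[t] is the score after step t. A step is labeled 0 (likely
    erroneous) when the relative confidence change falls below `threshold`
    (an assumed hyperparameter), and 1 otherwise.
    """
    labels = []
    for prev, cur in zip(confidences, confidences[1:]):
        relative_change = (cur - prev) / max(prev, eps)  # guard against division by zero
        labels.append(0 if relative_change < threshold else 1)
    return labels


# Confidence collapses after the second step, so only that step is flagged.
step_labels = label_steps([0.80, 0.82, 0.30, 0.28])
```

Using a relative rather than absolute change means a drop from 0.82 to 0.30 is flagged, while the later small drift from 0.30 to 0.28 is not blamed on the following step.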
Abstract:Generative pre-trained large language models (LLMs) have demonstrated impressive performance over a wide range of tasks, thanks to the unprecedented amount of data they have been trained on. As established scaling laws indicate, LLMs' future performance improvement depends on the amount of compute and data we can leverage for pre-training. Federated learning (FL) has the potential to unleash the majority of the planet's data and computational resources, which are underutilized by the data-center-focused training methodology of current LLM practice. Our work presents a robust, flexible, reproducible FL approach that enables large-scale collaboration across institutions to train LLMs. This would mobilize more computational and data resources while matching or potentially exceeding centralized performance. We further show that the effectiveness of federated training scales with model size and present our approach for training a billion-scale federated LLM using limited resources. This will help data-rich actors to become the protagonists of LLM pre-training instead of leaving the stage to compute-rich actors alone.
Abstract:We introduce Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation. Although previous motion diffusion models have shown impressive results, they struggle to achieve fast generation while maintaining high-quality human motions. Motion latent diffusion has been proposed for efficient motion generation. However, effectively learning a latent space can be non-trivial in such a two-stage manner. Meanwhile, accelerating motion sampling by naively increasing the step size, e.g., DDIM, typically leads to a decline in motion quality because complex data distributions are then poorly approximated. In this paper, we propose EMDM, which allows far fewer sampling steps for fast motion generation by modeling the complex denoising distribution during multiple sampling steps. Specifically, we develop a Conditional Denoising Diffusion GAN to capture multimodal data distributions conditioned on both the control signal (i.e., the textual description) and the denoising time step. By modeling the complex data distribution, a larger sampling step size and fewer steps are achieved during motion synthesis, significantly accelerating the generation process. To effectively capture the human dynamics and reduce undesired artifacts, we employ motion geometric loss during network training, which improves the motion quality and training efficiency. As a result, EMDM achieves a remarkable speed-up at the generation stage while maintaining high-quality motion generation in terms of fidelity and diversity.
Abstract:In the U.S., corn is the most produced crop and has been an essential part of the American diet. To meet the demand for supply chain management and regional food security, accurate and timely large-scale corn yield prediction is attracting more attention in precision agriculture. Recently, remote sensing technology and machine learning methods have been widely explored for crop yield prediction. Currently, most county-level yield prediction models use county-level mean variables for prediction, ignoring much detailed information. Moreover, inconsistent spatial resolution between crop area and satellite sensors results in mixed pixels, which may decrease the prediction accuracy. Only a few works have addressed the mixed pixels problem in large-scale crop yield prediction. To address the information loss and mixed pixels problem, we developed a variational autoencoder (VAE) based multiple instance regression (MIR) model for large-scale corn yield prediction. We train a VAE on all unlabeled data and use the trained VAE for anomaly detection. As a preprocessing step, anomaly detection helps MIR find a better representation of every bag than traditional MIR methods and thus perform better in large-scale corn yield prediction. Our experiments showed that variational autoencoder based multiple instance regression (VAEMIR) outperformed all baseline methods in large-scale corn yield prediction. Though a suitable meta parameter is required, VAEMIR shows excellent potential in feature learning and extraction for large-scale corn yield prediction.
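To illustrate how anomaly detection can act as a preprocessing step for multiple instance regression, as the abstract above describes, the sketch below builds one bag (e.g., a county) representation from its instances (e.g., pixels). It assumes each instance already has an anomaly score, such as a VAE reconstruction error, and that the lowest-scoring instances are kept and averaged. The scoring, the `keep_fraction` parameter, and the mean aggregation are illustrative assumptions, not the paper's stated procedure.

```python
def bag_representation(instances, anomaly_scores, keep_fraction=0.8):
    """Average the least anomalous instances of a bag into one feature vector.

    instances      : list of equal-length feature vectors (one per pixel)
    anomaly_scores : one score per instance; higher means more anomalous
                     (assumed to come from a trained VAE, not computed here)
    keep_fraction  : share of instances kept (an assumed hyperparameter)
    """
    if not instances or len(instances) != len(anomaly_scores):
        raise ValueError("instances and anomaly_scores must be non-empty and aligned")
    n_keep = max(1, int(round(keep_fraction * len(instances))))
    # Rank instance indices from least to most anomalous and keep the first n_keep.
    order = sorted(range(len(instances)), key=lambda i: anomaly_scores[i])
    kept = [instances[i] for i in order[:n_keep]]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# One mixed pixel with extreme values and a high anomaly score is dropped.
pixels = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [2.0, 3.0], [2.0, 3.0]]
scores = [0.1, 0.2, 9.0, 0.1, 0.3]
county_features = bag_representation(pixels, scores, keep_fraction=0.8)
```

The resulting vector can then be passed to any regressor in place of the plain county-level mean, which would have been distorted by the outlying pixel.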
Abstract:In this paper, we introduce a set of effective TOken REduction (TORE) strategies for Transformer-based Human Mesh Recovery from monocular images. Current SOTA performance is achieved by Transformer-based structures. However, they suffer from high model complexity and computation cost caused by redundant tokens. We propose token reduction strategies based on two important aspects, i.e., the 3D geometry structure and 2D image feature, where we hierarchically recover the mesh geometry with priors from body structure and conduct token clustering to pass fewer but more discriminative image feature tokens to the Transformer. As a result, our method vastly reduces the number of tokens involved in high-complexity interactions in the Transformer, achieving competitive accuracy of shape recovery at a significantly reduced computational cost. We conduct extensive experiments across a wide range of benchmarks to validate the proposed method and further demonstrate the generalizability of our method on hand mesh recovery. Our code will be publicly available once the paper is published.
Abstract:This paper presents a self-supervised feature learning method for hyperspectral image classification. Our method constructs two different views of the raw hyperspectral image through a cross-representation learning method and then learns a semantically consistent representation over the created views with a contrastive learning method. Specifically, four cross-channel-prediction-based augmentation methods are designed to exploit the high-dimensional nature of hyperspectral data for view construction. More representative features are then learned by maximizing mutual information and minimizing conditional entropy across the different views in our contrastive network. This 'Cross-View-Prediction' style is straightforward and achieves state-of-the-art performance in unsupervised classification with a simple SVM classifier.
Abstract:Unsupervised learning methods for feature extraction are becoming more and more popular. We combine a popular contrastive learning method (prototypical contrastive learning) and a classic representation learning method (autoencoder) to design an unsupervised feature learning network for hyperspectral classification. Experiments show that our two proposed autoencoder networks have good feature learning capabilities by themselves, and that the contrastive learning network we designed can better combine the features of the two to learn more representative features. As a result, our method surpasses other comparison methods in the hyperspectral classification experiments, including some supervised methods. Moreover, our method extracts features faster than baseline methods. In addition, our method reduces the requirements for huge computing resources, separates feature extraction and contrastive learning, and allows more researchers to conduct research and experiments on unsupervised contrastive learning.