Abstract: Despite its importance to applications in protein design, predicting protein properties like binding affinity and thermostability from sparse experimental data remains a significant challenge. Accordingly, we introduce a class of sequence kernels that exploit evolutionary substitution matrices as well as local linearity and demonstrate that the resulting Gaussian processes provide data-efficient models of protein property landscapes, frequently outperforming alternatives that rely on foundation model embeddings. Furthermore, by learning what are in effect structure-aware substitution matrices, we show that our kernels can readily incorporate structural information from foundation models. We demonstrate that these structure-conditioned kernels are well suited to multi-task learning across multiple protein property landscapes and can decisively outperform local supervised learning methods.
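
To make the idea of a substitution-matrix sequence kernel concrete, the following is a minimal sketch and not the kernel proposed in the abstract. It assumes aligned, equal-length sequences, a three-letter toy alphabet, and made-up similarity scores chosen so that the matrix is symmetric positive semidefinite; the names `SUBSTITUTION`, `site_similarity`, `sequence_kernel`, and `gram_matrix` are hypothetical.

```python
# Toy substitution scores over a 3-letter alphabet. These values are
# hypothetical (not taken from BLOSUM or any published matrix) and are chosen
# so that the full symmetric matrix is positive semidefinite.
SUBSTITUTION = {
    ("A", "A"): 1.0, ("A", "V"): 0.5, ("A", "D"): 0.0,
    ("V", "V"): 1.0, ("V", "D"): 0.0,
    ("D", "D"): 1.0,
}


def site_similarity(a, b):
    """Look up the symmetric substitution score for residues a and b."""
    if (a, b) in SUBSTITUTION:
        return SUBSTITUTION[(a, b)]
    return SUBSTITUTION[(b, a)]


def sequence_kernel(x, y):
    """Sum per-site substitution scores of two aligned sequences.

    A sum of positive semidefinite per-site kernels is itself positive
    semidefinite, so the result can serve as a Gaussian process covariance.
    """
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(site_similarity(a, b) for a, b in zip(x, y))


def gram_matrix(seqs):
    """Build the kernel (Gram) matrix for a list of aligned sequences."""
    return [[sequence_kernel(x, y) for y in seqs] for x in seqs]


if __name__ == "__main__":
    variants = ["AVD", "VVD", "DVD"]
    for row in gram_matrix(variants):
        print(row)
```

In this sketch a conservative A-to-V change lowers the covariance with the parent sequence less than an A-to-D change does, which is the kind of prior knowledge a substitution matrix contributes when measurements are sparse.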




Abstract: Computational protein design has the potential to deliver novel molecular structures, binders, and catalysts for myriad applications. Recent neural graph-based models that use backbone coordinate-derived features show exceptional performance on native sequence recovery tasks and are promising frameworks for design. A statistical framework for modeling protein sequence landscapes using Tertiary Motifs (TERMs), compact units of recurring structure in proteins, has also demonstrated good performance on protein design tasks. In this work, we investigate the use of TERM-derived data as features in neural protein design frameworks. Our graph-based architecture, TERMinator, incorporates TERM-based and coordinate-based information and outputs a Potts model over sequence space. TERMinator outperforms state-of-the-art models on native sequence recovery tasks, suggesting that utilizing TERM-based and coordinate-based features together is beneficial for protein design.
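
For readers unfamiliar with Potts models over sequence space, the sketch below shows the standard energy form, E(s) = sum_i h_i(s_i) + sum_{i<j} J_ij(s_i, s_j), on a toy example. It is an illustration only, not TERMinator's code: the parameter values, the two-letter alphabet, the convention that lower energy is more favorable, and the names `potts_energy`, `best_sequence`, `FIELDS`, and `COUPLINGS` are all assumptions.

```python
from itertools import product


def potts_energy(seq, fields, couplings):
    """Return the sum of per-site fields plus pairwise couplings for seq."""
    # Single-site terms h_i(s_i).
    energy = sum(fields[i][res] for i, res in enumerate(seq))
    # Pairwise terms J_ij(s_i, s_j) for each coupled position pair (i, j).
    for (i, j), table in couplings.items():
        energy += table[(seq[i], seq[j])]
    return energy


def best_sequence(alphabet, fields, couplings):
    """Brute-force the lowest-energy sequence (feasible only for tiny cases)."""
    candidates = ("".join(s) for s in product(alphabet, repeat=len(fields)))
    return min(candidates, key=lambda s: potts_energy(s, fields, couplings))


# Hypothetical parameters for a two-position, two-letter model.
FIELDS = [{"A": -1.0, "G": 0.0}, {"A": 0.5, "G": -0.5}]
COUPLINGS = {
    (0, 1): {
        ("A", "A"): 1.0, ("A", "G"): -1.0,
        ("G", "A"): 0.0, ("G", "G"): 0.25,
    },
}

if __name__ == "__main__":
    for s in ("AA", "AG", "GA", "GG"):
        print(s, potts_energy(s, FIELDS, COUPLINGS))
    print("lowest energy:", best_sequence("AG", FIELDS, COUPLINGS))
```

Exhaustive enumeration is used here only because the example is tiny; for real proteins the sequence space grows exponentially with length, so designs from a Potts model are typically found by sampling or optimization.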