Characterizing the neural encoding of behavior remains challenging in many research areas, due in part to the complex and noisy spatiotemporal dynamics of evoked brain activity. An important aspect of modeling these neural encodings is separating robust, behaviorally relevant signals from background activity, which often contains signals from irrelevant brain processes and decaying information from previous behavioral events. To achieve this separation, we develop a two-branch State Space Variational AutoEncoder (SSVAE) model that separately describes the instantaneous evoked foreground signals and the context-dependent background signals. We model spontaneous speech-evoked brain dynamics using smoothed Gaussian mixture models. Applying the proposed SSVAE model to track electrocorticography (ECoG) dynamics in one participant over multiple hours, we find that it predicts speech-related dynamics more accurately than other latent factor inference algorithms. Our results demonstrate that separately modeling the instantaneous speech-evoked and the slow context-dependent brain dynamics can enhance tracking performance, which has important implications for the development of advanced neural encoding and decoding models across neuroscience sub-disciplines.