Designing a robust affinity model is the key issue in multiple target tracking (MTT). This paper proposes a novel affinity model by learning feature representation and distance metric jointly in a unified deep architecture. Specifically, we design a CNN network to obtain appearance cue tailored towards person Re-ID, and an LSTM network for motion cue to predict target position, respectively. Both cues are combined with a triplet loss function, which performs end-to-end learning of the fused features in a desired embedding space. Experiments in the challenging MOT benchmark demonstrate, that even by a simple Linear Assignment strategy fed with affinity scores of our method, very competitive results are achieved when compared with the most recent state-of-theart approaches.