A key challenge in large-scale image retrieval problems is the trade-off between scalability and accuracy. Recent research has made great strides to improve scalability with compact global image features, and accuracy with local image features. In this work, our main contribution is to unify global and local image features into a single deep model, enabling scalable retrieval with high accuracy. We refer to the new model as DELG, standing for DEep Local and Global features. We leverage lessons from recent feature learning work and propose a model that combines generalized mean pooling for global features and attentive selection for local features. The entire network can be learned end-to-end by carefully balancing the gradient flow between two heads -- requiring only image-level labels. We also introduce an autoencoder-based dimensionality reduction technique for local features, which is integrated into the model, improving training efficiency and matching performance. Experiments on the Revisited Oxford and Paris datasets demonstrate that our jointly learned ResNet-50 based features outperform all previous results using deep global features (most with heavier backbones) and those that further re-rank with local features. Code and models will be released.