The Wasserstein distance has emerged as a key metric to quantify distances between probability distributions, with applications in various fields, including machine learning, control theory, decision theory, and biological systems. Consequently, learning an unknown distribution with non-asymptotic and easy-to-compute error bounds in Wasserstein distance has become a fundamental problem in many fields. In this paper, we devise a novel algorithmic and theoretical framework to approximate an unknown probability distribution $\mathbb{P}$ from a finite set of samples by an approximate discrete distribution $\widehat{\mathbb{P}}$ while bounding the Wasserstein distance between $\mathbb{P}$ and $\widehat{\mathbb{P}}$. Our framework leverages optimal transport, nonlinear optimization, and concentration inequalities. In particular, we show that, even if $\mathbb{P}$ is unknown, the Wasserstein distance between $\mathbb{P}$ and $\widehat{\mathbb{P}}$ can be efficiently bounded with high confidence by solving a tractable optimization problem (a mixed integer linear program) of a size that only depends on the size of the support of $\widehat{\mathbb{P}}$. This enables us to develop intelligent clustering algorithms to optimally find the support of $\widehat{\mathbb{P}}$ while minimizing the Wasserstein distance error. On a set of benchmarks, we demonstrate that our approach outperforms state-of-the-art comparable methods by generally returning approximating distributions with substantially smaller support and tighter error bounds.