Underwater image enhancement is an important low-level vision task with many applications, and numerous algorithms have been proposed in recent years. Despite the demonstrated success, these results are often generated based on different assumptions using different datasets and metrics. In this paper, we propose a large-scale Realistic Underwater Image Enhancement (RUIE) dataset, in which all degraded images are divided into multiple sub-datasets according to natural underwater image quality evaluation metric and the degree of color deviation. Compared with exiting testing or training sets of realistic underwater scenes, the RUIE dataset contains three sub-datasets, which are specifically selected and classified for the experiment of non-reference image quality evaluation, color deviation and task-driven detection. Based on RUIE, we conduct extensive and systematic experiments to evaluate the effectiveness and limitations of various algorithms, on images with hierarchical classification of degradation. Our evaluation and analysis demonstrate the performance and limitations of state-of-the-art algorithms. The findings from these experiments not only confirm what is commonly believed, but also suggest new research directions. More importantly, we recognize that underwater image enhancement in practice usually serves as the preprocessing step for mid-level and high-level vision tasks. We thus propose to exploit the object detection performance on the enhanced images as a brand-new `task-specific' evaluation criterion for underwater image enhancement algorithms.