Timely cancer reporting data are required to understand the impact of cancer, inform public health resource planning, and implement cancer policy, especially in sub-Saharan Africa, where reporting lags behind world averages. Unstructured pathology reports, which contain tumor-specific data, are the main source of information collected by cancer registries. The manual processing and labelling of pathology reports with International Classification of Diseases for Oncology (ICD-O) codes by human coders employed by cancer registries has led to a considerable lag in cancer reporting. We present a hierarchical deep learning classification method that employs convolutional neural network (CNN) models to automate the classification of 1813 anonymized breast cancer pathology reports with applicable ICD-O morphology codes across 9 classes. We demonstrate that the hierarchical deep learning classification method improves performance over a flat multiclass CNN model for ICD-O morphology classification of the same reports.
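The difference between hierarchical and flat classification can be illustrated with a minimal sketch. It is not the method evaluated here: the keyword-based stubs stand in for trained CNN models, and the grouping of morphology codes into "ductal", "lobular", and "other" branches is an assumption made only for illustration. A flat model would map a report directly to one of the leaf codes, whereas the hierarchical scheme first selects a coarse group and then lets a group-specific model choose the final code.

```python
from typing import Callable, Dict

# A classifier is any callable that maps report text to a label.
Classifier = Callable[[str], str]


def hierarchical_predict(
    report: str, parent: Classifier, children: Dict[str, Classifier]
) -> str:
    """Pick a coarse group with the parent model, then a leaf code with that group's child model."""
    group = parent(report)
    return children[group](report)


# Hypothetical stand-ins for trained models (assumptions, not the paper's classifiers).
def parent_stub(report: str) -> str:
    text = report.lower()
    if "ductal" in text:
        return "ductal"
    if "lobular" in text:
        return "lobular"
    return "other"


def ductal_stub(report: str) -> str:
    # 8500/2: intraductal carcinoma, noninfiltrating; 8500/3: infiltrating duct carcinoma.
    return "8500/2" if "in situ" in report.lower() else "8500/3"


CHILDREN: Dict[str, Classifier] = {
    "ductal": ductal_stub,
    "lobular": lambda report: "8520/3",  # single-code branch in this toy hierarchy
    "other": lambda report: "other",
}

if __name__ == "__main__":
    for text in ("Invasive ductal carcinoma, grade 2", "Ductal carcinoma in situ"):
        print(text, "->", hierarchical_predict(text, parent_stub, CHILDREN))
```

One motivation for this design is that each child model only has to separate a few closely related classes, which may help when some classes have very few reports.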
Like most cancer registries worldwide, the National Cancer Registry in South Africa employs expert human coders to label pathology reports with appropriate International Classification of Diseases for Oncology (ICD-O) codes spanning 42 different cancer types. The annotation effort is extensive given the large volume of cancer pathology reports the registry receives annually from public and private sector institutions. This manual process, coupled with other challenges, results in a significant 4-year lag in the reporting of annual cancer statistics in South Africa. We present a hierarchical deep learning ensemble method incorporating state-of-the-art convolutional neural network (CNN) models for the automatic labelling of 2201 de-identified, free-text pathology reports with appropriate ICD-O breast cancer topography codes across 8 classes. Our results show an improvement in primary site classification over the state-of-the-art CNN model of more than 14% in F1 micro and 55% in F1 macro scores. We demonstrate that the hierarchical deep learning ensemble improves on a state-of-the-art flat multiclass model for predicting ICD-O topography codes for pathology reports.
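Because the results are reported as both F1 micro and F1 macro scores, it may help to see how the two averages differ on imbalanced classes. The sketch below is a plain-Python illustration, not the evaluation code used in this work; the function name `f1_micro_macro` and the toy labels are assumptions. Micro-F1 pools true positives, false positives, and false negatives over all reports, while macro-F1 averages the per-class F1 scores so that rare classes count as much as common ones.

```python
from collections import Counter
from typing import Sequence, Tuple


def f1_micro_macro(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t: int, p: int, n: int) -> float:
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


if __name__ == "__main__":
    # Toy, imbalanced example: the model always predicts the majority class.
    y_true = ["C50.9"] * 8 + ["C50.4"] * 2
    y_pred = ["C50.9"] * 10
    micro, macro = f1_micro_macro(y_true, y_pred)
    print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the micro score stays high while the macro score drops sharply, because the minority class is never predicted. A gain that is much larger in macro than in micro F1 is therefore consistent with better handling of rare classes.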