Computerized document classification already orders the news articles that Apple's "News" app or Google's "personalized search" feature groups together to match a reader's interests. The invisible and therefore illegible decisions that go into these tailored searches have been the subject of a critique by scholars who emphasize that our intelligence about documents is only as good as our ability to understand the criteria of search. This article will attempt to unpack the procedures used in computational classification of texts, translating them into term legible to humanists, and examining opportunities to render the computational text classification process subject to expert critique and improvement.
Despite the growing availability of big data in many fields, historical data on socioevironmental phenomena are often not available due to a lack of automated and scalable approaches for collecting, digitizing, and assembling them. We have developed a data-mining method for extracting tabulated, geocoded data from printed directories. While scanning and optical character recognition (OCR) can digitize printed text, these methods alone do not capture the structure of the underlying data. Our pipeline integrates both page layout analysis and OCR to extract tabular, geocoded data from structured text. We demonstrate the utility of this method by applying it to scanned manufacturing registries from Rhode Island that record 41 years of industrial land use. The resulting spatio-temporal data can be used for socioenvironmental analyses of industrialization at a resolution that was not previously possible. In particular, we find strong evidence for the dispersion of manufacturing from the urban core of Providence, the state's capital, along the Interstate 95 corridor to the north and south.