Carla Teixeira Lopes

Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents

Nov 27, 2023

Mariana Dias, Carla Teixeira Lopes

Abstract: Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images that are non-machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This paper evaluates the impact of image processing methods and parameter tuning in OCR applied to typewritten cultural heritage documents. The approach uses a multi-objective problem formulation to minimize Levenshtein edit distance and maximize the number of words correctly identified with a non-dominated sorting genetic algorithm (NSGA-II) to tune the methods' parameters. Evaluation results show that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays' covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance.
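
The two objectives named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the word-matching rule (multiset overlap of whitespace-separated tokens) is an assumption because the abstract does not define it, and NSGA-II itself is not implemented here. Only the Pareto-dominance relation that non-dominated sorting relies on is shown.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct_words(ocr_text: str, truth: str) -> int:
    """Assumed definition: size of the multiset overlap of the two token lists."""
    overlap = Counter(ocr_text.split()) & Counter(truth.split())
    return sum(overlap.values())


def objectives(ocr_text: str, truth: str) -> tuple[int, int]:
    """Return (edit distance to minimize, correct words to maximize)."""
    return levenshtein(ocr_text, truth), correct_words(ocr_text, truth)


def dominates(p: tuple[int, int], q: tuple[int, int]) -> bool:
    """True if p is no worse than q on both objectives and better on one."""
    no_worse = p[0] <= q[0] and p[1] >= q[1]
    better = p[0] < q[0] or p[1] > q[1]
    return no_worse and better


def pareto_front(points: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

In a setup like the one described, each candidate parameter setting for a pre-processing method would be scored with `objectives` against a ground-truth transcription, and the optimizer would favor settings on the non-dominated front.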

* J. Comput. Cult. Herit. 16, 4, Article 77 (December 2023), 25 pages
* 25 pages, 4 figures

Automatic Quality Assessment of Wikipedia Articles -- A Systematic Literature Review

Oct 03, 2023

Pedro Miguel Moás, Carla Teixeira Lopes

Abstract: Wikipedia is the world's largest online encyclopedia, but maintaining article quality through collaboration is challenging. Wikipedia designed a quality scale, but with such a manual assessment process, many articles remain unassessed. We review existing methods for automatically measuring the quality of Wikipedia articles, identifying and comparing machine learning algorithms, article features, quality metrics, and the datasets used. We examine 149 distinct studies and explore the commonalities and gaps among them. The literature is extensive, and the approaches follow past technological trends. However, machine learning is still not widely used by Wikipedia, and we hope that our analysis helps future researchers change that reality.

* 37 pages, 10 figures, just accepted in ACM Computing Surveys (September 2023). This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Computing Surveys, https://dx.doi.org/10.1145/3625286

Health Information Retrieval -- State of the art report

May 18, 2022

Carla Teixeira Lopes

Abstract: This report provides an overview of the field of Information Retrieval (IR) in healthcare. It does not aim to introduce general concepts and theories of IR but to present and describe specific aspects of Health Information Retrieval (HIR). After a brief introduction to the broader field of IR, the current significance of HIR is discussed. Specific characteristics of Health Information, its classification, and the main existing representations for health concepts are described, together with the main products and services in the area (e.g., databases of health bibliographic content, health-specific search engines, and others). Recent research work is discussed, and the most active researchers, projects, and research groups are also presented. The main organizations and journals are also identified.

* 38 pages, 0 figures
