Jakub Sido

ElectroCardioGuard: Preventing Patient Misidentification in Electrocardiogram Databases through Neural Networks

Jun 09, 2023
Michal Seják, Jakub Sido, David Žahour

Electrocardiograms (ECGs) are commonly used by cardiologists to detect heart-related pathological conditions. Reliable collections of ECGs are crucial for precise diagnosis. However, in clinical practice, captured ECG recordings can inadvertently be assigned to the wrong patient. In collaboration with a clinical and research facility that recognized this challenge and reached out to us, we present a study that addresses this issue. In this work, we propose a small and efficient neural-network-based model for determining whether two ECGs originate from the same patient. Our model generalizes well and achieves state-of-the-art performance in gallery-probe patient identification on PTB-XL while using 760x fewer parameters. Furthermore, we present a technique that uses our model to detect recording-assignment mistakes, showing its applicability in a realistic scenario. Finally, we evaluate our model on a newly collected ECG dataset curated specifically for this study, and make it publicly available to the research community.

* 22 pages, 4 figures, 6 tables 
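
The gallery-probe setting mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each ECG has already been mapped to a fixed-length embedding by some model (not shown), and it uses cosine similarity with a fixed threshold; both choices, along with the names `identify` and `flag_misassigned`, are illustrative assumptions and not the authors' published method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def identify(probe, gallery):
    """Return the patient id whose gallery embedding is most similar to the probe."""
    return max(gallery, key=lambda pid: cosine(probe, gallery[pid]))

def flag_misassigned(embedding, claimed_id, gallery, threshold=0.5):
    """Flag a recording whose similarity to its claimed patient is below the threshold.

    The threshold value is an arbitrary assumption for this toy example.
    """
    return cosine(embedding, gallery[claimed_id]) < threshold

# Hypothetical 3-dimensional embeddings, one per enrolled patient.
gallery = {"patient_a": [1.0, 0.0, 0.0], "patient_b": [0.0, 1.0, 0.0]}
probe = [0.9, 0.1, 0.0]

best = identify(probe, gallery)                             # closest gallery entry
suspicious = flag_misassigned(probe, "patient_b", gallery)  # claimed id looks wrong
```

In this toy data the probe is closest to `patient_a`, so a record that files it under `patient_b` would be flagged for review.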

Findings of the Shared Task on Multilingual Coreference Resolution

Sep 16, 2022
Zdeněk Žabokrtský, Miloslav Konopík, Anna Nedoluzhko, Michal Novák, Maciej Ogrodniczuk, Martin Popel, Ondřej Pražák, Jakub Sido, Daniel Zeman, Yilun Zhu

This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were expected to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used as the source of training and evaluation data. The CoNLL score, used in previous coreference-oriented shared tasks, served as the main evaluation metric. Five participating teams submitted 8 coreference prediction systems; in addition, the organizers provided a competitive Transformer-based baseline system at the beginning of the shared task. The winning system outperformed the baseline by 12 percentage points (in terms of the CoNLL scores averaged across all datasets for individual languages).
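
For readers unfamiliar with the metric: the CoNLL score is conventionally the unweighted mean of the MUC, B-cubed, and CEAF-e F1 scores. The sketch below shows that arithmetic and a macro-average over datasets; the dataset names and F1 values are made up for illustration, and the shared task's official scorer may apply additional matching rules not reflected here.

```python
def conll_score(muc_f1, b_cubed_f1, ceafe_f1):
    """Unweighted mean of the three coreference F1 scores."""
    return (muc_f1 + b_cubed_f1 + ceafe_f1) / 3.0

def macro_average(per_dataset_scores):
    """Average the per-dataset CoNLL scores, weighting every dataset equally."""
    return sum(per_dataset_scores.values()) / len(per_dataset_scores)

# Made-up F1 values (in percent) for two hypothetical datasets.
scores = {
    "dataset_1": conll_score(70.0, 60.0, 65.0),
    "dataset_2": conll_score(50.0, 55.0, 60.0),
}
overall = macro_average(scores)
```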

MQDD: Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain

Mar 29, 2022
Jan Pašek, Jakub Sido, Miloslav Konopík, Ondřej Pražák

This work proposes a new pipeline that leverages data collected from the Stack Overflow website to pre-train a multimodal model for finding duplicates on question answering websites. Our multimodal model is trained on question descriptions and source code in multiple programming languages. We design two new learning objectives to improve duplicate detection capabilities. The result of this work is a mature, fine-tuned Multimodal Question Duplicity Detection (MQDD) model, ready to be integrated into a Stack Overflow search system, where it can help users find answers to already answered questions. Alongside the MQDD model, we release two datasets related to the software engineering domain. The first, the Stack Overflow Dataset (SOD), is a massive corpus of paired questions and answers. The second, the Stack Overflow Duplicity Dataset (SODD), contains data for training duplicate detection models.

Czech News Dataset for Semantic Textual Similarity

Aug 23, 2021
Jakub Sido, Michal Seják, Ondřej Pražák, Miloslav Konopík, Václav Moravec

This paper describes a novel dataset consisting of sentences with semantic similarity annotations. The data originate from the journalistic domain in the Czech language. We describe the process of collecting and annotating the data in detail. The dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute each annotation as an average of 9 individual annotations. We evaluate the quality of the dataset by measuring inter-annotator and intra-annotator agreement. Besides agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the massive number of training annotations (116,956), the model can perform significantly better than an average annotator (0.92 versus 0.86 in Pearson's correlation coefficient).
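
The two computations named in the abstract, averaging nine individual annotations into one test label and scoring predictions with Pearson's correlation coefficient, can be sketched as follows. The ratings, the rating scale, and the predictions are made-up values for illustration only, not data from the paper.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ratings: three sentence pairs, nine annotators each.
ratings = [
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [3, 3, 4, 3, 2, 3, 3, 4, 2],
    [6, 5, 6, 6, 5, 6, 6, 6, 5],
]
gold = [mean(r) for r in ratings]   # one averaged label per sentence pair
predictions = [0.5, 2.8, 5.9]       # hypothetical model outputs
r = pearson(predictions, gold)
```

Averaging several annotators reduces the noise of any single rating, which is why a model scored against averaged labels can correlate with them more strongly than an individual annotator does.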

Czech News Dataset for Semantic Textual Similarity

Aug 19, 2021
Jakub Sido, Michal Seják, Ondřej Pražák, Miloslav Konopík, Václav Moravec

This paper describes a novel dataset consisting of sentences with semantic similarity annotations. The data originate from the journalistic domain in the Czech language. We describe the process of collecting and annotating the data in detail. The dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute each annotation as an average of 9 individual annotations. We evaluate the quality of the dataset by measuring inter-annotator and intra-annotator agreement. Besides agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the massive number of training annotations (116,956), the model can perform significantly better than an average annotator (0.92 versus 0.86 in Pearson's correlation coefficient).

Multilingual Coreference Resolution with Harmonized Annotations

Jul 26, 2021
Ondřej Pražák, Miloslav Konopík, Jakub Sido

In this paper, we present coreference resolution experiments with a newly created multilingual corpus, CorefUD. We focus on the following languages: Czech, Russian, Polish, German, Spanish, and Catalan. In addition to monolingual experiments, we combine the training data in multilingual experiments and train two joint models -- one for Slavic languages and one for all the languages together. We rely on an end-to-end deep learning model that we slightly adapted for the CorefUD corpus. Our results show that we can benefit from harmonized annotations, and that joint models help significantly for the languages with less training data.

Czert -- Czech BERT-like Model for Language Representation

Mar 24, 2021
Jakub Sido, Ondřej Pražák, Pavel Přibáň, Jan Pašek, Michal Seják, Miloslav Konopík

This paper describes the training process of the first Czech monolingual language representation models based on BERT and ALBERT architectures. We pre-train our models on more than 340K sentences, which is 50 times more than multilingual models that include Czech data. We outperform the multilingual models on 7 out of 10 datasets. In addition, we establish new state-of-the-art results on seven datasets. Finally, we discuss properties of monolingual and multilingual models based on our results. We publish all the pre-trained and fine-tuned models freely for the research community.

* 13 pages 
UWB at SemEval-2020 Task 1: Lexical Semantic Change Detection

Nov 30, 2020
Ondřej Pražák, Pavel Přibáň, Stephen Taylor, Jakub Sido

In this paper, we describe our method for the detection of lexical semantic change, i.e., word sense changes over time. We examine semantic differences between specific words in two corpora, chosen from different time periods, for English, German, Latin, and Swedish. Our method was created for the SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. We ranked 1st in Sub-task 1: binary change detection, and 4th in Sub-task 2: ranked change detection. Our method is fully unsupervised and language independent. It consists of preparing a semantic vector space for each corpus, earlier and later; computing a linear transformation between earlier and later spaces, using Canonical Correlation Analysis and Orthogonal Transformation; and measuring the cosines between the transformed vector for the target word from the earlier corpus and the vector for the target word in the later corpus.

* arXiv admin note: substantial text overlap with arXiv:2011.14678 
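
The align-then-compare idea in the abstract above can be shown with a toy example. This sketch is a deliberately simplified stand-in: it works in two dimensions, fits a rotation only (the closed-form 2-D special case of an orthogonal alignment), and estimates it from a handful of words assumed to be stable. The vectors, the word names, and the choice of anchor words are all assumptions for illustration; the paper's method operates on real embedding spaces and also uses Canonical Correlation Analysis.

```python
import math

def best_rotation_angle(src, dst):
    """Angle of the 2-D rotation R that minimises sum ||R s_i - d_i||^2."""
    num = sum(s[0] * d[1] - s[1] * d[0] for s, d in zip(src, dst))
    den = sum(s[0] * d[0] + s[1] * d[1] for s, d in zip(src, dst))
    return math.atan2(num, den)

def rotate(v, theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def cosine(u, v):
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))

# Toy spaces: the later space is the earlier one rotated by 90 degrees,
# except for the word "changed", whose vector has moved elsewhere.
earlier = {"a1": (1.0, 0.0), "a2": (0.0, 1.0), "a3": (1.0, 1.0),
           "stable": (2.0, 1.0), "changed": (1.0, 2.0)}
later = {"a1": (0.0, 1.0), "a2": (-1.0, 0.0), "a3": (-1.0, 1.0),
         "stable": (-1.0, 2.0), "changed": (2.0, 1.0)}

anchors = ["a1", "a2", "a3"]  # words assumed stable, used to fit the map
theta = best_rotation_angle([earlier[w] for w in anchors],
                            [later[w] for w in anchors])

# Cosine between the transformed earlier vector and the later vector.
stable_sim = cosine(rotate(earlier["stable"], theta), later["stable"])
changed_sim = cosine(rotate(earlier["changed"], theta), later["changed"])
```

A cosine near 1 after alignment suggests the word kept its meaning, while a low cosine, as for `changed` here, suggests semantic change.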