Ralph Ewerth

Classification of Visualization Types and Perspectives in Patents

Jul 19, 2023
Junaid Ahmed Ghauri, Eric Müller-Budack, Ralph Ewerth

Due to the rapid growth of patent applications each year, information and multimedia retrieval approaches that facilitate patent exploration and retrieval are of utmost importance. Different types of visualizations (e.g., graphs, technical drawings) and perspectives (e.g., side view, perspective view) are used to visualize details of innovations in patents. The classification of these images enables more efficient search and allows for further analysis. So far, datasets for image type classification lack some visualization types that are important for patents. Furthermore, related work does not make use of recent deep learning approaches, including transformers. In this paper, we adopt state-of-the-art deep learning methods for the classification of visualization types and perspectives in patent images. We extend the CLEF-IP dataset for image type classification in patents to ten classes and provide manual ground-truth annotations. In addition, we derive a set of hierarchical classes from a dataset that provides weakly labeled data for image perspectives. Experimental results demonstrate the feasibility of the proposed approaches. Source code, models, and dataset will be made publicly available.
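
A minimal sketch of the kind of approach described above: fine-tuning a pre-trained vision transformer for ten visualization-type classes. The backbone choice (torchvision's ViT-B/16), the dataset directory layout, and all hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: fine-tune a pre-trained ViT for 10 patent image classes.
# Paths, backbone, and hyperparameters are assumptions for illustration.
import torch
from torch import nn
from torchvision import datasets, models, transforms

NUM_CLASSES = 10  # extended CLEF-IP visualization types (per the abstract)

transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # patent drawings are mostly line art
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical layout: one subfolder per visualization type.
train_set = datasets.ImageFolder("clef_ip_extended/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, NUM_CLASSES)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # single epoch shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```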

* Accepted at the International Conference on Theory and Practice of Digital Libraries (TPDL) 2023 (which holds the copyright to publish the camera-ready version of this work)

Improving Generalization for Multimodal Fake News Detection

May 29, 2023
Sahar Tahmasebi, Sherzod Hakimov, Ralph Ewerth, Eric Müller-Budack

The increasing proliferation of misinformation and its alarming impact have motivated both industry and academia to develop approaches for fake news detection. However, state-of-the-art approaches are usually trained on smaller datasets or on a limited set of specific topics. As a consequence, these models lack generalization capabilities and are not applicable to real-world data. In this paper, we propose three models that adopt and fine-tune state-of-the-art multimodal transformers for multimodal fake news detection. We conduct an in-depth analysis by manipulating the input data to explore model performance in realistic use cases on social media. Our study across multiple models demonstrates that these systems suffer significant performance drops when confronted with manipulated data. To reduce this bias and improve model generalization, we suggest training data augmentation to enable more meaningful experiments for fake news detection on social media. The proposed data augmentation techniques enable models to generalize better and yield improved state-of-the-art results.
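
One way such a multimodal classifier could be structured is sketched below: frozen CLIP encoders provide text and image embeddings, and a small fusion head predicts real vs. fake. This is an illustration of the general recipe, not the authors' exact architectures.

```python
# Hedged sketch: CLIP text/image embeddings fused by a small classifier head.
import torch
from torch import nn
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class FusionClassifier(nn.Module):
    """Concatenate text and image features, then classify real vs. fake."""
    def __init__(self, dim=512):  # 512 = CLIP ViT-B/32 projection size
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, text_emb, image_emb):
        return self.head(torch.cat([text_emb, image_emb], dim=-1))

model = FusionClassifier()
# Placeholder post text and image; in the spirit of the paper, augmentation
# could e.g. pair a post with a swapped image to simulate manipulation.
inputs = processor(text=["example post"], images=Image.new("RGB", (224, 224)),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
logits = model(text_emb, image_emb)  # train with cross-entropy on real/fake labels
```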

* This paper has been accepted for ICMR 2023 

Predicting Knowledge Gain for MOOC Video Consumption

Dec 13, 2022
Christian Otto, Markos Stamatakis, Anett Hoppe, Ralph Ewerth

Informal learning on the Web using search engines as well as more structured learning on MOOC platforms have become very popular in recent years. As a result of the vast amount of available learning resources, intelligent retrieval and recommendation methods are indispensable -- this also holds for MOOC videos. However, the automatic assessment of this content with regard to predicting (potential) knowledge gain has not yet been addressed by previous work. In this paper, we investigate whether we can predict learning success after MOOC video consumption using 1) multimodal features covering slide and speech content, and 2) a wide range of text-based features describing the content of the video. In a comprehensive experimental setting, we test four different classifiers and various feature subset combinations. We conduct a detailed feature importance analysis to gain insights into which modality benefits knowledge gain prediction the most.
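
The experimental protocol (four classifiers, several feature subsets, feature importance analysis) might look roughly as follows in scikit-learn; the classifier choices, feature dimensions, and data below are placeholders.

```python
# Hedged sketch: compare classifiers across feature subsets with random
# placeholder data standing in for the slide/speech features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_slide = rng.normal(size=(200, 10))   # placeholder slide-content features
X_speech = rng.normal(size=(200, 10))  # placeholder speech-content features
y = rng.integers(0, 2, size=200)       # illustrative low/high knowledge gain labels

subsets = {"slides": X_slide, "speech": X_speech,
           "multimodal": np.hstack([X_slide, X_speech])}
classifiers = {"logreg": LogisticRegression(max_iter=1000), "svm": SVC(),
               "knn": KNeighborsClassifier(), "rf": RandomForestClassifier()}

for subset_name, X in subsets.items():
    for clf_name, clf in classifiers.items():
        acc = cross_val_score(clf, X, y, cv=5).mean()
        print(f"{subset_name:10s} {clf_name:6s} acc={acc:.3f}")

# Feature importance, e.g., from a random forest fit on all features:
rf = RandomForestClassifier(random_state=0).fit(subsets["multimodal"], y)
print(rf.feature_importances_)
```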

* AIED 2022. Lecture Notes in Computer Science, vol 13356, pp. 458-462  
* 13 pages, 1 figure, 3 tables 

MM-Locate-News: Multimodal Focus Location Estimation in News

Nov 15, 2022
Golsa Tahmasebzadeh, Eric Müller-Budack, Sherzod Hakimov, Ralph Ewerth

The consumption of news has changed significantly as the Web has become the most influential medium for information. To analyze and contextualize the large amount of news published every day, the geographic focus of an article is an important aspect for enabling content-based news retrieval. There are methods and datasets for geolocation estimation from text or photos, but they are typically considered separate tasks. However, a photo might lack geographic cues, and the text can mention multiple locations, making it challenging to recognize the focus location using a single modality. In this paper, a novel dataset called Multimodal Focus Location of News (MM-Locate-News) is introduced. We evaluate state-of-the-art methods on the new benchmark dataset and propose novel models to predict the focus location of news using both textual and image content. The experimental results show that the multimodal model outperforms unimodal models.
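
A speculative sketch of how a multimodal focus-location predictor could score candidates: each candidate location is ranked by a weighted combination of its similarity to the text and to the image. The embeddings and the weighting below are placeholder assumptions, not the models evaluated in the paper.

```python
# Hedged sketch: rank candidate locations by fused text/image similarity.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
text_emb = rng.normal(size=256)    # placeholder embedding of the article text
image_emb = rng.normal(size=256)   # placeholder embedding of the article photo
candidates = {city: rng.normal(size=256) for city in ["Berlin", "Paris", "Madrid"]}

alpha = 0.6  # assumed weighting between the two modalities
scores = {city: alpha * cosine(text_emb, emb) + (1 - alpha) * cosine(image_emb, emb)
          for city, emb in candidates.items()}
print(max(scores, key=scores.get))  # predicted focus location
```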

SoccerNet 2022 Challenges Results

Oct 05, 2022
Silvio Giancola, Anthony Cioppa, Adrien Deliège, Floriane Magera, Vladimir Somers, Le Kang, Xin Zhou, Olivier Barnich, Christophe De Vleeschouwer, Alexandre Alahi, Bernard Ghanem, Marc Van Droogenbroeck, Abdulrahman Darwish, Adrien Maglo, Albert Clapés, Andreas Luyts, Andrei Boiarov, Artur Xarles, Astrid Orcesi, Avijit Shah, Baoyu Fan, Bharath Comandur, Chen Chen, Chen Zhang, Chen Zhao, Chengzhi Lin, Cheuk-Yiu Chan, Chun Chuen Hui, Dengjie Li, Fan Yang, Fan Liang, Fang Da, Feng Yan, Fufu Yu, Guanshuo Wang, H. Anthony Chan, He Zhu, Hongwei Kan, Jiaming Chu, Jianming Hu, Jianyang Gu, Jin Chen, João V. B. Soares, Jonas Theiner, Jorge De Corte, José Henrique Brito, Jun Zhang, Junjie Li, Junwei Liang, Leqi Shen, Lin Ma, Lingchi Chen, Miguel Santos Marques, Mike Azatov, Nikita Kasatkin, Ning Wang, Qiong Jia, Quoc Cuong Pham, Ralph Ewerth, Ran Song, Rengang Li, Rikke Gade, Ruben Debien, Runze Zhang, Sangrok Lee, Sergio Escalera, Shan Jiang, Shigeyuki Odashima, Shimin Chen, Shoichi Masui, Shouhong Ding, Sin-wai Chan, Siyu Chen, Tallal El-Shabrawy, Tao He, Thomas B. Moeslund, Wan-Chi Siu, Wei Zhang, Wei Li, Xiangwei Wang, Xiao Tan, Xiaochuan Li, Xiaolin Wei, Xiaoqing Ye, Xing Liu, Xinying Wang, Yandong Guo, Yaqian Zhao, Yi Yu, Yingying Li, Yue He, Yujie Zhong, Zhenhua Guo, Zhiheng Li

The SoccerNet 2022 challenges were the second annual video understanding challenges organized by the SoccerNet team. In 2022, the challenges were composed of 6 vision-based tasks: (1) action spotting, focusing on retrieving action timestamps in long untrimmed videos, (2) replay grounding, focusing on retrieving the live moment of an action shown in a replay, (3) pitch localization, focusing on detecting line and goal part elements, (4) camera calibration, dedicated to retrieving the intrinsic and extrinsic camera parameters, (5) player re-identification, focusing on retrieving the same players across multiple views, and (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams. Compared to last year's challenges, tasks (1-2) had their evaluation metrics redefined to consider tighter temporal accuracies, and tasks (3-6) were novel, including their underlying data and annotations. More information on the tasks, challenges, and leaderboards is available on https://www.soccer-net.org. Baselines and development kits are available on https://github.com/SoccerNet.

* Accepted at ACM MMSports 2022 

Keypoint-less Camera Calibration for Sports Field Registration in Soccer

Jul 24, 2022
Jonas Theiner, Ralph Ewerth

Sports field registration in broadcast videos is typically interpreted as the task of homography estimation, which provides a mapping between a planar field and the corresponding visible area of the image. In contrast to previous approaches, we consider the task as a camera calibration problem. First, we introduce a differentiable objective function that is able to learn the camera pose and focal length from segment correspondences (e.g., lines, point clouds), based on pixel-level annotations for segments of a known calibration object, i.e., the sports field. The calibration module iteratively minimizes the segment reprojection error induced by the estimated camera parameters. Second, we propose a novel approach for 3D sports field registration from broadcast soccer images. The calibration module does not require any training data, and in contrast to the typical solution, which subsequently refines an initial estimation, our solution performs the registration in a single step. The proposed method is evaluated for sports field registration on two datasets and achieves superior results compared to two state-of-the-art approaches.
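
The core optimization idea can be illustrated in a few lines of PyTorch: treat camera parameters as learnable tensors and minimize a differentiable reprojection error by gradient descent. The sketch below simplifies heavily (point instead of segment correspondences, no rotation), so it is a conceptual analogue rather than the paper's calibration module.

```python
# Hedged sketch: gradient descent on a differentiable reprojection error.
import torch

# Hypothetical correspondences: field points (z = 0 plane, 105 m x 68 m pitch)
# and their observed pixel coordinates in a 1920x1080 frame.
world_pts = torch.tensor([[0., 0., 0.], [105., 0., 0.], [0., 68., 0.], [105., 68., 0.]])
image_pts = torch.tensor([[120., 540.], [1800., 520.], [300., 980.], [1650., 1010.]])

# Learnable parameters: camera translation and focal length (rotation is
# fixed to the identity here only to keep the sketch short).
t = torch.tensor([50., 30., -80.], requires_grad=True)
f = torch.tensor(1000., requires_grad=True)
cx, cy = 960., 540.  # assumed principal point at the image center

opt = torch.optim.Adam([t, f], lr=0.5)
for step in range(500):
    opt.zero_grad()
    cam = world_pts - t                      # world -> camera coordinates
    proj = f * cam[:, :2] / cam[:, 2:3]      # pinhole projection
    proj = proj + torch.tensor([cx, cy])
    loss = ((proj - image_pts) ** 2).mean()  # reprojection error
    loss.backward()
    opt.step()
print(f"final error {loss.item():.2f}, t={t.detach().tolist()}, f={f.item():.1f}")
```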

Semi-supervised Human Pose Estimation in Art-historical Images

Jul 11, 2022
Matthias Springstein, Stefanie Schneider, Christian Althaus, Ralph Ewerth

Gesture as a language of non-verbal communication has been theoretically established since the 17th century. However, its relevance for the visual arts has been expressed only sporadically. This may be primarily due to the sheer overwhelming amount of data that traditionally had to be processed by hand. With the steady progress of digitization, though, a growing number of historical artifacts have been indexed and made available to the public, creating a need for automatic retrieval of art-historical motifs with similar body constellations or poses. Since the domain of art, with its variance in style, differs significantly from the existing real-world datasets for human pose estimation, this presents new challenges. In this paper, we propose a novel approach to estimate human poses in art-historical images. In contrast to previous work that attempts to bridge the domain gap with pre-trained models or through style transfer, we suggest semi-supervised learning for both object and keypoint detection. Furthermore, we introduce a novel domain-specific art dataset that includes both bounding box and keypoint annotations of human figures. Our approach achieves significantly better results than methods that use pre-trained models or style transfer.
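
The semi-supervised ingredient can be sketched as pseudo-labeling: a detector trained on photographs predicts boxes and keypoints on unlabeled artworks, and only confident detections are kept as training targets for the next round. The backbone and confidence threshold below are assumptions, not the paper's exact pipeline.

```python
# Hedged sketch: pseudo-labeling unlabeled art images with a pose detector.
import torch
from torchvision.models.detection import (KeypointRCNN_ResNet50_FPN_Weights,
                                          keypointrcnn_resnet50_fpn)

teacher = keypointrcnn_resnet50_fpn(
    weights=KeypointRCNN_ResNet50_FPN_Weights.DEFAULT).eval()

def pseudo_label(images, score_threshold=0.9):
    """Keep only high-confidence person detections as pseudo ground truth."""
    with torch.no_grad():
        outputs = teacher(images)
    targets = []
    for out in outputs:
        keep = out["scores"] > score_threshold
        targets.append({"boxes": out["boxes"][keep],
                        "keypoints": out["keypoints"][keep]})
    return targets

# Usage: a list of CHW float tensors of art-historical images (random here).
images = [torch.rand(3, 480, 640)]
targets = pseudo_label(images)  # a student model would then train on these
```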

MM-Claims: A Dataset for Multimodal Claim Detection in Social Media

May 04, 2022
Gullal S. Cheema, Sherzod Hakimov, Abdul Sittar, Eric Müller-Budack, Christian Otto, Ralph Ewerth

In recent years, the problem of misinformation on the web has become widespread across languages, countries, and various social media platforms. Although there has been much work on automated fake news detection, the role of images and their variety are not well explored. In this paper, we investigate the roles of image and text at an earlier stage of the fake news detection pipeline, called claim detection. For this purpose, we introduce a novel dataset, MM-Claims, which consists of tweets and corresponding images covering three topics: COVID-19, climate change, and technology in a broad sense. The dataset contains roughly 86,000 tweets, of which 3,400 are labeled manually by multiple annotators for the training and evaluation of multimodal models. We describe the dataset in detail, evaluate strong unimodal and multimodal baselines, and analyze the potential and drawbacks of current models.
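
For illustration, a record in such a tweet-image claim dataset might be loaded along these lines; the file name and field names are hypothetical, not the released MM-Claims schema.

```python
# Hedged sketch: loading hypothetical tweet-image claim records from JSONL.
import json
from dataclasses import dataclass

@dataclass
class ClaimSample:
    tweet_text: str
    image_path: str
    topic: str      # e.g., "covid", "climate", "technology" (assumed values)
    is_claim: bool  # manual annotation, e.g., majority vote over annotators

def load_samples(path):
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            samples.append(ClaimSample(rec["text"], rec["image"],
                                       rec["topic"], rec["label"] == "claim"))
    return samples

# Usage with an assumed file: samples = load_samples("mm_claims_train.jsonl")
```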

* Accepted to Findings of NAACL 2022 

TIB-VA at SemEval-2022 Task 5: A Multimodal Architecture for the Detection and Classification of Misogynous Memes

Apr 13, 2022
Sherzod Hakimov, Gullal S. Cheema, Ralph Ewerth

The detection of offensive, hateful content on social media is a challenging problem that affects many online users on a daily basis. Hateful content is often used to target a group of people based on ethnicity, gender, religion, and other factors. Hate or contempt toward women has been increasing on social platforms. Misogynous content detection is especially challenging when textual and visual modalities are combined into a single context, e.g., text overlaid on an image, also known as a meme. In this paper, we present a multimodal architecture that combines textual and visual features in order to detect misogynous meme content. The proposed architecture is evaluated in the SemEval-2022 Task 5: MAMI - Multimedia Automatic Misogyny Identification challenge under the team name TIB-VA. Our solution obtained the best result in Task B, where the challenge is to classify whether a given document is misogynous and to further identify the main sub-classes of shaming, stereotype, objectification, and violence.
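
A multi-task head in the spirit of Task B could look like the sketch below: one binary output for misogyny plus four sub-class outputs on top of fused text-image features. The feature extractor is omitted and the dimensions are assumptions, not the exact TIB-VA architecture.

```python
# Hedged sketch: shared trunk with a binary head (misogynous or not) and a
# multi-label head for the four sub-classes; feature dimensions are assumed.
import torch
from torch import nn

class MultiTaskMemeHead(nn.Module):
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.misogynous = nn.Linear(256, 1)  # Task A: binary decision
        self.subclasses = nn.Linear(256, 4)  # Task B: shaming, stereotype,
                                             # objectification, violence

    def forward(self, fused_features):
        h = self.shared(fused_features)
        return torch.sigmoid(self.misogynous(h)), torch.sigmoid(self.subclasses(h))

head = MultiTaskMemeHead()
fused = torch.randn(8, 1024)  # e.g., concatenated text + image embeddings
main_prob, sub_probs = head(fused)
# Training would use binary cross-entropy on both outputs (multi-label for B).
```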

* Accepted for publication at the SemEval-2022 Workshop, Task 5: MAMI - Multimedia Automatic Misogyny Identification, co-located with NAACL 2022