The Segment Anything Model (SAM) is a powerful vision foundation model that is revolutionizing the traditional paradigm of segmentation. Despite this, a reliance on prompting each frame and large computational cost limit its usage in robotically assisted surgery. Applications, such as augmented reality guidance, require little user intervention along with efficient inference to be usable clinically. In this study, we address these limitations by adopting lightweight SAM variants to meet the speed requirement and employing fine-tuning techniques to enhance their generalization in surgical scenes. Recent advancements in Tracking Any Point (TAP) have shown promising results in both accuracy and efficiency, particularly when points are occluded or leave the field of view. Inspired by this progress, we present a novel framework that combines an online point tracker with a lightweight SAM model that is fine-tuned for surgical instrument segmentation. Sparse points within the region of interest are tracked and used to prompt SAM throughout the video sequence, providing temporal consistency. The quantitative results surpass the state-of-the-art semi-supervised video object segmentation method on the EndoVis 2015 dataset, with an over 25 FPS inference speed running on a single GeForce RTX 4060 GPU.
The math abilities of large language models can represent their abstract reasoning ability. In this paper, we introduce and open-source our math reasoning LLMs InternLM-Math which is continue pre-trained from InternLM2. We unify chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and code interpreter in a unified seq2seq format and supervise our model to be a versatile math reasoner, verifier, prover, and augmenter. These abilities can be used to develop the next math LLMs or self-iteration. InternLM-Math obtains open-sourced state-of-the-art performance under the setting of in-context learning, supervised fine-tuning, and code-assisted reasoning in various informal and formal benchmarks including GSM8K, MATH, Hungary math exam, MathBench-ZH, and MiniF2F. Our pre-trained model achieves 30.3 on the MiniF2F test set without fine-tuning. We further explore how to use LEAN to solve math problems and study its performance under the setting of multi-task learning which shows the possibility of using LEAN as a unified platform for solving and proving in math. Our models, codes, and data are released at \url{https://github.com/InternLM/InternLM-Math}.
Temporal event forecasting aims to predict what will happen next given the observed events in history. Previous formulations of temporal event are unstructured, atomic, or lacking full temporal information, thus largely restricting the representation quality and forecasting ability of temporal events. To address these limitations, we introduce a novel formulation for Structured, Complex, and Time-complete Temporal Event (SCTc-TE). Based on this new formulation, we develop a simple and fully automated pipeline for constructing such SCTc-TEs from a large amount of news articles. Furthermore, we propose a novel model that leverages both Local and Global contexts for SCTc-TE forecasting, named LoGo. To evaluate our model, we construct two large-scale datasets named MidEast-TE and GDELT-TE. Extensive evaluations demonstrate the advantages of our datasets in multiple aspects, while experimental results justify the effectiveness of our forecasting model LoGo. We release the code and dataset via https://github.com/yecchen/GDELT-ComplexEvent.
Event forecasting has been a demanding and challenging task throughout the entire human history. It plays a pivotal role in crisis alarming and disaster prevention in various aspects of the whole society. The task of event forecasting aims to model the relational and temporal patterns based on historical events and makes forecasting to what will happen in the future. Most existing studies on event forecasting formulate it as a problem of link prediction on temporal event graphs. However, such pure structured formulation suffers from two main limitations: 1) most events fall into general and high-level types in the event ontology, and therefore they tend to be coarse-grained and offers little utility which inevitably harms the forecasting accuracy; and 2) the events defined by a fixed ontology are unable to retain the out-of-ontology contextual information. To address these limitations, we propose a novel task of context-aware event forecasting which incorporates auxiliary contextual information. First, the categorical context provides supplementary fine-grained information to the coarse-grained events. Second and more importantly, the context provides additional information towards specific situation and condition, which is crucial or even determinant to what will happen next. However, it is challenging to properly integrate context into the event forecasting framework, considering the complex patterns in the multi-context scenario. Towards this end, we design a novel framework named Separation and Collaboration Graph Disentanglement (short as SeCoGD) for context-aware event forecasting. Since there is no available dataset for this novel task, we construct three large-scale datasets based on GDELT. Experimental results demonstrate that our model outperforms a list of SOTA methods.
Real-time transrectal ultrasound (TRUS) image guidance during robot-assisted laparoscopic radical prostatectomy has the potential to enhance surgery outcomes. Whether conventional or photoacoustic TRUS is used, the robotic system and the TRUS must be registered to each other. Accurate registration can be performed using photoacoustic (PA markers). However, this requires a manual search by an assistant [19]. This paper introduces the first automatic search for PA markers using a transrectal ultrasound robot. This effectively reduces the challenges associated with the da Vinci-TRUS registration. This paper investigated the performance of three search algorithms in simulation and experiment: Weighted Average (WA), Golden Section Search (GSS), and Ternary Search (TS). For validation, a surgical prostate scenario was mimicked and various ex vivo tissues were tested. As a result, the WA algorithm can achieve 0.53 degree average error after 9 data acquisitions, while the TS and GSS algorithm can achieve 0.29 degree and 0.48 degree average errors after 28 data acquisitions.
Purpose: To achieve effective robot-assisted laparoscopic prostatectomy, the integration of transrectal ultrasound (TRUS) imaging system which is the most widely used imaging modelity in prostate imaging is essential. However, manual manipulation of the ultrasound transducer during the procedure will significantly interfere with the surgery. Therefore, we propose an image co-registration algorithm based on a photoacoustic marker method, where the ultrasound / photoacoustic (US/PA) images can be registered to the endoscopic camera images to ultimately enable the TRUS transducer to automatically track the surgical instrument Methods: An optimization-based algorithm is proposed to co-register the images from the two different imaging modalities. The principles of light propagation and an uncertainty in PM detection were assumed in this algorithm to improve the stability and accuracy of the algorithm. The algorithm is validated using the previously developed US/PA image-guided system with a da Vinci surgical robot. Results: The target-registration-error (TRE) is measured to evaluate the proposed algorithm. In both simulation and experimental demonstration, the proposed algorithm achieved a sub-centimeter accuracy which is acceptable in practical clinics. The result is also comparable with our previous approach, and the proposed method can be implemented with a normal white light stereo camera and doesn't require highly accurate localization of the PM. Conclusion: The proposed frame registration algorithm enabled a simple yet efficient integration of commercial US/PA imaging system into laparoscopic surgical setting by leveraging the characteristic properties of acoustic wave propagation and laser excitation, contributing to automated US/PA image-guided surgical intervention applications.