Abstract:Transformer-based table retrieval systems flatten structured tables into token sequences, making retrieval sensitive to the choice of serialization even when table semantics remain unchanged. We show that semantically equivalent serializations, such as $\texttt{csv}$, $\texttt{tsv}$, $\texttt{html}$, $\texttt{markdown}$, and $\texttt{ddl}$, can produce substantially different embeddings and retrieval results across multiple benchmarks and retriever families. To address this instability, we treat serialization embedding as noisy views of a shared semantic signal and use its centroid as a canonical target representation. We show that centroid averaging suppresses format-specific variation and can recover the semantic content common to different serializations when format-induced shifts differ across tables. Empirically, centroid representations outrank individual formats in aggregate pairwise comparisons across $\texttt{MPNet}$, $\texttt{BGE-M3}$, $\texttt{ReasonIR}$, and $\texttt{SPLADE}$. We further introduce a lightweight residual bottleneck adapter on top of a frozen encoder that maps single-serialization embeddings towards centroid targets while preserving variance and enforcing covariance regularization. The adapter improves robustness for several dense retrievers, though gains are model-dependent and weaker for sparse lexical retrieval. These results identify serialization sensitivity as a major source of retrieval variance and show the promise of post hoc geometric correction for serialization-invariant table retrieval.




Abstract:Table Question Answering (TQA) involves retrieving relevant tables from a large corpus to answer natural language queries. Traditional dense retrieval models, such as DTR and ColBERT, not only incur high computational costs for large-scale retrieval tasks but also require retraining or fine-tuning on new datasets, limiting their adaptability to evolving domains and knowledge. In this work, we propose $\textbf{CRAFT}$, a cascaded retrieval approach that first uses a sparse retrieval model to filter a subset of candidate tables before applying more computationally expensive dense models and neural re-rankers. Our approach achieves better retrieval performance than state-of-the-art (SOTA) sparse, dense, and hybrid retrievers. We further enhance table representations by generating table descriptions and titles using Gemini Flash 1.5. End-to-end TQA results using various Large Language Models (LLMs) on NQ-Tables, a subset of the Natural Questions Dataset, demonstrate $\textbf{CRAFT}$ effectiveness.
Abstract:The development of novel pharmaceuticals represents a significant challenge in modern science, with substantial costs and time investments. Deep generative models have emerged as promising tools for accelerating drug discovery by efficiently exploring the vast chemical space. However, this rapidly evolving field lacks standardized evaluation protocols, impeding fair comparison between approaches. This research presents an extensive analysis of the Molecular Sets (MOSES) platform, a comprehensive benchmarking framework designed to standardize evaluation of deep generative models in molecular design. Through rigorous assessment of multiple generative architectures, including recurrent neural networks, variational autoencoders, and generative adversarial networks, we examine their capabilities in generating valid, unique, and novel molecular structures while maintaining specific chemical properties. Our findings reveal that different architectures exhibit complementary strengths across various metrics, highlighting the complex trade-offs between exploration and exploitation in chemical space. This study provides detailed insights into the current state of the art in molecular generation and establishes a foundation for future advancements in AI-driven drug discovery.