Latest Tabular Data Research Papers
The newest Tabular Data papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Tabular Data so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
Abhijoy Sarkar, Aarchi Singh Thakur · arXiv · Jun 9, 2026
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computatio…
- ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning
Zhensheng Wang, Xiaole Liu, Wenmian Yang, Kun Zhou et al. · arXiv · Jun 1, 2026
The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numerical prediction. To address this gap, we introduce a novel task, Open-Domain Tabular Question Answerin…
- TabPrep: Closing the Feature Engineering Gap in Tabular Benchmarks
Andrej Tschalzev, Nick Erickson, Yuyang Wang, Huzefa Rangwala et al. · arXiv · Jun 1, 2026
Progress in tabular machine learning has largely focused on increasingly sophisticated model architectures. At the same time, feature engineering remains a critical yet underexplored component of real-world modeling pipelines that is entire…
- Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets
M. Ross Kunz, John Merickel, Keith Wilson · arXiv · May 28, 2026
Numeric tabular datasets are the dominant data format in scientific practice, yet large language models lack native mechanisms for representing numeric datasets in a meaningful way across heterogeneous feature spaces. Existing approaches ei…
- LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models
Oroel Ipas, Guillermo Gomez-Trenado, Rocío Romero-Zaliz, Isaac Triguero · arXiv · May 26, 2026
Selecting which instances to label is a key challenge in low-label tabular learning. For recent Tabular Foundation Models such as TabPFN, context selection directly determines predictive performance. Supervised oracle experiments show that …
- Agentic Data Intelligence for General Tabular Modeling
Jun-Peng Jiang, An-Yang Ji, Jia-Yi Zhu, Han-Jia Ye · FMSD @ ICML 2026 Poster · May 25, 2026
Tabular data is one of the most common forms for organizing real-world information, supporting diverse tasks such as prediction, reasoning, querying, and generation. Despite rapid progress in tabular learning, most existing methods remain s…
- A Generative Foundation Model for Heterogeneous Tabular Data
Xiangjian Jiang, Mingxuan Liu, Nikola Simidjievski, Tassilo Klein et al. · FMSD @ ICML 2026 Poster · May 25, 2026
Generative modelling is a demanding test of foundation models, because it requires robust, holistic representation learning for a given data modality, rather than optimisation for a supervised prediction target alone. While recent work on t…
- Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees
Christian Janos Lebeda, David Erb, Tudor Cebere, Aurélien Bellet · arXiv · May 21, 2026
Random forests are widely used in fields involving sensitive tabular data, but existing approaches to enforcing differential privacy (DP) typically degrade performance to the point of impracticality. In this paper, we introduce Lumberjack, …
- Distilling Tabular Foundation Models for Structured Health Data
Aditya Tanna, Nassim Bouarour, Mohamed Bouadi, Vinay Kumar Sankarapu et al. · arXiv · May 18, 2026
Tabular foundation models (TFMs) achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. We study whether their predictive behavior can be transferred to lightweight tabul…
- Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap
Aditya Tanna, Yash Desai, Pratinav Seth, Mohamed Bouadi et al. · arXiv · May 18, 2026
Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go-to fix here, and it works less well than expected. Six mod…
- V4FinBench: Benchmarking Tabular Foundation Models, LLMs, and Standard Methods on Corporate Bankruptcy Prediction
Marcin Kostrzewa, Sebastian Tomczak, Roman Furman, Anna Poberezhna et al. · arXiv · May 11, 2026
Corporate bankruptcy prediction is a high-stakes financial task characterized by severe class imbalance and multi-horizon forecasting demands. Public datasets supporting it remain scarce and small: widely used free benchmarks contain betwee…
- Physiologically Grounded Driver Behavior Classification: SHAP-Driven Elite Feature Selection and Hybrid Gradient Boosting for Multimodal Physiological Signals
Sahar Askari, Mohammad Mahdi Mirza Ali Mohammadi, Fatemeh Ensafdoust, Amin Golnari et al. · arXiv · May 6, 2026
An interpretable and scalable framework for decoding driving behaviors from multimodal physiological signals is proposed in this study. We utilize a large-scale multimodal physiological driving-behavior dataset comprising synchronized electro…
- TabSurv: Adapting Modern Tabular Neural Networks to Survival Analysis
Stanislav Kirpichenko, Andrei Konstantinov, Lev Utkin · arXiv · May 5, 2026
Survival analysis on tabular data is a well-studied problem. However, existing deep learning methods are often highly task-specific, which can limit the transfer of new approaches from other domains and introduce constraints that may affect…
- Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes
Cyrille Kone, Kevin Jamieson · arXiv · May 5, 2026
We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer from high co…
- Generating Statistical Charts with Validation-Driven LLM Workflows
Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan · arXiv · May 1, 2026
Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely provide ful…
- Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan · arXiv · Apr 29, 2026
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference…
- Multiple Additive Neural Networks for Structured and Unstructured Data
Janis Mohr, Jörg Frochte · arXiv · Apr 29, 2026
This paper extends and explains the Multiple Additive Neural Networks (MANN) methodology, an enhancement to the traditional Gradient Boosting framework, utilizing nearly shallow neural networks instead of decision trees as base learners. Th…
- VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang et al. · arXiv · Apr 21, 2026
We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines…
- Tabular foundation models for in-context prediction of molecular properties
Karim K. Ben Hicham, Jan G. Rittig, Martin Grohe, Alexander Mitsos · arXiv · Apr 17, 2026
Accurate molecular property prediction is central to drug discovery, catalysis, and process design, yet real-world applications are often limited by small datasets. Molecular foundation models provide a promising direction by learning trans…
- Benchmarking Optimizers for MLPs in Tabular Deep Learning
Yury Gorishniy, Ivan Rubachev, Dmitrii Feoktistov, Artem Babenko · arXiv · Apr 16, 2026
MLP is a heavily used backbone in modern deep learning (DL) architectures for supervised learning on tabular data, and AdamW is the go-to optimizer used to train tabular DL models. Unlike architecture design, however, the choice of optimize…
- SemTabla: A Human-in-the-Loop Framework for Semantic Enrichment and Validation of Data Tables
Zhuochen Jin, Yingjie Mi, Yehang Zhu, yichen yao et al. · OpenAlex · Apr 13, 2026
Data tables are widely used to record critical information, enabling decision-makers to derive insights through table question answering (Table QA). However, the metadata from table schemas alone often fail to capture the underlying busines…
- Helix 1.0: An open-source framework for reproducible and interpretable machine learning on tabular scientific data
Eduardo Aguilar-Bejarano, Daniel Lea, Karthikeyan Sivakumar, Jimiama M. Mase et al. · Patterns · Apr 1, 2026
- Explainable artificial intelligence for cross domain evaluation of predictive models in multi-disease diagnosis
Zulfikar Ali Ansari, K. Kiran Kumar, S. Sameen Fatima, Shadab Siddiqui et al. · Discover Computing · Mar 19, 2026
Accurate and interpretable disease prediction is one of the major challenges faced in healthcare, especially for breast, heart, and lung cancers. This study proposes a highly structured, leakage-safe benchmarking framework for comparing con…
- DeepGB-TB: A Risk-Balanced Cross-Attention Gradient-Boosted Convolutional Network for Rapid, Interpretable Tuberculosis Screening
Zhixiang Lu, Yulong Li, Feilong Tang, Zhengyong Jiang et al. · Proceedings of the AAAI Con... · Mar 14, 2026
Large-scale tuberculosis (TB) screening is limited by the high cost and operational complexity of traditional diagnostics, creating a need for artificial-intelligence solutions. We propose DeepGB-TB, a non-invasive system that instantly ass…
- An Advanced Hybrid LSTM–XGBoost Framework for Data-Driven Hydrological Inflow Classification in Complex Dam Systems: The Case of Beni Haroun, Algeria
Nadjet Chettih, Ahlam Labdaoui, Dounia Keddari, Farah Boutouatou · Mathematical Modelling and ... · Feb 28, 2026
Accurate classification of inflow regimes supports reservoir operation and risk-informed water management, yet it remains challenging due to nonlinear hydroclimatic dynamics and temporal dependence. This study proposes an advanced hybrid math…
- Class-aware temporal and contextual contrastive framework for semi-supervised automated fault detection and diagnosis in air handling units
Seunghyeon Wang · Energy and Buildings · Feb 25, 2026
Automated Fault Detection and Diagnosis (AFDD) for Air Handling Units (AHUs) has largely relied on supervised learning, which is difficult to deploy when labeled data are scarce and fault classes are imbalanced. Existing label-efficient AFD…
- TabStruct: Measuring Structural Fidelity of Tabular Data
Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik · ICLR 2026 Oral · Jan 26, 2026
Evaluating tabular generators remains a challenging problem, as the unique causal structural prior of heterogeneous tabular data does not lend itself to intuitive human inspection. Recent work has introduced structural fidelity as a tabular…
- Tabular Learning for Biomedical Data
Laurence Liang, Veronika Pak, Zachary Yang · AITD@EurIPS 2025 Poster · Nov 18, 2025
Foundation models trained on tabular data, such as TabPFN and TabICL, have demonstrated strong classification performance on synthetic and benchmark datasets. Applications to biological datasets are emerging but remain comparatively underex…
- Differentially Private Synthetic Data via APIs 4: Tabular Data
Toan Tran, Arturs Backurs, Zinan Lin, Victor Reis et al. · Submitted to ICLR 2026 · Sep 19, 2025
Tabular data is one of the most widely used formats in practice, yet much of it remains inaccessible due to privacy concerns. Synthetic data generation with formal privacy guarantees, i.e. differential privacy (DP), offers a promising solut…
- TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields
Alan Arazi, Eilam Shapira, Roi Reichart · NeurIPS 2025 poster · Sep 18, 2025
While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees. However, recent advancements are paving the w…