Awesome Tabular Deep Learning: a companion repository for "Representation Learning for Tabular Data: A Comprehensive Survey". If you use any content of this repo in your work, please cite the following BibTeX entry:
@article{jiang2025tabularsurvey,
  title   = {Representation Learning for Tabular Data: A Comprehensive Survey},
  author  = {Jun-Peng Jiang and Si-Yang Liu and Hao-Run Cai and Qile Zhou and Han-Jia Ye},
  journal = {arXiv preprint arXiv:2504.16109},
  year    = {2025}
}
Feel free to open a new issue or drop us an email if you find an interesting paper missing from our survey, and we will include it in the next version.
[04/2025] arXiv paper has been released.
[04/2025] The repository has been released.
Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications. Models for learning from tabular data have continuously evolved, with Deep Neural Networks (DNNs) recently demonstrating promising results through their capability of representation learning. In this survey, we systematically introduce the field of tabular representation learning, covering the background, challenges, and benchmarks, along with the pros and cons of using DNNs. We organize existing methods into three main categories according to their generalization capabilities: specialized, transferable, and general models.

Specialized models focus on tasks where training and evaluation occur within the same data distribution. We introduce a hierarchical taxonomy for specialized models based on the key aspects of tabular data (features, samples, and objectives) and delve into detailed strategies for obtaining high-quality feature- and sample-level representations. Transferable models are pre-trained on one or more datasets and subsequently fine-tuned on downstream tasks, leveraging knowledge acquired from homogeneous or heterogeneous sources, or even from other modalities such as vision and language. General models, also known as tabular foundation models, extend this concept further, allowing direct application to downstream tasks without additional fine-tuning. We group these general models based on the strategies used to adapt across heterogeneous datasets.

Additionally, we explore ensemble methods, which integrate the strengths of multiple tabular models. Finally, we discuss representative extensions of tabular learning, including open-environment tabular machine learning, multimodal learning with tabular data, and tabular understanding tasks.
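As a concrete illustration of the ensembling idea mentioned above (integrating the strengths of multiple tabular models), here is a minimal, dependency-free sketch that averages class probabilities from several models. The three toy "models" are stand-ins for illustration only, not any method from the survey.

```python
def average_probs(prob_lists):
    """Average per-class probability vectors produced by multiple models."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]

def ensemble_predict(models, x):
    """Each model maps a feature row to a list of class probabilities;
    the ensemble predicts the class with the highest averaged probability."""
    probs = average_probs([m(x) for m in models])
    return max(range(len(probs)), key=probs.__getitem__)

# Three toy models that disagree on a two-class problem.
models = [
    lambda x: [0.9, 0.1],
    lambda x: [0.3, 0.7],
    lambda x: [0.2, 0.8],
]
print(ensemble_predict(models, x=[1.0, 2.0]))  # prints 1
```

Probability averaging is only one simple combination rule; weighted averaging or stacking follows the same interface.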
- RTDL: A collection of papers and packages on deep learning for tabular data.
- TALENT: A comprehensive toolkit and benchmark for tabular data learning, featuring 30 deep methods, more than 10 classical methods, and 300 diverse tabular datasets.
- pytorch_tabular: A standard framework for building deep learning models for tabular data.
- pytorch-frame: A modular deep learning framework for building neural network models on heterogeneous tabular data.
- DeepTables: An easy-to-use toolkit that enables deep learning to unleash great power on tabular data.
- AutoGluon: A toolbox that automates machine learning tasks, making it easy to achieve strong predictive performance.
- ...
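Most of the toolkits above expose a scikit-learn-style fit/predict workflow. The following dependency-free sketch illustrates that pattern with a hypothetical majority-class baseline; the class name is illustrative and not a real API from any of the listed libraries.

```python
from collections import Counter

class MajorityClassBaseline:
    """Toy estimator: predicts the most frequent training label for every row."""

    def fit(self, X, y):
        # Store the majority label seen during training.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Ignore the features and return the stored majority label per row.
        return [self.majority_ for _ in X]

clf = MajorityClassBaseline().fit(X=[[0], [1], [2]], y=["a", "b", "b"])
print(clf.predict([[5], [6]]))  # prints ['b', 'b']
```

Real toolkits add preprocessing, neural architectures, and hyperparameter search behind the same two-method interface.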
TabPFN and its extensions
- TabPFN v1: TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second.
- TabPFN v2: Accurate predictions on small data with a tabular foundation model.
- TabPFN extensions.
- TabPFN-Time-Series: The Tabular Foundation Model TabPFN Outperforms Specialized Time Series Forecasting Models Based on Simple Features.
- TabICL: TabICL: A Tabular Foundation Model for In-Context Learning on Large Data.
- TICL:
- ...
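TabPFN-style models perform in-context learning: the training set is supplied as context and predictions are produced in a single forward pass, with no gradient updates. The sketch below illustrates only that interface, using a 1-nearest-neighbour lookup as a stand-in for the transformer; it is not how TabPFN itself computes predictions.

```python
def icl_predict(context_X, context_y, query_X):
    """Label each query row using its nearest context row (squared L2 distance).

    Mirrors the in-context-learning call signature: labelled context rows go in,
    query predictions come out, and nothing is "trained" or updated.
    """
    def nearest_label(q):
        dists = [sum((a - b) ** 2 for a, b in zip(row, q)) for row in context_X]
        return context_y[min(range(len(dists)), key=dists.__getitem__)]
    return [nearest_label(q) for q in query_X]

context_X = [[0.0, 0.0], [1.0, 1.0]]
context_y = [0, 1]
print(icl_predict(context_X, context_y, [[0.1, 0.2], [0.9, 1.1]]))  # prints [0, 1]
```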
Some summary repositories
* denotes that the method is a variation of TabPFN, some of which require fine-tuning for downstream tasks.
Clustering
Anomaly Detection
- Anomaly detection for tabular data with internal contrastive learning
- ADBench: Anomaly detection benchmark
- Anomaly detection of tabular data using LLMs
- Anomaly Detection with Variance Stabilized Density Estimation
- Transductive and Inductive Outlier Detection with Robust Autoencoders
- ...
Tabular Generation
- CoDi: Co-evolving contrastive diffusion models for mixed-type tabular synthesis
- Causality for tabular data synthesis: A high-order structure causal benchmark framework
- Generating new concepts with hybrid neuro-symbolic models
- ...
Interpretability
- TabNet: Attentive interpretable tabular learning
- TabTransformer: Tabular data modeling using contextual embeddings
- Revisiting deep learning models for tabular data
- Neural oblivious decision ensembles for deep learning on tabular data
- ...
Open-Environment Tabular Machine Learning
- TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling
- Drift-Resilient TabPFN: In-context learning temporal distribution shifts on tabular data
- Benchmarking Distribution Shift in Tabular Data with TableShift
- TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks
- Understanding the limits of deep tabular methods with temporal shift
- ...
Multi-modal Learning with Tabular Data
- Best of both worlds: Multimodal contrastive learning with tabular and imaging data
- Tabular insights, visual impacts: Transferring expertise from tables to images
- TIP: Tabular-image pre-training for multimodal classification with incomplete data
- ...
Tabular Understanding
- TableBank: Table benchmark for image-based table detection and recognition
- MultiModalQA: Complex question answering over text, tables and images
- Donut: Document understanding transformer without OCR
- Monkey: Image resolution and text label are important things for large multi-modal models
- mPLUG-DocOwl: Modularized multimodal large language model for document understanding
- TabPedia: Towards comprehensive visual table understanding with concept synergy
- Compositional Condition Question Answering in Tabular Understanding
- Multimodal Tabular Reasoning with Privileged Structured Information
- ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
- ...
Please refer to Awesome-Tabular-LLMs for more information.
- Table Representation Learning Workshop @ NeurIPS 2022
- Table Representation Learning Workshop @ NeurIPS 2023
- Table Representation Learning Workshop @ NeurIPS 2024
- Table Representation Learning Workshop @ ACL 2025
- DExLLM Workshop @ ICDE 2025
- LLM + Vector Data @ ICDE 2025
- Data-AI Systems (DAIS) @ ICDE 2025
- Frontiers of DE & AI @ ICDE 2025
- Foundation Models for Structured Data @ ICML 2025
This repo is modified from TALENT.
This repo is developed and maintained by Jun-Peng Jiang, Si-Yang Liu, Hao-Run Cai, Qile Zhou, and Han-Jia Ye. If you have any questions, please feel free to contact us by opening a new issue or by email:
- Jun-Peng Jiang: jiangjp@lamda.nju.edu.cn
- Si-Yang Liu: liusy@lamda.nju.edu.cn
- Hao-Run Cai: caihr@smail.nju.edu.cn
- Qile Zhou: zhouql@lamda.nju.edu.cn
- Han-Jia Ye: yehj@lamda.nju.edu.cn