Representation Learning for Tabular Data: A Comprehensive Survey

Awesome Tabular Deep Learning resources for "Representation Learning for Tabular Data: A Comprehensive Survey". If you use any content of this repo in your work, please cite the following bib entry:

```
@article{jiang2025tabularsurvey,
  title   = {Representation Learning for Tabular Data: A Comprehensive Survey},
  author  = {Jun-Peng Jiang and Si-Yang Liu and Hao-Run Cai and Qile Zhou and Han-Jia Ye},
  journal = {arXiv preprint arXiv:2504.16109},
  year    = {2025}
}
```

Feel free to create new issues or drop us an email if you find any interesting papers missing from our survey; we will include them in the next version.

Updates

[04/2025] The arXiv paper has been released.

[04/2025] The repository has been released.

Introduction

Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications. Models for learning from tabular data have continuously evolved, with Deep Neural Networks (DNNs) recently demonstrating promising results through their capability for representation learning. In this survey, we systematically introduce the field of tabular representation learning, covering the background, challenges, and benchmarks, along with the pros and cons of using DNNs.

We organize existing methods into three main categories according to their generalization capabilities: specialized, transferable, and general models. Specialized models focus on tasks where training and evaluation occur within the same data distribution. We introduce a hierarchical taxonomy for specialized models based on the key aspects of tabular data (features, samples, and objectives) and delve into detailed strategies for obtaining high-quality feature- and sample-level representations. Transferable models are pre-trained on one or more datasets and subsequently fine-tuned on downstream tasks, leveraging knowledge acquired from homogeneous or heterogeneous sources, or even from other modalities such as vision and language. General models, also known as tabular foundation models, extend this concept further, allowing direct application to downstream tasks without additional fine-tuning; we group these general models by the strategies they use to adapt across heterogeneous datasets.

Additionally, we explore ensemble methods, which integrate the strengths of multiple tabular models. Finally, we discuss representative extensions of tabular learning, including open-environment tabular machine learning, multimodal learning with tabular data, and tabular understanding tasks.

Some Basic Resources

Benchmarks

| Date | Name | Paper | Publication | Code |
|------|------|-------|-------------|------|
| 2025 | TabArena | TabArena: A Living Benchmark for Machine Learning on Tabular Data | CoRR | Code |
| 2025 | MLE-Bench | MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering | ICLR | Code |
| 2025 | TabReD | TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks | ICLR | Code |
| 2024 | Data-Centric Benchmark | A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data | NeurIPS | Code |
| 2024 | Better-by-Default | Better by Default: Strong Pre-Tuned MLPs and Boosted Trees on Tabular Data | NeurIPS | Code |
| 2024 | LAMDA-Tabular-Bench | A Closer Look at Deep Learning Methods on Tabular Datasets | CoRR | Code |
| 2024 | DMLR-ICLR24-Datasets-for-Benchmarking | Towards Quantifying the Effect of Datasets for Benchmarking: A Look at Tabular Machine Learning | DMLR | Code |
| 2023 | TableShift | Benchmarking Distribution Shift in Tabular Data with TableShift | NeurIPS | Code |
| 2023 | TabZilla | When Do Neural Nets Outperform Boosted Trees on Tabular Data? | NeurIPS | Code |
| 2023 | EncoderBenchmarking | A benchmark of categorical encoders for binary classification | NeurIPS | Code |
| 2022 | Grinsztajn et al. Benchmark | Why do tree-based models still outperform deep learning on tabular data? | NeurIPS | Code |
| 2021 | RTDL | Revisiting Deep Learning Models for Tabular Data | NeurIPS | Code |
| 2021 | WellTunedSimpleNets | Well-tuned Simple Nets Excel on Tabular Datasets | NeurIPS | Code |

Awesome Deep Tabular Toolboxes

  • RTDL: A collection of papers and packages on deep learning for tabular data.
  • TALENT: A comprehensive toolkit and benchmark for tabular data learning, featuring 30 deep methods, more than 10 classical methods, and 300 diverse tabular datasets.
  • pytorch_tabular: A standard framework for building deep learning models on tabular data.
  • pytorch-frame: A modular deep learning framework for building neural network models on heterogeneous tabular data.
  • DeepTables: An easy-to-use toolkit that brings the power of deep learning to tabular data.
  • AutoGluon: A toolbox that automates machine learning tasks and makes it easy to achieve strong predictive performance (a minimal usage sketch follows this list).
  • ...
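As a quick taste of these toolboxes, below is a minimal AutoGluon sketch. It uses AutoGluon's standard tabular API (`TabularDataset`, `TabularPredictor`), but the file names and the `target` column are hypothetical placeholders, and defaults vary across versions, so treat this as a sketch rather than a reference:

```python
# pip install autogluon.tabular
from autogluon.tabular import TabularDataset, TabularPredictor

# Hypothetical CSVs: any table with a label column works.
train_data = TabularDataset("train.csv")
predictor = TabularPredictor(label="target").fit(train_data)

test_data = TabularDataset("test.csv")
predictions = predictor.predict(test_data)          # per-row predictions
print(predictor.leaderboard(test_data))             # scores of each fitted model
```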

Other Awesome Repositories

  • TabPFN and its extensions
  • Some summary repositories

Specialized Methods

| Date | Name | Paper | Publication | Code |
|------|------|-------|-------------|------|
| 2025 | ModernNCA | Revisiting Nearest Neighbor for Tabular Data: A Deep Tabular Baseline Two Decades Later | ICLR | Code |
| 2025 | TabM | TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling | ICLR | Code |
| 2024 | ExcelFormer | Can a deep learning model be a sure bet for tabular prediction? | KDD | Code |
| 2024 | AMFormer | Arithmetic feature interaction is necessary for deep tabular learning | AAAI | Code |
| 2024 | GRANDE | GRANDE: Gradient-based decision tree ensembles for tabular data | ICLR | Code |
| 2024 | DOFEN | DOFEN: Deep Oblivious Forest ENsemble | NeurIPS | Code |
| 2024 | RealMLP | Better by default: Strong pre-tuned MLPs and boosted trees on tabular data | NeurIPS | Code |
| 2024 | BiSHop | BiSHop: Bi-directional cellular learning for tabular data with generalized sparse modern Hopfield model | ICML | Code |
| 2024 | SwitchTab | SwitchTab: Switched autoencoders are effective tabular learners | AAAI | |
| 2024 | PTaRL | PTaRL: Prototype-based tabular representation learning via space calibration | ICLR | Code |
| 2024 | TabR | TabR: Tabular deep learning meets nearest neighbors in 2023 | ICLR | Code |
| 2023 | | An inductive bias for tabular deep learning | NeurIPS | |
| 2023 | TabRet | TabRet: Pre-training transformer-based tabular models for unseen columns | CoRR | Code |
| 2023 | Trompt | Trompt: Towards a better deep neural network for tabular data | ICML | |
| 2023 | TANGOS | TANGOS: Regularizing tabular neural networks through gradient orthogonalization and specialization | ICLR | Code |
| 2022 | MLP-PLR | On embeddings for numerical features in tabular deep learning | NeurIPS | Code |
| 2022 | SAINT | SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training | NeurIPS WS | Code |
| 2022 | DANets | DANets: Deep abstract networks for tabular data classification and regression | AAAI | Code |
| 2022 | DNNR | DNNR: Differential nearest neighbors regression | ICML | Code |
| 2022 | Hopular | Hopular: Modern Hopfield networks for tabular data | CoRR | Code |
| 2022 | LSPIN | Locally Sparse Neural Networks for Tabular Biomedical Data | ICML | Code |
| 2021 | Net-DNF | Net-DNF: Effective Deep Modeling of Tabular Data | ICLR | |
| 2021 | FT-Transformer | Revisiting deep learning models for tabular data | NeurIPS | Code |
| 2021 | TabNet | TabNet: Attentive interpretable tabular learning | AAAI | Code |
| 2021 | DCNv2 | DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems | WWW | Code |
| 2021 | | Well-tuned simple nets excel on tabular datasets | NeurIPS | Code |
| 2021 | NPT | Self-attention between datapoints: Going beyond individual input-output pairs in deep learning | NeurIPS | Code |
| 2020 | | Survey on categorical data for neural networks | Journal of Big Data | |
| 2020 | TabTransformer | TabTransformer: Tabular data modeling using contextual embeddings | CoRR | Code |
| 2020 | GrowNet | Gradient boosting neural networks: GrowNet | CoRR | Code |
| 2020 | NODE | Neural oblivious decision ensembles for deep learning on tabular data | ICLR | Code |
| 2020 | STG | Feature Selection using Stochastic Gates | ICML | Code |
| 2019 | AutoInt | AutoInt: Automatic feature interaction learning via self-attentive neural networks | CIKM | Code |
| 2018 | RLNs | Regularization learning networks: Deep learning for tabular datasets | NeurIPS | Code |
| 2017 | SNN | Self-normalizing neural networks | NIPS | Code |

Transferable Methods

| Date | Name | Paper | Publication | Code |
|------|------|-------|-------------|------|
| 2025 | | A survey on self-supervised learning for non-sequential tabular data | Machine Learning | Code |
| 2025 | Tab2Visual | Tab2Visual: Overcoming Limited Data in Tabular Data Classification Using Deep Learning with Visual Representations | CoRR | |
| 2024 | LFR | Self-supervised representation learning from random data projectors | ICLR | |
| 2024 | UniTabE | UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science | ICLR | |
| 2024 | CM2 | Towards cross-table masked pretraining for web data mining | WWW | Code |
| 2024 | TP-BERTa | Making pre-trained language models great on tabular prediction | ICLR | Code |
| 2024 | CARTE | CARTE: Pretraining and transfer for tabular learning | ICML | Code |
| 2024 | FeatLLM | Large language models can automatically engineer features for few-shot tabular learning | ICML | Code |
| 2024 | LM-IGTD | LM-IGTD: A 2D image generator for low-dimensional and mixed-type tabular data to leverage the potential of convolutional neural networks | CoRR | |
| 2023 | DoRA | DoRA: Domain-based self-supervised learning framework for low-resource real estate appraisal | CIKM | Code |
| 2023 | | Transfer learning with deep tabular models | ICLR | |
| 2023 | ReConTab | ReConTab: Regularized contrastive representation learning for tabular data | CoRR | |
| 2023 | TabRet | TabRet: Pre-training transformer-based tabular models for unseen columns | CoRR | Code |
| 2023 | ORCA | Cross-modal fine-tuning: Align then refine | ICML | Code |
| 2023 | TabToken | Unlocking the transferability of tokens in deep models for tabular data | CoRR | |
| 2023 | XTab | XTab: Cross-table pretraining for tabular transformers | ICML | Code |
| 2023 | Meta-Transformer | Meta-Transformer: A unified framework for multimodal learning | CoRR | Code |
| 2023 | Binder | Binding language models in symbolic languages | ICLR | Code |
| 2023 | CAAFE | Large language models for automated data science: Introducing CAAFE for context-aware automated feature engineering | NeurIPS | Code |
| 2023 | TaPTaP | Generative table pre-training empowers models for tabular prediction | EMNLP | Code |
| 2023 | TabLLM | TabLLM: Few-shot classification of tabular data with large language models | AISTATS | Code |
| 2023 | UniPredict | UniPredict: Large language models are universal tabular predictors | CoRR | |
| 2023 | TablEye | TablEye: Seeing small tables through the lens of images | CoRR | |
| 2022 | | Revisiting pretraining objectives for tabular deep learning | CoRR | Code |
| 2022 | SEFS | Self-supervision enhanced feature selection with correlated gates | ICLR | Code |
| 2022 | MET | MET: Masked encoding for tabular data | CoRR | |
| 2022 | SAINT | SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training | NeurIPS WS | Code |
| 2022 | SCARF | SCARF: Self-supervised contrastive learning using random feature corruption | ICLR | |
| 2022 | STab | STab: Self-supervised learning for tabular data | NeurIPS WS | |
| 2022 | DEN | Distribution embedding networks for generalization from a diverse set of classification tasks | | |
| 2022 | TransTab | TransTab: Learning transferable tabular transformers across tables | NeurIPS | Code |
| 2022 | PTab | PTab: Using the pre-trained language model for modeling tabular data | CoRR | |
| 2022 | LIFT | LIFT: Language-interfaced fine-tuning for non-language machine learning tasks | NeurIPS | Code |
| 2021 | SubTab | SubTab: Subsetting features of tabular data for self-supervised representation learning | NeurIPS | Code |
| 2021 | DACL | Towards domain-agnostic contrastive learning | ICML | |
| 2021 | IGTD | Converting tabular data into images for deep learning with convolutional neural networks | Scientific Reports | Code |
| 2020 | VIME | VIME: Extending the success of self- and semi-supervised learning to tabular domain | NeurIPS | Code |
| 2020 | | Meta-learning from tasks with heterogeneous attribute spaces | NeurIPS | |
| 2020 | TAC | A novel method for classification of tabular data using convolutional neural networks | bioRxiv | |
| 2019 | Super-TML | SuperTML: Two-dimensional word embedding for the precognition on structured tabular data | CVPR WS | |

General Methods

| Date | Name | Paper | Publication | Code |
|------|------|-------|-------------|------|
| 2025 | Beta* | TabPFN unleashed: A scalable and effective solution to tabular classification problems | ICML | |
| 2025 | MotherNet | MotherNet: Fast Training and Inference via Hyper-Network Transformers | ICLR | Code |
| 2025 | TabPFN v2 | Accurate predictions on small data with a tabular foundation model | Nature | Code |
| 2025 | TabForestPFN* | Fine-tuned in-context learning transformers are excellent tabular data classifiers | CoRR | |
| 2025 | APT* | Zero-shot meta-learning for tabular prediction tasks with adversarially pre-trained transformer | CoRR | Code |
| 2025 | TabICL* | TabICL: A tabular foundation model for in-context learning on large data | ICML | Code |
| 2025 | EquiTabPFN* | EquiTabPFN: Target permutation equivariant prior fitted networks | CoRR | |
| 2025 | * | Scalable in-context learning on tabular data via retrieval-augmented large language models | CoRR | |
| 2024 | HyperFast | HyperFast: Instant classification for tabular data | AAAI | Code |
| 2024 | TabDPT* | TabDPT: Scaling tabular foundation models | CoRR | Code |
| 2024 | MIXTUREPFN* | Mixture of in-context prompters for tabular PFNs | CoRR | |
| 2024 | LoCalPFN* | Retrieval & fine-tuning for in-context tabular models | NeurIPS | Code |
| 2024 | LE-TabPFN* | Towards localization via data embedding for TabPFN | NeurIPS WS | |
| 2024 | TabFlex* | TabFlex: Scaling tabular learning to millions with linear attention | NeurIPS WS | |
| 2024 | * | Exploration of autoregressive models for in-context learning on tabular data | NeurIPS WS | |
| 2024 | TabuLa-8B | Large scale transfer learning for tabular data via language modeling | NeurIPS | Code |
| 2024 | GTL | From supervised to generative: A novel paradigm for tabular deep learning with large language models | KDD | Code |
| 2024 | MediTab | MediTab: Scaling medical tabular data predictors via data consolidation, enrichment, and refinement | IJCAI | |
| 2023 | TabPTM | Training-free generalization on heterogeneous tabular data via meta-representation | CoRR | |
| 2023 | TabPFN | TabPFN: A transformer that solves small tabular classification problems in a second | ICLR | Code |

* denotes that the method is a variant of TabPFN; some of these require fine-tuning for downstream tasks.
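To make the "no additional fine-tuning" idea concrete, here is a minimal sketch using the `tabpfn` package's scikit-learn-style interface: `fit()` only stores the training set as context, and prediction is a single forward pass of the pre-trained transformer. Constructor arguments and dataset-size limits vary across versions, so treat this as a sketch rather than a reference:

```python
# pip install tabpfn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()          # pre-trained model; no gradient updates below
clf.fit(X_train, y_train)         # stores the training rows as in-context examples
print(clf.predict(X_test)[:10])   # one forward pass over the test rows
```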

Ensemble Methods

| Date | Name | Paper | Publication | Code |
|------|------|-------|-------------|------|
| 2025 | TabM | TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling | ICLR | Code |
| 2025 | TabPFN v2 | Accurate predictions on small data with a tabular foundation model | Nature | Code |
| 2025 | Beta | TabPFN unleashed: A scalable and effective solution to tabular classification problems | CoRR | |
| 2025 | LLM-Boost, PFN-Boost | Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes | CoRR | Code |
| 2024 | HyperFast | HyperFast: Instant classification for tabular data | AAAI | Code |
| 2024 | GRANDE | GRANDE: Gradient-based decision tree ensembles for tabular data | ICLR | Code |
| 2023 | TabPTM | Training-free generalization on heterogeneous tabular data via meta-representation | CoRR | |
| 2023 | TabPFN | TabPFN: A transformer that solves small tabular classification problems in a second | ICLR | Code |
| 2020 | TabTransformer | TabTransformer: Tabular data modeling using contextual embeddings | CoRR | Code |
| 2020 | GrowNet | Gradient boosting neural networks: GrowNet | CoRR | Code |
| 2020 | NODE | Neural oblivious decision ensembles for deep learning on tabular data | ICLR | Code |

Extensions

Clustering

Anomaly Detection

Tabular Generation

Interpretability

Open-Environment Tabular Machine Learning

Multi-modal Learning with Tabular Data

Tabular Understanding

Please refer to Awesome-Tabular-LLMs for more information.

Workshops

Acknowledgment

This repo is modified from TALENT.

Correspondence

This repo is developed and maintained by Jun-Peng Jiang, Si-Yang Liu, Hao-Run Cai, Qile Zhou, and Han-Jia Ye. If you have any questions, please feel free to contact us by opening a new issue or by email.
