This repository provides a comprehensive experimental framework for evaluating Large Language Models (LLMs) in the context of missing data imputation. The study investigates both performance and behavioral aspects, including hallucination effects and control mechanisms.
We evaluate four LLM families spanning different architectures:
- Mistral
- Claude
- GPT
- Gemini
These models are benchmarked against traditional and state-of-the-art imputation methods:
- k-Nearest Neighbors (kNN)
- Multivariate Imputation by Chained Equations (MICE)
- missForest
- Stacked Autoencoder Imputation (SAEI)
- TabPFN
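As one illustration of how such a baseline runs, the kNN method can be invoked through scikit-learn's `KNNImputer`. This is a minimal sketch on toy data, not the repository's actual datasets or hyperparameters:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with missing entries encoded as NaN (illustrative values only)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [4.0, 5.0, 6.0],
])

# Each missing entry is replaced by the average of that feature over the
# 2 nearest rows (distance computed on the observed coordinates).
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```

MICE can be sketched analogously with scikit-learn's `IterativeImputer`; the LLM-based methods instead serialize each incomplete row into a prompt.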
Experiments are conducted under the three standard missing data mechanisms:
- Missing Completely at Random (MCAR)
- Missing at Random (MAR)
- Missing Not at Random (MNAR)
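The three mechanisms can be reproduced as missingness masks over a complete matrix. A minimal NumPy sketch, where the 0.2 rate, thresholds, and column choices are illustrative rather than the experiment's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # complete data, two features

# MCAR: each entry is missing with a fixed probability,
# independent of any value in the data.
mcar_mask = rng.random(X.shape) < 0.2

# MAR: missingness in column 1 depends only on the *observed* column 0.
mar_mask = np.zeros(X.shape, dtype=bool)
mar_mask[:, 1] = X[:, 0] > 0.5

# MNAR: missingness in column 1 depends on its own (unobserved) value.
mnar_mask = np.zeros(X.shape, dtype=bool)
mnar_mask[:, 1] = X[:, 1] > 0.5

# Apply a mask by overwriting masked entries with NaN.
X_mcar = X.copy()
X_mcar[mcar_mask] = np.nan
```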
Empirical results indicate that Claude 4.5 Sonnet and Gemini 3.0 Flash consistently outperform baseline methods across all missingness mechanisms.
In terms of computational efficiency, LLM-based approaches require significantly more resources than traditional imputation methods.
We recommend creating and activating a dedicated virtual environment (named `LLM` below):

```bash
python -m venv LLM
source LLM/bin/activate    # On Linux/macOS
.\LLM\Scripts\activate     # On Windows
```

Then install the required dependencies:

```bash
pip install -r requirements.txt
```

LLMs introduce a substantial computational overhead compared to classical methods. This includes:
- Higher latency due to API calls
- Increased monetary cost (depending on provider)
- Dependency on external services
These aspects should be considered when deploying LLM-based imputation in practice.
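A rough way to weigh the monetary aspect before deployment is a back-of-envelope token budget. The function below is a hypothetical sketch: the token counts and per-token price are placeholders, not quotes from any provider or from this repository's experiments:

```python
def estimate_cost(n_missing_cells: int,
                  tokens_per_call: int = 300,
                  calls_per_cell: int = 1,
                  price_per_1k_tokens: float = 0.003) -> float:
    """Rough API cost estimate for LLM-based imputation.

    All defaults are illustrative placeholders; substitute your
    provider's actual pricing and measured prompt sizes.
    """
    total_tokens = n_missing_cells * calls_per_cell * tokens_per_call
    return total_tokens / 1000 * price_per_1k_tokens


# e.g. a dataset with 10,000 missing cells under the placeholder prices
print(f"${estimate_cost(10_000):.2f}")  # → $9.00
```

Batching several missing cells into one prompt lowers `calls_per_cell` and is the main practical lever on both cost and latency.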
This work has been submitted to IEEE Transactions on Knowledge and Data Engineering. Further details will be provided upon publication.



