Marc Ratkovic

General seminars
AMSE seminar

Marc Ratkovic

University of Mannheim
Large Language Models for Statistical Inference: Context Augmentation with Applications to the Two-Sample Problem, Regression, and Concordance
Venue

IBD Amphi

Îlot Bernard du Bois - Amphithéâtre

AMU - AMSE
5-9 boulevard Maurice Bourdet
13001 Marseille

Date(s)
Monday, March 24, 2025
11:30 to 12:45
Contact(s)

Nicolas Clootens: nicolas.clootens[at]univ-amu.fr
Romain Ferrali: romain.ferrali[at]univ-amu.fr

Abstract

Text data pose significant challenges for statistical inference due to their high dimensionality and unstructured nature. Conventional methods based on fixed embeddings or topic models often oversimplify linguistic complexity without providing formal inferential guarantees. In this work, I integrate large language models (LLMs) with statistical inference by employing them to generate latent contexts that serve as auxiliary information. I introduce a clause function that quantifies the interaction between an observed text string and its LLM-generated latent contexts and, by averaging over these contexts, obtain string-level statistics. Under standard support and ignorability assumptions along with additional regularity conditions, I characterize the limiting distribution of these statistics, a result that extends to averages, regression coefficients, and robust rank- and quantile-based alternatives. Applications to a two-sample test and a regression model with text-based predictors and outcomes demonstrate the utility of this framework for rigorous inference on textual data.
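
To make the averaging step in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the speaker's implementation: the function names (`jaccard_clause`, `string_statistic`, `permutation_test`) are hypothetical, the word-overlap clause function is a toy stand-in for whatever clause function the talk defines, and the contexts are passed in as plain strings rather than generated by an LLM. The permutation test is likewise a generic stand-in for a two-sample comparison and does not reproduce the limiting-distribution result described above.

```python
import random
from statistics import mean


def jaccard_clause(text, context):
    """Toy clause function (assumption): word-set overlap between a string
    and one context. The talk's actual clause function is not specified here."""
    a, b = set(text.lower().split()), set(context.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def string_statistic(text, contexts, clause_fn=jaccard_clause):
    """String-level statistic: the clause function averaged over the contexts.
    `contexts` would come from an LLM; here they are supplied by the caller."""
    return mean([clause_fn(text, c) for c in contexts])


def permutation_test(stats_a, stats_b, n_perm=2000, seed=0):
    """Generic two-sample permutation test on the difference in means
    (a stand-in technique, not the asymptotic test from the talk)."""
    rng = random.Random(seed)
    observed = abs(mean(stats_a) - mean(stats_b))
    pooled = list(stats_a) + list(stats_b)
    n_a = len(stats_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

In use, one would compute `string_statistic` for every string in each of two samples and pass the two resulting lists to `permutation_test`; a small p-value suggests the samples differ on the chosen statistic.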