Marc Ratkovic
IBD Amphi
AMU - AMSE
5-9 boulevard Maurice Bourdet
13001 Marseille
11:30 to 12:45
Nicolas Clootens : nicolas.clootens[at]univ-amu.fr
Romain Ferrali : romain.ferrali[at]univ-amu.fr
Text data pose significant challenges for statistical inference due to their high dimensionality and unstructured nature. Conventional methods based on fixed embeddings or topic models often oversimplify linguistic complexity without providing formal inferential guarantees. In this work, I integrate large language models (LLMs) with statistical inference by employing them to generate latent contexts that serve as auxiliary information. I introduce a clause function that quantifies the interaction between an observed text string and its LLM-generated latent contexts; averaging over these contexts yields string-level statistics. Under standard support and ignorability assumptions, along with additional regularity conditions, I characterize the limiting distribution of these statistics, a result that extends to averages, regression coefficients, and robust rank- and quantile-based alternatives. Applications to a two-sample test and a regression model with text-based predictors and outcomes demonstrate the utility of this framework for rigorous inference on textual data.
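As a rough illustration of the averaging step described in the abstract, the sketch below computes a string-level statistic as the mean of a clause function over a set of latent contexts. Everything in it is an assumption made for illustration: the abstract does not define the clause function, so `toy_clause` (a word-overlap share) is a hypothetical stand-in, `string_statistic` is a hypothetical helper name, and the contexts are hard-coded strings rather than LLM output.

```python
from statistics import mean


def toy_clause(text: str, context: str) -> float:
    """Hypothetical clause function (not the speaker's definition):
    the share of the text's distinct words that also appear in the context."""
    text_words = set(text.lower().split())
    context_words = set(context.lower().split())
    if not text_words:
        return 0.0
    return len(text_words & context_words) / len(text_words)


def string_statistic(text, contexts, clause=toy_clause):
    """Average the clause function over the latent contexts of one string."""
    return mean(clause(text, c) for c in contexts)


# Hard-coded stand-ins for LLM-generated latent contexts; no model is called.
contexts = ["taxes and public spending", "spending on schools"]
score = string_statistic("public spending", contexts)
print(score)
```

In this toy setup, the resulting per-string scores could then be fed into an ordinary two-sample comparison or regression; the inferential guarantees mentioned in the abstract depend on conditions that this sketch does not model.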