DSP alum and University of Chicago professor Rebecca Willett is the inaugural recipient of the 2024 SIAM Activity Group on Data Science Career Prize. Becca received the prize for her work in physics-informed machine learning and data science and for her service and leadership in the data science community. From her pioneering work on photon-limited imaging to her analysis of generalization in overparameterized neural networks, Becca's work encompasses both the mathematical and statistical foundations of data science and the structure and context of problems from the natural sciences. Her research in physics-informed machine learning has contributed to bridging the gap between theoretical research and practical applications.

Becca is a professor of statistics and computer science and the Director of AI in the Data Science Institute at the University of Chicago.

Three DSP group papers have been accepted to the Findings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024) in Miami, Florida:

  1. MalAlgoQA: A Pedagogical Approach for Evaluating Counterfactual Reasoning Abilities by Naiming Liu, Shashank Sonkar, MyCo Le, and Richard G. Baraniuk
  2. Pedagogical Alignment of Large Language Models by Shashank Sonkar, Kangqi Ni, Sapana Chaudhary, and Richard G. Baraniuk
  3. The Student Data Paradox: Examining the Regressive Side Effects of Training LLMs for Personalized Learning by Shashank Sonkar, Naiming Liu, and Richard G. Baraniuk

To help organize the growing literature on AI self-consuming feedback loops, we have launched a "Self-Consuming AI Resources" archive at dsp.rice.edu/ai-loops.

In the 2000s, the Rice DSP group managed a similar archive for the field of compressive sensing, and it grew to several thousand papers that were used by a large community of researchers. We're hoping that this archive can be similarly useful.

We are currently in the process of refining the materials on the page. We would greatly appreciate it if you would recommend missing or new literature. There is also a ton of missing media coverage, and we are slowly working toward gathering it all.

Email us at selfconsumingAI@gmail.com to add your latest work or that of others in this fast-moving area!

Self-Improving Diffusion Models with Synthetic Data

Sina Alemohammad, Ahmed Imtiaz Humayun, Richard Baraniuk
Rice University
Shruti Agarwal, John Collomosse
Adobe Research

arxiv.org/abs/2408.16333, 30 August 2024

Abstract: The artificial intelligence (AI) world is running out of real data for training increasingly large generative models, resulting in accelerating pressure to train on synthetic data. Unfortunately, training new generative models with synthetic data from current or past generation models creates an autophagous (self-consuming) loop that degrades the quality and/or diversity of the synthetic data in what has been termed model autophagy disorder (MAD) and model collapse. Current thinking around model autophagy recommends that synthetic data is to be avoided for model training lest the system deteriorate into MADness. In this paper, we take a different tack that treats synthetic data differently from real data. Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models that uses self-synthesized data to provide negative guidance during the generation process to steer a model's generative process away from the non-ideal synthetic data manifold and towards the real data distribution. We demonstrate that SIMS is capable of self-improvement; it establishes new records based on the Fréchet inception distance (FID) metric for CIFAR-10 and ImageNet-64 generation and achieves competitive results on FFHQ-64 and ImageNet-512. Moreover, SIMS is, to the best of our knowledge, the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD. As a bonus, SIMS can adjust a diffusion model's synthetic data distribution to match any desired in-domain target distribution to help mitigate biases and ensure fairness.
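As a rough illustration of the negative-guidance idea described in the abstract, the sketch below combines two score estimates by extrapolating from a synthetic-data model's score past the base model's score. The function names, the guidance weight `w`, the combination rule, and the 1-D Gaussian toy scores are all assumptions made for illustration; they are not taken from the paper or its code.

```python
import numpy as np

def gaussian_score(x, mean, std=1.0):
    """Score (gradient of the log density) of a 1-D Gaussian N(mean, std^2)."""
    return -(x - mean) / std**2

def negative_guidance_score(score_base, score_synth, w):
    """Extrapolate away from the synthetic-data score.

    Hypothetical combination rule: (1 + w) * base - w * synth.
    With w = 0 this reduces to the base score.
    """
    return (1.0 + w) * score_base - w * score_synth

# Toy setup (all values are illustrative assumptions):
# the real data has mean 0.0, the base model learned mean 0.5,
# and a model fine-tuned on the base model's samples drifted to 1.0.
x = np.linspace(-2.0, 2.0, 9)
s_base = gaussian_score(x, mean=0.5)
s_synth = gaussian_score(x, mean=1.0)
s_guided = negative_guidance_score(s_base, s_synth, w=1.0)

# The guided score vanishes at the mode of the guided density.
guided_mode = x[np.argmin(np.abs(s_guided))]
print(guided_mode)
```

In this toy, the drift from the base model to the synthetic-data model happens to mirror the drift from the real data to the base model, so a weight of `w = 1.0` moves the mode back to the real mean. Real diffusion models would use learned score networks evaluated at each noise level, and a suitable weight would have to be chosen empirically.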

The figure above illustrates that SIMS simultaneously improves diffusion modeling and synthesis performance while acting as a prophylactic against Model Autophagy Disorder (MAD). First row: Samples from a base diffusion model (EDM2-S) trained on 1.28M real images from the ImageNet-512 dataset (Fréchet inception distance, FID = 2.56). Second row: Samples from the base model after fine-tuning with 1.5M images synthesized from the base model, which degrades synthesis performance and pushes the model towards MADness (model collapse) (FID = 6.07). Third row: Samples from the base model after applying SIMS using the same self-generated synthetic data as in the second row (FID = 1.73).
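For readers unfamiliar with the FID numbers quoted above, the metric is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images: the squared distance between the means plus Tr(C_a + C_b - 2 (C_a C_b)^{1/2}). The sketch below computes that quantity for arbitrary feature arrays. It assumes the features have already been extracted (FID uses an Inception network for this step, which is omitted here), and the function name and toy data are hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim),
    with feature_dim >= 2.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices (up to numerical error, hence the clip).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Toy features standing in for network embeddings (illustrative only).
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 4))
shifted = real + 3.0  # same covariance, mean moved by 3 in each dimension

print(frechet_distance(real, real))
print(frechet_distance(real, shifted))
```

The first value should be close to 0, and the second close to 36 (the squared norm of the mean shift), since the two feature sets share the same covariance. Lower values indicate that the two feature distributions are closer.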

Rice DSP graduate student Lorenzo Luzi successfully defended his PhD thesis entitled "Overparameterization and double descent in PCA, GANs, and Diffusion models".

Abstract: I study overparameterization in PCA and generative adversarial networks (GANs), and generative models in general. Specifically, I study models that can interpolate the training data. I show that overparameterization can improve generalization performance and accelerate the training process in several contexts. I study the generalization error as a function of latent space dimension and identify two main behaviors, depending on the learning setting. First, I show that overparameterized generative models that learn distributions by minimizing a metric or f-divergence do not exhibit double descent in generalization errors; specifically, all the interpolating solutions achieve the same generalization error. Second, I develop a new pseudo-supervised learning approach for GANs and diffusion models where the training utilizes pairs of fabricated (noise) inputs in conjunction with real output samples. Our pseudo-supervised setting exhibits double descent (and in some cases, triple descent) of generalization errors. I combine pseudo-supervision with overparameterization (i.e., overly large latent space dimension) to accelerate training while achieving generalization performance better than, or close to, that obtained without pseudo-supervision. While my analysis focuses mostly on linear models, I also apply important insights for improving generalization of nonlinear, multilayer GANs. For the diffusion models, we see that pseudo-supervised samples can improve both performance and convergence speed of the learning algorithm.
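To make the terms "interpolation" and "double descent" concrete, here is a minimal sketch using minimum-norm linear regression on random features. This is a standard textbook setting, not the PCA, GAN, or diffusion setting analyzed in the thesis, and every size, noise level, and name below is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_total = 40, 500, 120

# Ground-truth linear model over d_total features (toy assumption).
beta = rng.normal(size=d_total) / np.sqrt(d_total)
X_train = rng.normal(size=(n_train, d_total))
X_test = rng.normal(size=(n_test, d_total))
y_train = X_train @ beta + 0.1 * rng.normal(size=n_train)
y_test = X_test @ beta

def fit_min_norm(X, y):
    """Minimum-norm least-squares solution via the pseudoinverse."""
    return np.linalg.pinv(X) @ y

train_mse, test_mse = {}, {}
for p in (10, 20, 40, 80, 120):
    # Fit using only the first p features; p > n_train is overparameterized.
    w = fit_min_norm(X_train[:, :p], y_train)
    train_mse[p] = float(np.mean((X_train[:, :p] @ w - y_train) ** 2))
    test_mse[p] = float(np.mean((X_test[:, :p] @ w - y_test) ** 2))
    print(p, train_mse[p], test_mse[p])
```

Once the number of features `p` exceeds `n_train`, the model interpolates the training data and the training error drops to numerical zero. In setups like this one, the test error typically peaks near `p = n_train` and then falls again as `p` grows, which is the double descent shape; the exact numbers depend on the random seed.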