William Brannon

I’m a researcher in AI and machine learning, working on LLMs, data-centric AI, and computational methods for understanding how people form and change their opinions. I recently completed my PhD at MIT, where my work combined machine learning, causal inference, and computational social science.

My research has three main threads:

Data-centric AI: Large language models, the role and provenance of training data, and how we evaluate models in light of their data. This includes serving as one of the leads of the Data Provenance Initiative, which audits AI training datasets and builds tools for licensing, consent, and transparency.
Causal AI: Developing ML-based methods for causal inference and treatment effect estimation, especially using LLMs to estimate heterogeneous treatment effects in randomized experiments. Most recently, I’ve applied these methods to persuasion experiments, using LLMs as opinion models to simulate opinion change across different audiences.
Computational social science: Persuasion, media ecosystems, and social networks. I’ve worked on news-cycle dynamics across media platforms, graph-aware architectures and training objectives for social network data, and LLM-powered tools like AudienceView and the Bridging Dictionary to help journalists interpret audience feedback and politically charged language.

Methodologically, I work with LLM post-training and inference techniques, randomized experiments and other causal-inference tools, and representation learning methods for text and network data. This work uses modern Python ML stacks (PyTorch, Hugging Face) on cloud and HPC infrastructure. Earlier in my career, I spent several years as a data scientist in U.S. politics (at the DNC, the 2012 presidential campaign, and the Analyst Institute), which still shapes how I think about experimentation, stakeholders, and deployed models.

If you’re interested in these areas and want to chat about collaborations or research/ML roles, please get in touch! You can reach me by email: will.brannon@gmail.com. For more info, you can also download my CV or check out my publications.

news

May 13, 2025	I successfully defended my PhD dissertation, “Language Models as Opinion Models: Techniques and Applications,” earlier today! The dissertation is not yet available online, but it and preprints of the new work it contains will be posted shortly.
Apr 26, 2025	Our paper “Bridging the Data Provenance Gap Across Text, Speech and Video” appeared today at ICLR 2025!
Jan 22, 2025	ICLR 2025 has accepted our new paper “Bridging the Data Provenance Gap Across Text, Speech and Video”! This paper is the third phase of work in the Data Provenance Initiative.
Dec 11, 2024	The latest Data Provenance Initiative paper, “Consent in Crisis: The Rapid Decline of the AI Data Commons”, appeared today at NeurIPS 2024.
Nov 12, 2024	Our paper “On the Relationship between Truth and Political Bias in Language Models” appeared as a main-conference poster at EMNLP 2024!

selected publications

Nat. Mach. Intel.
A Large-Scale Audit of Dataset Licensing and Attribution in AI

Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, and Sara Hooker

Nature Machine Intelligence, Aug 2024

The race to train language models on vast, diverse, and inconsistently documented datasets raises pressing legal and ethical concerns. To improve data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1,800+ text datasets. We develop tools and standards to trace the lineage of these datasets, including their source, creators, licenses, and subsequent use. Our landscape analysis highlights sharp divides in the composition and focus of data licensed for commercial use. Important categories including low resource languages, creative tasks, and novel synthetic data all tend to be restrictively licensed. We observe frequent miscategorization of licenses on popular dataset hosting sites, with license omission rates of 70%+ and error rates of 50%+. This highlights a crisis in misattribution and informed use of popular datasets driving many recent breakthroughs. Our analysis of data sources also elucidates the application of copyright law and fair use to finetuning data. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our audit, with an interactive UI, the Data Provenance Explorer, to enable practitioners to trace and filter on data provenance for the most popular finetuning data collections: www.dataprovenance.org.
@article{longpreLargeScaleAuditDataset2024, title = {A Large-Scale Audit of Dataset Licensing and Attribution in {AI}}, author = {Longpre, Shayne and Mahari, Robert and Chen, Anthony and {Obeng-Marnu}, Naana and Sileo, Damien and Brannon, William and Muennighoff, Niklas and Khazam, Nathan and Kabbara, Jad and Perisetla, Kartik and Wu, Xinyi and Shippole, Enrico and Bollacker, Kurt and Wu, Tongshuang and Villa, Luis and Pentland, Sandy and Hooker, Sara}, journal = {Nature Machine Intelligence}, volume = {6}, number = {8}, pages = {975--987}, publisher = {Nature Publishing Group}, date = {2024-08-30}, year = {2024}, month = aug, doi = {10.1038/s42256-024-00878-8}, eprint = {2310.16787}, primaryclass = {cs}, archiveprefix = {arXiv}, url = {https://www.nature.com/articles/s42256-024-00878-8} }
TextGraphs
ConGraT: Self-Supervised Contrastive Pretraining for Joint Graph and Text Embeddings

William Brannon, Suyash Fulay, Hang Jiang, Wonjune Kang, Brandon Roy, Deb Roy, and Jad Kabbara

TextGraphs at ACL, Aug 2024

Learning on text-attributed graphs (TAGs), in which nodes are associated with one or more texts, has been the subject of much recent work. However, most approaches tend to make strong assumptions about the downstream task of interest, are reliant on hand-labeled data, or fail to equally balance the importance of both text and graph representations. In this work, we propose Contrastive Graph-Text pretraining (ConGraT), a general, self-supervised approach for jointly learning separate representations of texts and nodes in a TAG. Our method trains a language model (LM) and a graph neural network (GNN) to align their representations in a common latent space using a batch-wise contrastive learning objective inspired by CLIP. We further propose an extension to the CLIP objective that leverages graph structure to incorporate information about inter-node similarity. Extensive experiments demonstrate that ConGraT outperforms baselines on various downstream tasks, including node and text category classification, link prediction, and language modeling. Finally, we present an application of our method to community detection in social graphs, which enables finding more textually grounded communities, rather than purely graph-based ones. Code and certain datasets are available at https://github.com/wwbrannon/congrat.
@inproceedings{brannonConGraTSelfSupervisedContrastive2024, keywords = {workshop}, title = {{C}on{G}ra{T}: Self-Supervised Contrastive Pretraining for Joint Graph and Text Embeddings}, author = {Brannon, William and Fulay, Suyash and Jiang, Hang and Kang, Wonjune and Roy, Brandon and Roy, Deb and Kabbara, Jad}, booktitle = {Proceedings of the Seventeenth Workshop on Graph-Based Methods for Natural Language Processing ({T}ext{G}raphs-17)}, editor = {Ustalov, Dmitry and Gao, Yanjun and Panchenko, Alexander and Tutubalina, Elena and Nikishina, Irina and Ramesh, Arti and Sakhovskiy, Andrey and Usbeck, Ricardo and Penn, Gerald and Valentino, Marco}, pages = {19--39}, publisher = {Association for Computational Linguistics}, address = {Bangkok, Thailand}, date = {2024-08-15}, year = {2024}, month = aug, eprint = {2305.14321}, primaryclass = {cs}, archiveprefix = {arXiv}, url = {https://aclanthology.org/2024.textgraphs-1.2} }
SciRep
The Speed of News in Twitter (X) versus Radio

William Brannon and Deb Roy

Scientific Reports, May 2024

The rapid evolution of the Internet is reshaping the media landscape, with frequent claims of an accelerated and increasingly outraged news cycle. We test these claims empirically, investigating the dynamics of news spread, decay, and sentiment on Twitter (now known as X) compared to talk radio. Analyzing 2019–2021 data including 517,000 hours of radio content and 26.6 million tweets by elite journalists, politicians, and general users, we identified 1694 news events. We find that news on Twitter circulates faster, fades faster, and is more negative and outraged compared to radio, with Twitter outrage also more short-lived. These patterns are consistent across various user types and robustness checks. Our results illustrate an important way social media may influence traditional media: framing and agenda-setting simply by speaking first. As journalism evolves with these media, news audiences may encounter faster shifts in focus, less attention to each news event, and much more negativity and outrage.
@article{brannonSpeedNewsTwitter2024, title = {The Speed of News in {T}witter ({X}) versus Radio}, author = {Brannon, William and Roy, Deb}, journal = {Scientific Reports}, volume = {14}, number = {1}, pages = {11939}, publisher = {Nature Publishing Group}, date = {2024-05-24}, year = {2024}, month = may, doi = {10.1038/s41598-024-61921-7}, url = {https://www.nature.com/articles/s41598-024-61921-7} }