TAU CS 0368-3414: Seminar on Foundations of Provable AI Security
Today’s AI systems are remarkably capable but often untrustworthy: we don’t really know why they make decisions or whether we can rely on them, especially when trained or hosted by others.
This seminar will explore emerging work in which ideas from cryptography and complexity theory, fields that have developed a rich set of tools for arguing formally about security, bring mathematical rigor to AI trust and safety.
Participation requires a background in, or comfort with, theoretical computer science (proofs rather than experiments). Some background in cryptography or machine learning is ideal but not necessary for students who are willing and able to catch up on the basic definitions from these fields as they come up.
The course requirements are reading, understanding, and presenting one paper in class, plus three short reflection assignments (two to three paragraphs each) over the course of the semester.
The seminar will run for 12 weeks. Each week, two students will prepare a pair of related one-hour talks on the same paper or on closely related papers. In some weeks, the natural split is between different parts of one paper; in other weeks, the pair should divide a main paper and one related or background paper between them.
Reference Material
Students do not need to already know all of the machine-learning or cryptography background needed for every topic. However, before picking a topic, it is a good idea to look at the reference material below and make sure you are comfortable catching up on whichever background your topic requires.
Core Presentation Topics
- Topic 1:
Undetectable Watermarking for Language Models (Yuval B. + Raz N., 5/5).
The basic cryptographic notion of watermarking language-model outputs without detectable degradation in quality.
[Main paper: Undetectable Watermarks for Language Models]
[Related: A Watermark for Large Language Models]
[Background blog post]
- Topic 2:
Robust Watermarking via Pseudorandom Codes (Erel B. + Edo K., 12/5).
How pseudorandom error-correcting codes (PRCs) yield stronger robustness guarantees for watermarking and related tasks.
[Main paper: Pseudorandom Error-Correcting Codes]
[Related: Edit Distance Robust Watermarks via Indexing Pseudorandom Codes]
[Further development: Improved Pseudorandom Codes from Permuted Puzzles]
Suggested scope: focus mainly on the early sections of the PRCs paper and use the related papers as context or extensions.
- Topic 3:
Undetectable Steganography for Language Models (Itamar S. + Omri Ba., 19/5).
Hiding arbitrary secret payloads inside model outputs, with provable indistinguishability guarantees.
[Main paper: Undetectable Steganography for Language Models]
[Related: steganography section in Pseudorandom Error-Correcting Codes]
[Optional consequences: Chain-of-thought monitoring / agent collusion (first sections only)]
- Topic 4:
Verification of PAC Learning (Itamar T. + Ori B., 26/5).
Interactive proofs for verifying that a learned hypothesis is approximately correct, sometimes with much less data than learning itself.
[Main paper: Interactive Proofs for Verifying Machine Learning]
- Topic 5:
Adversarial Examples and Computational Hardness (Omri Bo. + Amit H., 2/6).
A theoretical view in which robustness against computationally bounded adversaries can rely on computational hardness assumptions.
[Main paper: Adversarially Robust Learning Could Leverage Computational Hardness]
[Optional background: adversarial examples lecture notes]
Suggested split: one talk on the adversarial-examples background and formal setup, one on the computational-hardness construction and implications.
- Topic 6:
Undetectable Backdoors in Machine Learning Models (Shaked S. + Itay R., 9/6).
A theoretical framework showing how a malicious trainer can plant a hidden backdoor while remaining computationally indistinguishable from a clean model.
[Main paper: Planting Undetectable Backdoors in Machine Learning Models]
Suggested split: one talk on the black-box signature-based construction, one on the white-box / learning-paradigm-specific construction and implications for robustness.
- Topic 7:
Backdoor Mitigation without Detection (Hadar K. + Nir S., 16/6).
How to remove backdoors while avoiding an explicit detection step.
[Main paper: Oblivious Defense in ML Models: Backdoor Removal without Detection]
- Topic 8:
Cryptographic Hardness of Learning (Roee C. + Ron G., 23/6).
Showing that some concept classes are PAC learnable in principle but not efficiently learnable under standard cryptographic assumptions.
[Main paper: Cryptographic Hardness of Learning Halfspaces with Massart Noise]
[Background: Continuous LWE]
- Topic 9:
Model Extraction Attacks as Cryptanalysis (Ofek L. + Shahaf G., 30/6).
Exact or near-exact recovery of model parameters from query access, viewed through a cryptanalytic lens.
[Main option: Cryptanalytic Extraction of Neural Network Models]
[Advanced option: Polynomial Time Cryptanalytic Extraction of Deep Neural Networks in the Hard-Label Setting]
- Topic 10:
The Computational Intractability of Filtering for AI Alignment (Gal C. + Nathan L., 7/7).
Formal limitations on using external filters or judges to separate safe from unsafe behavior.
[Main paper: On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment]
[Related consequences: Bypassing Prompt Guards in Production with Controlled-Release Prompting]
The list above contains the core pool of topics. The final 12 seminar slots will be chosen from these topics together with some of the optional topics below, depending on student interest and overlap in preferences.
Optional Topics
* Optional topics may or may not be filled by students.