Ratings

3 Matching Ratings

Rated Article
Alignment faking in large language models A paper from Anthropic's Alignment Science team on alignment faking in large language models anthropic.com 2,000 words Rated 2024-12-19 7:01pm - sethherr
Detecting and countering misuse of AI: August 2025 Anthropic's threat intelligence report on AI cybercrime and other abuses anthropic.com 2,000 words Rated 2025-9-1 9:03pm - sethherr
A small number of samples can poison LLMs of any size Anthropic research on data-poisoning attacks in large language models anthropic.com 2,000 words Rated 2025-10-9 9:39pm - sethherr