ProgramBench: Can Language Models Rebuild Programs from Scratch?

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with...

AI Summary

ProgramBench is a new benchmark that evaluates whether language models can build complete software programs from scratch, working from documentation, rather than performing narrow tasks such as bug fixes. The benchmark includes 200 tasks ranging from simple CLI tools to complex software such as FFmpeg and SQLite, and it evaluates submissions with behavioral tests generated through fuzzing. In tests of nine language models, none fully solved any task; the best model passed 95% of tests on only 3% of tasks. Models also tended to produce single-file implementations whose structure differs significantly from that of human-written code.
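
To make the two headline figures concrete, here is a minimal sketch of how a per-task test pass rate could be turned into a "fully solved" share and a "95% of tests" share. It is not ProgramBench's actual scoring code: the function names, the reading of 95% as a threshold (at least 95% of a task's tests passing), and the four-task data set are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical per-task results: task name -> (tests passed, tests run).
# These numbers are invented for illustration, not benchmark data.
TaskResults = Dict[str, Tuple[int, int]]


def pass_rate(passed: int, total: int) -> float:
    """Fraction of a task's behavioral tests that passed."""
    return passed / total if total else 0.0


def summarize(results: TaskResults, threshold: float = 0.95) -> Dict[str, float]:
    """Share of tasks fully solved, and share at or above the threshold.

    Treating 95% as a minimum pass rate is an assumption about how the
    summary's figure is defined.
    """
    n = len(results)
    fully = sum(1 for p, t in results.values() if t > 0 and p == t)
    near = sum(1 for p, t in results.values() if pass_rate(p, t) >= threshold)
    return {"fully_solved": fully / n, "near_solved": near / n}


example: TaskResults = {
    "task_a": (96, 100),  # above the threshold, but not fully solved
    "task_b": (50, 100),
    "task_c": (75, 100),
    "task_d": (10, 100),
}

summary = summarize(example)
print(summary)
```

In this toy data set no task passes every test, so the fully solved share is zero even though one task in four clears the threshold. That is the same shape as the reported result, where a model can come close on a few tasks without completing any.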
