Surface Evolver Bench: my benchmark asking LLMs to write complex physical simulations in a custom data format
I wrote a small custom benchmark based on some work I did in grad school. Surface Evolver is a tool released in 1992 (!) for modeling liquid surfaces. It is useful for tasks such as studying solder deposition on chips, modeling liquid fuel tanks or designing lab-on-a-chip networks.
To set up a simulation, you need to define a custom datafile with vertices, edges, faces, bodies, constraints, energies, and boundary integrals. I attached some (non-task) examples of liquid droplets (green) on solid surfaces (orange), including droplets sitting in ridges, bridging between rods, and in a cross-slot.
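To give a rough idea of what such a datafile looks like, here is a minimal Python sketch that writes the geometry sections for a unit cube with a fixed volume, modeled on the well-known cube example from the Surface Evolver manual. It is not one of the benchmark tasks, and it leaves out constraints, energies, and boundary integrals entirely. The helper names (`edge_endpoints`, `face_is_closed`, `render_datafile`) are made up for illustration. Faces are oriented edge loops, where a negative edge id means the edge is traversed backwards, so the sketch also checks that each loop closes up.

```python
# Illustrative only: geometry follows the unit-cube example from the
# Surface Evolver manual; the helper functions are hypothetical.

VERTICES = {
    1: (0.0, 0.0, 0.0), 2: (1.0, 0.0, 0.0), 3: (1.0, 1.0, 0.0), 4: (0.0, 1.0, 0.0),
    5: (0.0, 0.0, 1.0), 6: (1.0, 0.0, 1.0), 7: (1.0, 1.0, 1.0), 8: (0.0, 1.0, 1.0),
}

# Each edge is (tail vertex, head vertex).
EDGES = {
    1: (1, 2), 2: (2, 3), 3: (3, 4), 4: (4, 1),
    5: (5, 6), 6: (6, 7), 7: (7, 8), 8: (8, 5),
    9: (1, 5), 10: (2, 6), 11: (3, 7), 12: (4, 8),
}

# Each face is an oriented loop of signed edge ids.
FACES = {
    1: (1, 10, -5, -9),
    2: (2, 11, -6, -10),
    3: (3, 12, -7, -11),
    4: (4, 9, -8, -12),
    5: (5, 6, 7, 8),
    6: (-4, -3, -2, -1),
}

# One body: (oriented faces, target volume).
BODIES = {1: ((1, 2, 3, 4, 5, 6), 1.0)}


def edge_endpoints(signed_id):
    """Return (start, end) vertices of an edge, flipped if the id is negative."""
    tail, head = EDGES[abs(signed_id)]
    return (tail, head) if signed_id > 0 else (head, tail)


def face_is_closed(loop):
    """True if every edge in the loop ends where the next one starts."""
    for current, following in zip(loop, loop[1:] + loop[:1]):
        if edge_endpoints(current)[1] != edge_endpoints(following)[0]:
            return False
    return True


def render_datafile():
    """Build the datafile text section by section."""
    lines = ["vertices"]
    lines += [f"{vid} {x:g} {y:g} {z:g}" for vid, (x, y, z) in VERTICES.items()]
    lines += ["", "edges"]
    lines += [f"{eid} {tail} {head}" for eid, (tail, head) in EDGES.items()]
    lines += ["", "faces"]
    lines += [f"{fid} {' '.join(map(str, loop))}" for fid, loop in FACES.items()]
    lines += ["", "bodies"]
    lines += [
        f"{bid} {' '.join(map(str, faces))} volume {vol:g}"
        for bid, (faces, vol) in BODIES.items()
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    assert all(face_is_closed(loop) for loop in FACES.values())
    print(render_datafile())
```

Getting a sign or an edge order wrong in a face loop is the kind of small mistake that makes a datafile fail to load or produce a nonsensical surface, which is part of why the run-and-debug loop matters here.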
This makes it an interesting LLM benchmark (I think), since there is a natural agentic loop: consulting the docs, implementing the spec, running the simulation, debugging the output, and so on.
- gpt5.5 is the best at this, only model to solve several of the tasks for now
- glm5.2 is the best open model
Link: https://yhenon.github.io/surface-evolver-llm-eval/ [reposted after briefly having it up then deleting last week since I found some issues]