Surface Evolver Bench: my benchmark asking LLMs to write complex physical simulations in a custom data format

I wrote a small custom benchmark based on some work I did in grad school. Surface Evolver is a tool released in 1992 (!) for modeling liquid surfaces. It is useful for tasks such as studying solder deposition on chips, modeling liquid fuel tanks or designing lab-on-a-chip networks.

To set up a simulation, you need to define a custom datafile with vertices, edges, faces, bodies, constraints, energies, and boundary integrals. I attached some sample (non-task) examples of liquid droplets (green) on solid surfaces (orange), including droplets sitting in ridges, bridging between rods, and sitting in a cross-slot.
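For anyone who hasn't seen the format, here is a rough Python sketch that writes out the geometric part of a datafile for a unit cube, modeled on the cube example from the Surface Evolver docs. It only covers the vertices, edges, faces, and bodies sections; constraints, energies, and boundary integrals (the hard part of the actual tasks) are left out. The helper names `face_is_closed` and `build_datafile` are made up for illustration and are not part of Surface Evolver or of the benchmark.

```python
# Illustrative only: builds the text of a unit-cube datafile, following the
# layout of the cube example in the Surface Evolver documentation.

VERTICES = [
    (0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
    (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1),
]
# Each edge is (tail vertex id, head vertex id); ids are 1-based.
EDGES = [
    (1, 2), (2, 3), (3, 4), (4, 1),
    (5, 6), (6, 7), (7, 8), (8, 5),
    (1, 5), (2, 6), (3, 7), (4, 8),
]
# Each face is an oriented loop of edge ids; a negative id means the edge
# is traversed backwards.
FACES = [
    (1, 10, -5, -9), (2, 11, -6, -10), (3, 12, -7, -11),
    (4, 9, -8, -12), (5, 6, 7, 8), (-4, -3, -2, -1),
]


def face_is_closed(face):
    """Check that consecutive oriented edges connect head to tail."""
    ends = []
    for e in face:
        tail, head = EDGES[abs(e) - 1]
        ends.append((tail, head) if e > 0 else (head, tail))
    n = len(ends)
    return all(ends[i][1] == ends[(i + 1) % n][0] for i in range(n))


def build_datafile():
    """Return the datafile text: one body made of all six faces, volume 1."""
    lines = ["vertices"]
    lines += [f"{i} {x:.1f} {y:.1f} {z:.1f}"
              for i, (x, y, z) in enumerate(VERTICES, 1)]
    lines += ["", "edges"]
    lines += [f"{i} {a} {b}" for i, (a, b) in enumerate(EDGES, 1)]
    lines += ["", "faces"]
    lines += [f"{i} " + " ".join(map(str, f)) for i, f in enumerate(FACES, 1)]
    lines += ["", "bodies"]
    face_ids = " ".join(str(i) for i in range(1, len(FACES) + 1))
    lines.append(f"1 {face_ids} volume 1")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    assert all(face_is_closed(f) for f in FACES)
    print(build_datafile())
```

Even in this tiny case the bookkeeping matters: flip the sign of one edge in a face and the loop no longer closes, which is the kind of mistake that only shows up once the file is actually loaded and run.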

This makes it an interesting LLM benchmark (I think), since there is a natural agentic loop: consult the docs, implement the spec, run the simulation, debug the output, and repeat.
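Very roughly, that loop has the shape below. This is a sketch of the general idea, not the benchmark's actual harness: `solve_task`, `propose`, and `run` are hypothetical names, with `propose` standing in for a model call and `run` for whatever wrapper executes the datafile and reports success plus a log.

```python
from typing import Callable, Optional, Tuple


def solve_task(
    task: str,
    propose: Callable[[str, Optional[str], Optional[str]], str],
    run: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 5,
) -> Optional[str]:
    """Draft a datafile, run it, and feed the log back until it passes
    or the attempt budget runs out."""
    datafile: Optional[str] = None
    log: Optional[str] = None
    for _ in range(max_attempts):
        # Hypothetical model call: sees the task, the last draft, and its log.
        datafile = propose(task, datafile, log)
        # Hypothetical simulator wrapper: returns (passed, log text).
        ok, log = run(datafile)
        if ok:
            return datafile
    return None
```

The point of passing the previous draft and log back into `propose` is that the model can debug from the simulator's output instead of starting over each time.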

- gpt5.5 is the best at this; so far it is the only model to solve several of the tasks
- glm5.2 is the best open model

Link: https://yhenon.github.io/surface-evolver-llm-eval/ [reposted: I had this up briefly last week, then deleted it after finding some issues]
