The controller is a very simple NN (it could be a LUT, actually), the same for all agents, optimized with CMA-ES to maximize digging depth after 1000 steps.
Input: state of 6 neighbours, rotated according to heading. (Agents have no memory.)
Output: action probabilities (softmax). Four actions: move-forward, forward-right, forward-left, pull-back.
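A minimal sketch of how such a controller could look, not the actual implementation. It assumes a single linear layer, neighbour states encoded as numbers in a fixed world order, and the heading given as an integer 0..5. All names here (`rotate_to_heading`, `action_probabilities`, `choose_action`, `weights`, `biases`) are hypothetical.

```python
import math
import random

ACTIONS = ["move-forward", "forward-right", "forward-left", "pull-back"]

def rotate_to_heading(neighbours, heading):
    """Reorder the 6 neighbour states so that index 0 is the cell ahead.

    Assumption: neighbours are listed in a fixed world order and
    heading is an integer in 0..5 indexing into that order.
    """
    return neighbours[heading:] + neighbours[:heading]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def action_probabilities(weights, biases, neighbours, heading):
    """Map the rotated 6-neighbour input to 4 action probabilities.

    Assumption: one linear layer (4 rows of 6 weights, plus 4 biases).
    The real network may have a different shape.
    """
    x = rotate_to_heading(neighbours, heading)
    logits = [
        b + sum(w * xi for w, xi in zip(row, x))
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)

def choose_action(weights, biases, neighbours, heading, rng=random):
    """Sample one action; no state is kept between calls (no memory)."""
    probs = action_probabilities(weights, biases, neighbours, heading)
    return rng.choices(ACTIONS, weights=probs, k=1)[0]
```

Because the output depends only on the current (discrete) neighbour states, the same policy could be stored as a table with one row of four probabilities per input pattern, which is why a LUT would do. The weights and biases are the kind of flat parameter vector an optimizer such as CMA-ES would tune.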
Some fun facts about that optimization result:
It's mostly about the bots learning to avoid clogging each other up. They delay digging the hole to lower the chance of multiple digging sites (which would all clog up the center before reaching max depth).
They linger around the gravel (which they can recognize by their 6-neighbour input), which again lowers the chance of creating another digging site, or of burying each other with their randomized actions.