In the quest to speed up the processing of Sudokus, we’re wondering: does it make sense to recode Sudoku solving using CUDA? Before doing that, it makes sense to optimize what we already have and do some preparatory work to smooth the transition to CUDA if needed.
Remember – we’re not trying to solve all Sudokus of any size, though that might be nice. Specifically, we are trying to speed up the solving of large Sudokus which, in theory, could be solved by anyone using hidden-singles strategies only. Most of the time in the code is spent resolving which values are candidate solutions for a given cell.
from functools import reduce
import numpy as np

def check_cell(row, col, order, RowCandidates, ColCandidates, BoxCandidates):
    # Per-cell check for a single blank (wrapped in a function here for context)
    # Box (block) index of the cell, in row-major order
    boxpos = order * (row // order) + (col // order)
    # Values still allowed in the row, column, and box simultaneously
    candidates = reduce(np.intersect1d, (RowCandidates[row], ColCandidates[col], BoxCandidates[boxpos]))
    if candidates.size == 1:
        return row, col, candidates[0]
    return None
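For context, here is a sketch of the driver loop we assume wraps this check (sweep is a hypothetical name of ours; the original surrounding code is not shown):

def sweep(grid, order, RowCandidates, ColCandidates, BoxCandidates):
    # One pass over every blank; the solver repeats this until no single
    # is found, so this per-cell Python loop is where the time goes
    for row, col in zip(*np.where(grid == 0)):
        found = check_cell(row, col, order, RowCandidates,
                           ColCandidates, BoxCandidates)
        if found is not None:
            return found  # (row, col, value) to fill in
    return None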
Leveraging NumPy
For each blank cell we check its associated row, column, and box (block) to see whether there is only one value that can go in that cell. The first step in optimizing is to vectorize this code. Unfortunately, it doesn’t vectorize easily as written. However, if we reframe the problem to take advantage of NumPy’s built-in vectorization, we get:
import numpy as np

# grid is a 2D NumPy array for the Sudoku puzzle (0 marks a blank)
# RowCandidates[r] is a 1D boolean mask over values noting each value
# already present in row r (so the full array is 2D); ColCandidates and
# BoxCandidates are laid out the same way per column and per box
mask = (grid == 0)                 # True at every blank cell
rowList, colList = np.where(mask)  # coordinates of all blanks at once
boxList = order * (rowList // order) + (colList // order)
# A value is a candidate for a blank only if it is absent from the
# cell's row, column, and box
candidates = ~(RowCandidates[rowList] | ColCandidates[colList] | BoxCandidates[boxList])
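Continuing the snippet above, one hedged sketch of how the result might be consumed (assuming column v of the candidate mask corresponds to value v):

# Count remaining candidates per blank (one row of the mask per blank)
counts = candidates.sum(axis=1)
# Blanks with exactly one remaining candidate can be filled immediately
singles = np.where(counts == 1)[0]
# argmax over a boolean row returns the index of its lone True entry
values = np.argmax(candidates[singles], axis=1)
for i, v in zip(singles, values):
    grid[rowList[i], colList[i]] = v
# RowCandidates, ColCandidates, and BoxCandidates must be refreshed
# after these placements before the masks are reused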
The code looks somewhat similar. However, now we are handling all of the blanks in the entire Sudoku at once. Wait a minute. This looks suspiciously like the candidate mode in most Sudoku games. We’re not against computing candidates for each cell. We’re against storing the candidates. Why? Because even in the best of cases the memory requirement is
memory (bytes) = number of blanks * box dimension * box dimension
Key Message
What’s the big deal? The number of blanks scales with the box dimension raised to the 4th power (a grid with box dimension b has b² × b² = b⁴ cells, and the blanks scale with the cell count). That means memory scales with the box dimension to the 6th power (or the grid dimension to the 3rd power). You quickly run out of memory. Essentially, computing the candidate solutions for each cell on demand vs. retaining the candidates in memory is a tradeoff of speed vs. memory. The speedup can be up to 40x, at least for the Sudokus that we have been able to solve to date.
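To make that scaling concrete, here is a back-of-the-envelope calculation; the box dimension of 100 is an assumed size for illustration, not one we have tested:

b = 100                       # box dimension (assumed for illustration)
grid_dim = b * b              # a 10,000 x 10,000 grid
cells = grid_dim ** 2         # 1e8 cells; blanks scale the same way, b**4
bytes_needed = cells * b * b  # one byte per candidate flag per cell
print(bytes_needed)           # 1,000,000,000,000 bytes, i.e. about 1 TB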
Can we compromise and use a little less memory and keep some of the speedup? We’ll see… And all of this effort is BEFORE we try to use CUDA with a GPU…
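One possible shape for that compromise (our sketch, not necessarily where this series will land) is to vectorize over fixed-size batches of blanks, so peak memory is the batch size times the grid dimension rather than the number of blanks times the grid dimension; candidates_in_chunks and chunk below are hypothetical names:

import numpy as np

def candidates_in_chunks(grid, order, RowCandidates, ColCandidates,
                         BoxCandidates, chunk=4096):
    # Same vectorized computation as above, restricted to `chunk` blanks
    # at a time: some speed is given back for a bounded memory footprint
    rowList, colList = np.where(grid == 0)
    boxList = order * (rowList // order) + (colList // order)
    for start in range(0, rowList.size, chunk):
        s = slice(start, start + chunk)
        cand = ~(RowCandidates[rowList[s]] |
                 ColCandidates[colList[s]] |
                 BoxCandidates[boxList[s]])
        yield rowList[s], colList[s], cand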
The key question is whether there is enough “work” for the GPU to offset the data transfer times. A simulated annealing solver, if we can develop one, might be more efficient for really large Sudokus.
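As preparatory work for that transition, note that the NumPy formulation above ports almost verbatim to the GPU via CuPy. This is a hedged sketch assuming CuPy and a CUDA-capable GPU are available; the explicit transfers are exactly the costs the “enough work” question is about:

import cupy as cp  # NumPy-compatible arrays that live on the GPU

grid_gpu = cp.asarray(grid)   # host -> device transfer
RowC = cp.asarray(RowCandidates)
ColC = cp.asarray(ColCandidates)
BoxC = cp.asarray(BoxCandidates)

rowList, colList = cp.where(grid_gpu == 0)
boxList = order * (rowList // order) + (colList // order)
candidates = ~(RowC[rowList] | ColC[colList] | BoxC[boxList])
counts = cp.asnumpy(candidates.sum(axis=1))  # device -> host transfer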