Speed comparisons of a few Python/C Sudoku solvers

Published on April 4, 2023

DALL·E 2024-01-03 07.23.59 - An abstract watercolor-style painting merged with brutalist art for a footrace scene between anthropomorphized programming languages Python and C. Pyt.webp Figure: DALL-E's rendition of a race between Python and C.

This is mostly a mirror of my repo of the topic. The text has been written in a sort of diary format. The repo itself claims to be 'under construction' with more solver variations appearing in the future, but to be honest this is doubtful. For all the code see the repo; any filepath references should be taken relative to that.

Prologue - What are working with?

The goal here is to study the speed increases that I as a particular Python dev can achieve by going from 'naive Python' to using either more pythonic tools, using C-powered libraries like numpy, or by at the extreme using straight up C-code. This is less cutting-edge research and more of a personal learning diary. If you want to see a more serious study of the topic, see e.g. this.

Why?

C is faster than Python. There is very little question about that. But there is a much smaller difference between C and a Python program correctly using the C-based fast parts of Python or one of the fast C-based libraries like Numpy. In particular, in many cases (including mine) the important question is whether I the Python programmer could write C-code that runs faster than the implementations that rely on the Python libraries whose C backends have been written by professional C developers.

To choose a target for our speedup attempts, we're gonna look at a straightforward recursive Sudoku solver. To put all to the same page with the terminology, a sudoku is a logical game where you have a 9x9 grid divided into 9 3x3 blocks:

The rule of the game is to fill the grid with numbers 1-9 such that any row, column or block contains each of the numbers 1-9 exactly once. The difficulty of the game depends on which cells have been pre-filled with which numbers. Estimating the difficulty of a sudoku is a non-trivial task, though if you are given starting configuration that has only one solution then adding legal numbers always makes it easier. (Removing them might not since multiple solutions might emerge by reducing restrictions, and if the starting solution has multiple solutions then adding a number might rule out simpler solutions.)

For the system we'll handle data input and output as strings of numbers, i.e. a single sudoku will be an string of length 81, with each character one of the numbers 0-9. Here with a 0 we will mean that a cell is unfilled. The numbers in this string are taken to be in such an order that they fill the standard 9x9 grid one by one, starting from top left and moving through the cells left to right, top to bottom.

Internally we'll handle the Sudoku as a 9x9 grid of possibilities. By this I mean that the content of any given sudoku cell will be an array of booleans, where each index is representing if a given number could be allowed in that cell. We do this instead of just having a 9x9 grid of numbers 0-9 for two reasons: - This base architecture makes it much easier to include some more complicated strategies. - We don't want to make the algorithm and the underlying structure too simple. Benchmarking Python vs Numpy vs C with a for loop doesn't sound interesting enough.

So the basic structure we'll look at will be something we call a possibility grid:

boolean p_grid[NUM_ROWS][NUM_COLS][NUM_POSSIBILITIES]

though all of the LITERAL_CONSTANTS here are actually just 9 and, depending on the version of the solver, instead of booleans we might be using integers or chars.

Anyway, pseudocode for the main solver will look like this:

def recursive_solver(sudoku : PossibilityGrid) -> bool:

    changed = True
    while(changed):
        changed = reduce_possibilities(sudoku) # Clever strategies

    if is_solved(sudoku):
        return True

    # Recursive brute forcing starts here:
    cell_idx = first_unsolved_cell_index(sudoku)
    possibilities = get_cell_possibilities(cell_idx, sudoku)
    for p in possibilities:
        local_sudoku = sudoku.copy()
        set_cell_of_sudoku(cell_idx, local_sudoku, p)

        result = recursive_solver(local_sudoku)
        if result:
            return True

    return False

So here the reduce_possibilities is responsible for clever strategies like "for this cell, the only number not yet appearing in its row, col and block is 5, so the only possibility for this cell is 5". The latter recursive part will just start brute forcing the solution after the clever stuff has failed.

What's our benchmark data?

I found a nice open source (Public Domain licennse) sudoku source on (kaggle)[https://www.kaggle.com/datasets/rohanrao/sudoku]. From that I sampled randomly 10k sudokus, which can be found in the folder ./data/ as a csv. This turned out to be a big enough sample to get various levels of difficulty without being too slow to wait to complete.

Chapter 1 - Python versus C in various rounds

The main idea is that I take turns improving my Python solvers and C solvers, with the aim of alternatively trying to close the gap with the Python versions and then trying to expand the gap with C versions. The benchmarks were run (at the time of writing) on an Azure VM B2ms (2 vcpus, 8 GiB memory) running Ubuntu 18.04. In my rudimentary scorecard, getting Python within an order of magnitude range of C is a draw, and getting Python to take less than 5x the time that C takes is a win for Python.

Round 1 - The baseline - Simple Python vs Simple C

For the first iteration I built a very straightforward implementation in Python ./src/naive_sudoku_solver.py. It's called naive because it doesn't do anything fancy, though it aims to be as pythonic as possible and not do too much obviously stupid things. On the benchmark machine it solved 10k sudokus in just under 50 seconds.

To compete the naive Python version, I then made a simple C-version ./C-version/sudoku_solver.c. I call it simple because simple C is the best I can do, and it probably does some obviously stupid things as well because I don't have a good touch to C. But in any case it not only compiles but also runs and using the same algorithm it solves the same 10k sudokus correctly in about 5 seconds - so an order of magnitude improvement right off the bat.

	s/Sudoku	10k Sudokus
Naive Python	5.1e-03	51s
Naive C	5.4e-04	5.2s

So as expected, C is much much faster but with our scoring system we're "only" an order of magnitude apart so we're calling this a draw. Let's see how we could start improving the Python version a bit.

Round 1.5 - OOP solution

Before we start to do anything fancy, I see that we're gonna be building Sudoku solvers that improve or extend previous versions. Shouldn't we maybe do this in a more Object Oriented Programming way? It should surely make this nice in the long run, but doesn't OOP create extra baggage that will slow everything down? Let's test this!

So in ./src/OOP_sudoku.py we have the very same solver as before, but in as a Sudoku Solving Object. We run the results again and what we see is:

	s/Sudoku	10k Sudokus
Naive Python	5.1e-03	51s
Naive C	5.4e-04	5.2s
OOP Python	4.9e-03	49s

So no, the Object structure doesn't bring any extra baggage, we're even a bit faster than we were with the naive solution. If we find¹ the energy, we'll later on try to understand why we were faster here. (I wouldn't be surprised if the answer has something to do with "optimizing bytecode interpreter".)

Round 2 - Improved Naive Python

Let's start small before even trying to bring out the big guns. We take the OOP solver and try and see if there is anything to optimize along the lines of "iterating over (long) lists is slow, dicts and other hash-table based things are fast".

Analysis

First of all let's look at where we actually spend time in our current OOP solver by running the benchmark for that solver under cProfile:

python -m cProfile -o OOP_solver.prof benchmark.py

The results are kinda long, but let's look at the top time spenders based both on cumulative time which counts even when the given function has called another function (and waiting for its results) and on total time which excludes time in sub-functions. The commands we run look like this:

$ python -m pstats OOP_solver.prof 
Welcome to the profile statistics browser.
OOP_solver.prof% strip
OOP_solver.prof% sort cumtime
OOP_solver.prof% stats 30
[...]
OOP_solver.prof% sort tottime
OOP_solver.prof% stats 30

And here are the results, first for cumulative time:

Mon Mar  6 10:11:01 2023    OOP_solver.prof

         252249901 function calls (252235248 primitive calls) in 114.974 seconds

   Ordered by: cumulative time
   List reduced from 696 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     44/1    0.000    0.000  115.336  115.336 {built-in method builtins.exec}
        1    0.000    0.000  115.336  115.336 benchmark.py:1(<module>)
        1    0.066    0.066  115.315  115.315 benchmark.py:15(run_benchmark)
    10000    0.038    0.000  114.444    0.011 OOP_sudoku.py:4(read_and_solve_sudoku_from_string)
24019/10000  0.170    0.000  112.645    0.011 OOP_sudoku.py:80(recursive_solver)
    84712    9.148    0.000   78.545    0.001 OOP_sudoku.py:171(reduce_possibilities)
  2135653    6.198    0.000   66.217    0.000 OOP_sudoku.py:197(get_simple_mask)
  6406959   21.542    0.000   44.162    0.000 OOP_sudoku.py:54(extract_exclusions)
    84712    1.542    0.000   29.708    0.000 OOP_sudoku.py:137(still_solvable)
  2287224   11.333    0.000   19.849    0.000 OOP_sudoku.py:44(collision_in_collection)
 88109228   17.485    0.000   17.639    0.000 {built-in method builtins.sum}
  2898061    9.392    0.000   11.871    0.000 OOP_sudoku.py:113(block)
 46275920    6.938    0.000    6.938    0.000 {method 'index' of 'list' objects}
 62948777    6.190    0.000    6.190    0.000 {method 'append' of 'list' objects}
  6406959    5.310    0.000    5.310    0.000 OOP_sudoku.py:56(<listcomp>)
  2898061    1.881    0.000    4.605    0.000 OOP_sudoku.py:110(col)
    84712    0.745    0.000    3.981    0.000 {built-in method builtins.all}
  6664729    1.971    0.000    3.238    0.000 OOP_sudoku.py:156(<genexpr>)
  2135653    3.202    0.000    3.202    0.000 OOP_sudoku.py:210(<listcomp>)
  6406959    2.807    0.000    2.807    0.000 OOP_sudoku.py:62(<listcomp>)
    24019    1.178    0.000    2.748    0.000 OOP_sudoku.py:241(__str__)
  2898061    2.724    0.000    2.724    0.000 OOP_sudoku.py:111(<listcomp>)
    14019    0.031    0.000    2.444    0.000 OOP_sudoku.py:231(copy)
  2824460    1.079    0.000    1.987    0.000 OOP_sudoku.py:19(p_array_to_num)
    24019    0.024    0.000    1.425    0.000 OOP_sudoku.py:76(__init__)
    24019    0.732    0.000    1.401    0.000 OOP_sudoku.py:28(convert_sudoku_string_to_p_grid)
    16974    0.292    0.000    0.937    0.000 OOP_sudoku.py:127(first_unsolved_cell_index)
     4295    0.814    0.000    0.814    0.000 OOP_sudoku.py:234(import_p_grid)
    10001    0.018    0.000    0.771    0.000 std.py:1174(__iter__)
      795    0.015    0.000    0.750    0.001 std.py:1212(update)

and then for total time:

Mon Mar  6 10:11:01 2023    OOP_solver.prof

         252249901 function calls (252235248 primitive calls) in 114.974 seconds

   Ordered by: internal time
   List reduced from 696 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  6406959   21.542    0.000   44.162    0.000 OOP_sudoku.py:54(extract_exclusions)
 88109228   17.485    0.000   17.639    0.000 {built-in method builtins.sum}
  2287224   11.333    0.000   19.849    0.000 OOP_sudoku.py:44(collision_in_collection)
  2898061    9.392    0.000   11.871    0.000 OOP_sudoku.py:113(block)
    84712    9.148    0.000   78.545    0.001 OOP_sudoku.py:171(reduce_possibilities)
 46275920    6.938    0.000    6.938    0.000 {method 'index' of 'list' objects}
  2135653    6.198    0.000   66.217    0.000 OOP_sudoku.py:197(get_simple_mask)
 62948777    6.190    0.000    6.190    0.000 {method 'append' of 'list' objects}
  6406959    5.310    0.000    5.310    0.000 OOP_sudoku.py:56(<listcomp>)
  2135653    3.202    0.000    3.202    0.000 OOP_sudoku.py:210(<listcomp>)
  6406959    2.807    0.000    2.807    0.000 OOP_sudoku.py:62(<listcomp>)
  2898061    2.724    0.000    2.724    0.000 OOP_sudoku.py:111(<listcomp>)
  6664729    1.971    0.000    3.238    0.000 OOP_sudoku.py:156(<genexpr>)
  2898061    1.881    0.000    4.605    0.000 OOP_sudoku.py:110(col)
    84712    1.542    0.000   29.708    0.000 OOP_sudoku.py:137(still_solvable)
    24019    1.178    0.000    2.748    0.000 OOP_sudoku.py:241(__str__)
  2824460    1.079    0.000    1.987    0.000 OOP_sudoku.py:19(p_array_to_num)
     4295    0.814    0.000    0.814    0.000 OOP_sudoku.py:234(import_p_grid)
    84712    0.745    0.000    3.981    0.000 {built-in method builtins.all}
    24019    0.732    0.000    1.401    0.000 OOP_sudoku.py:28(convert_sudoku_string_to_p_grid)
  2898061    0.513    0.000    0.513    0.000 OOP_sudoku.py:107(row)
4577538/4577451    0.494    0.000    0.494    0.000 {built-in method builtins.len}
      799    0.373    0.000    0.373    0.000 {method 'write' of '_io.TextIOWrapper' objects}
    16974    0.292    0.000    0.937    0.000 OOP_sudoku.py:127(first_unsolved_cell_index)
  1945539    0.200    0.000    0.200    0.000 {method 'copy' of 'list' objects}
24019/10000    0.170    0.000  112.645    0.011 OOP_sudoku.py:80(recursive_solver)
   464825    0.110    0.000    0.153    0.000 utils.py:330(<genexpr>)
    24019    0.108    0.000    0.670    0.000 OOP_sudoku.py:31(<listcomp>)
        1    0.066    0.066  115.315  115.315 benchmark.py:15(run_benchmark)
   462434    0.044    0.000    0.044    0.000 {built-in method unicodedata.east_asian_width}

For now I think the best place to focus on is the top of the total time, in particular these lines:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  6406959   21.542    0.000   44.162    0.000 OOP_sudoku.py:54(extract_exclusions)
 88109228   17.485    0.000   17.639    0.000 {built-in method builtins.sum}
  2287224   11.333    0.000   19.849    0.000 OOP_sudoku.py:44(collision_in_collection)
  2898061    9.392    0.000   11.871    0.000 OOP_sudoku.py:113(block)
    84712    9.148    0.000   78.545    0.001 OOP_sudoku.py:171(reduce_possibilities)
 46275920    6.938    0.000    6.938    0.000 {method 'index' of 'list' objects}
  2135653    6.198    0.000   66.217    0.000 OOP_sudoku.py:197(get_simple_mask)
 62948777    6.190    0.000    6.190    0.000 {method 'append' of 'list' objects}

So here we see that the biggest time consumers consist of a few built-in methods - sum, list.index and append - together with a few methods we've built, namely extract_exclusions, collisions_in_collection, block, reduce_possibilities and get_simple_mask.

Looking at how these transgressors interact, we note that the appending to list only happens in the methods collisions_in_collection, block, reduce_possibilities and the magic method __str__ which is used in the copying of a sudoku. The built-in sum method is also used only in methods listed above, together with another static method called p_array_to_num which has also found its way to the top 30 of both lists.

So looking how these different subparts consume our CPU time, we note that there might be chances to improve. For example, looking at our extract_exclusions method which holds top rank in both total and cumulative times, we see that we are using the sum only to detect if a given possibility array is already completely determined. And in fact, most of the uses of sum are only used to figure out if a given possibility array has only a single '1' in it. This seeems like an overkill.

So instead of sums, let's do a faster lookup. There are only 9 possibilities for an array of length 9 to have only one '1', so let's enumerate them and turn the check into a dictionary lookup. To do this, though, we need something hashable, so we need to turn from lists to tuples. Let's go one step further: we'll change the whole Sudoku data format from 3-dimensional list to a 2-dimensional list of tuples!

So what we changed going from OOP_sudoku.py to OOP_sudoku_improved.py was: 1. Turned the whole "from possibility array to number and back" thing to rely on dictionary lookups of pre-stored tuples. 2. Changed the internal sudoku format to have all the possibility arrays to be tuples instead of lists. 3. Reduced for loops by adding more early returns, plus other minor improvements.

And here are the results:

	s/Sudoku	10k Sudokus
Naive Python	5.1e-03	51s
Naive C	5.4e-04	5.2s
OOP Python	4.9e-03	49s
Improved OOP Python	4.7e-03	47s

So with all that work we shaved about four percents off our running time. cProfiler tells that for the total time the biggest time users are now:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  6406959   20.422    0.000   33.407    0.000 OOP_sudoku_improved.py:47(extract_exclusions)
 81526820   16.343    0.000   16.343    0.000 OOP_sudoku_improved.py:35(p_array_to_num)
  2287224    9.319    0.000   15.190    0.000 OOP_sudoku_improved.py:73(collision_in_collection)
    84712    9.137    0.000   65.780    0.001 OOP_sudoku_improved.py:179(reduce_possibilities)
  2898061    8.713    0.000   10.813    0.000 OOP_sudoku_improved.py:124(block)
  2135653    5.703    0.000   53.776    0.000 OOP_sudoku_improved.py:205(get_simple_mask)
 62948777    5.046    0.000    5.046    0.000 {method 'append' of 'list' objects}
  2135653    3.099    0.000    3.099    0.000 OOP_sudoku_improved.py:218(<listcomp>)
  6406959    2.731    0.000    2.731    0.000 OOP_sudoku_improved.py:49(<listcomp>)
  2898061    2.562    0.000    2.562    0.000 OOP_sudoku_improved.py:122(<listcomp>)
  2898061    1.776    0.000    4.338    0.000 OOP_sudoku_improved.py:121(col)

So one of the big differences is that we pushed the 17 or so seconds used by sum to the p_array_to_num function here. So I guess the lesson here is that sum(short_list) == 1 is also very fast as it is on par with essentially hash-table lookups? We might even conjecture that Python's built-in functions might be quite heavily optimized.

Round 3 - Numpy

So, our first speedup attempts had modest results. Maybe we've used up all the easy tricks in relying on fast parts of standard Python? Let's try and see if we can do more by using the fast Python library, i.e. Numpy. The logic behind this approach is that Sudoku is pretty much a glorified matrix, and numpy is good with matrices. This should be a no-brainer. In ./src/numpy_sudoku.py we have the OOP solver rewritten in such a way that the sudoku is not a 3-dimensional list but a 3-dimensional numpy-integer array.

	s/Sudoku	10k Sudokus
Naive Python	5.1e-03	51s
Naive C	5.4e-04	5.2s
OOP Python	4.9e-03	49s
Improved OOP Python	4.7e-03	47s
First numpy solver	3.3e-02	330s

So the first attempt to numpy solve this is that we spend an order of magnitude more solving it. This is pretty much the opposite of what I wanted. Let's fire up the good old cProfile to see what is going on.

From numpy_solver.prof we get again:

Tue Mar  7 11:28:23 2023    numpy_solver.prof

         456300559 function calls (456284372 primitive calls) in 515.038 seconds

   Ordered by: internal time
   List reduced from 1230 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 28669199   65.544    0.000   65.544    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    84712   53.859    0.001  395.698    0.005 numpy_sudoku.py:159(reduce_possibilities)
  6406959   52.711    0.000   67.175    0.000 function_base.py:5054(delete)
  8694183   38.006    0.000  217.958    0.000 numpy_sudoku.py:60(get_fixed_projection_array)
 50766043   36.277    0.000  240.813    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
 19861988   26.811    0.000   87.349    0.000 fromnumeric.py:69(_wrapreduction)
  8807211   22.568    0.000   56.763    0.000 numpy_sudoku.py:20(p_array_to_num)
 17490052   20.292    0.000  102.563    0.000 fromnumeric.py:2188(sum)
  2898061   17.290    0.000   23.512    0.000 numpy_sudoku.py:110(block)
 17490052   13.594    0.000  128.313    0.000 <__array_function__ internals>:177(sum)
  6406959   10.396    0.000  252.591    0.000 numpy_sudoku.py:53(extract_exclusions)
  2135653   10.322    0.000   10.322    0.000 numpy_sudoku.py:198(<listcomp>)
  8694183   10.192    0.000   10.192    0.000 {method 'take' of 'numpy.ndarray' objects}

Tue Mar  7 11:28:23 2023    numpy_solver.prof

         456300559 function calls (456284372 primitive calls) in 515.038 seconds

   Ordered by: cumulative time
   List reduced from 1230 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    141/1    0.000    0.000  515.038  515.038 {built-in method builtins.exec}
        1    0.000    0.000  515.038  515.038 benchmark.py:1(<module>)
        1    0.048    0.048  514.884  514.884 benchmark.py:20(run_benchmark)
    10000    0.048    0.000  513.466    0.051 numpy_sudoku.py:5(read_and_solve_sudoku_from_string)
24019/10000    0.216    0.000  500.796    0.050 numpy_sudoku.py:77(recursive_solver)
    84712   53.859    0.001  395.698    0.005 numpy_sudoku.py:159(reduce_possibilities)
  2135653   10.143    0.000  294.213    0.000 numpy_sudoku.py:185(get_simple_mask)
  6406959   10.396    0.000  252.591    0.000 numpy_sudoku.py:53(extract_exclusions)
 50766043   36.277    0.000  240.813    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
  8694183   38.006    0.000  217.958    0.000 numpy_sudoku.py:60(get_fixed_projection_array)
 17490052   13.594    0.000  128.313    0.000 <__array_function__ internals>:177(sum)
 17490052   20.292    0.000  102.563    0.000 fromnumeric.py:2188(sum)
 19861988   26.811    0.000   87.349    0.000 fromnumeric.py:69(_wrapreduction)
    84712    2.751    0.000   86.997    0.001 numpy_sudoku.py:124(still_solvable)
  6406959    5.251    0.000   78.532    0.000 <__array_function__ internals>:177(delete)
  2287224    6.429    0.000   75.701    0.000 numpy_sudoku.py:47(collision_in_collection)
  6406959   52.711    0.000   67.175    0.000 function_base.py:5054(delete)
 28669199   65.544    0.000   65.544    0.000 {method 'reduce' of 'numpy.ufunc' objects}
  8807211   22.568    0.000   56.763    0.000 numpy_sudoku.py:20(p_array_to_num)
  8694183    6.475    0.000   36.872    0.000 <__array_function__ internals>:177(take)
 14855826    9.852    0.000   25.918    0.000 <__array_function__ internals>:177(where)

We see especially from the total time calculations that we are spending a lot of time doing numpy-internal methods like reduce, delete and implement_array_function. My very non-expert feeling here is that we are paying a lot for the overhead of the numpy machinery that allows us to make really fast computations for large dimensional linear algebra calculations? Also Numpy tends to excel for computations that can be vectorized, and here we are doing a lot of for loops instead. So my current guess² is that not using numpy for vectorized calculations is killing our speed improvements and numpy overhead is pushing us to the red.

Sidenote - Using profiling to test for recursion paths

Note that in this line:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
24019/10000    0.216    0.000  500.796    0.050 numpy_sudoku.py:77(recursive_solver)

we can easily see from ncalls that the main recursive solver was called 10k times at the outset, and then 24k times internally. This gives not only an average recursion depth, but seeing that this number is the same for all the different versions of Python sudoku solvers we've used, we have strong evidence that they are using essentially the same recursive algorithm with same recursion depths. So the problem with the numpy version shouldn't be some subtle mistake about the recursive part not exiting early enough, but just about the numpy version not being faster.

Interlude - What did we learn here?

Premature optimization is the root of all evil.

So, before actually trying to do this, my knee-jerk reaction would have been to do this with numpy since numpy is fast with matrices. Turns out that that would have been a bad idea. Especially if I hadn't had a resonable naive version as a base case comparison.

In hindsight maybe the library that is built to handle 1 000 000 -dimensional float arrays and matrix multiplications of similar dimensions has some overhead when used to handle a 9x9x9 matrix of ones and zeroes? Who knew! (Option two: I'm doing something stupid with numpy here. At some point I'll go find³ numpy-based sudoku solvers of other people and see how they did it - probably the idea is to pivot your viewpoint to something less human-approach -centric.)

Chapter 2 - Using C within Python

If you can't beat them, join them! Next we'll try and see how to use the superiorly fast C in Python versions. Our benchmarking script already has two ways of invoking the C-code, both in the badly named ./src/CLI_C_caller.py file and both using the os.popen to run a bash command that invokes the compiled executable of the naive C-version.

The one we've used so far passes the test data csv filepath as a parameter:

def run_C_benchmark(test_data_filepath) -> str:
    stream = os.popen(f'./C-version/a.out -B "{test_data_filepath}"')
    output = stream.read()
    return output

Here the C-code is responsible for reading the file contents.

Another way would be to pass not the filepath to read but the Sudoku to solve as a string:

def read_and_solve_sudoku_from_string(input_sudoku: str) -> str:
    stream = os.popen(f"./C-version/a.out -r {input_sudoku} -S")
    output = stream.read()
    return output.strip("\n")

This relies on Python to read the file and then passes the Sudokus to the C-part one by one.

Let's see how much time we lose in piping the data from Python to C through the command line one-by-one:

	s/Sudoku	10k Sudokus
Naive Python	5.1e-03	51s
C - CSV	5.4e-04	5.7s
C - line-by-line	2.4e-03	49s

This seems to imply that the overhead of piping the data through popen/bash is quite large. So the the natural step feels to be to transform the C-system into something we can call from Python directly!

After studying the Real Python tutorial on the topic, it was almost easy⁴ to turn my C-code into a Python package. It's not great C nor great Python, but it works and that's what matters for now. Anyway. ./C-version/sudoku_solver_module.c contains the module that can be turned into a Python package called RamiSudoku. To do this, after activating your virtual environment, run python setup.py install in the folder ./C-version.

How's the speed now then?

	s/Sudoku	10k Sudokus
Naive Python	5.3e-03	53s
C - CSV	5.7e-04	5.7s
C - line-by-line	2.4e-03	24s
C based Python package	1.1e-04	1.1s

So, somewhat surprisingly, the Python-packaged version is five times faster than my naive C. I'm guessing here that the Python file reader was written by an actually competent C-developer, unlike my version of getting the test and comparison data from the csv. I do not currently have the energy to learn how to profile my C-code to verify if this really is the case.

Python 3.11

Speaking of competent developers, the word on the street is that Python 3.11 can provide considerable speed improvements compared to earlier versions. If I recall correctly, one of the main improvements has to do with more clever handling of data structures. This sounds like something that could benefit our algorithms since our main objects are really just lists/tuples of a specific type, not of generic objects. Let's run the benchmarks again but compare to Python 3.11 when suitable!

	10k Sudokus w/ 3.10	10k Sudokus w/ 3.11	Speed increase
Naive Python	53s	42s	23%
OOP Python	50s	40s	25%
Improved OOP Python	48s	37s	30%
First numpy solver	350s	320s	9%
C - CSV	5.7s	5.4s	5%
C - line-by-line	24s	24s	0%
C based Python package	1.1s	1.1s	0%

Not bad, 25-ish percent improvements on the more Python-heavy solutions, and even a 5% improvement on the wrapped C-benchmark.

Epilogue

What did we learn?

We learned a few things:

C is faster than Python when in the hands of a competent dev. My Python-packaged C-solver is faster than my raw C-solver, which clearly indicates that I am doing something silly in my C-version.
Read tutorials through and don't start to solo around. But if you don't and do, then when running to weird error messages, try to go back to patiently following the tutorial.
Don't (only) guess, always try out and test. I would not have guessed Numpy-version to be so slow or the C-extension module to be faster than my naive C-version.

What could be done better?

First of all, the algorithm is begging for improvements. E.g. the relatively minor change where the brute-force solver does not look for the first unsolved cell but the cell with least possibilities gives an immediate speedup by reducing the recursion count. A better approach would be to try and add some more clever reduction rules to the solver, and better yet would be to stop trying to solve this as a human would and use something much more cool.

We didn't. ↩
My first guess was that the numpy version would be much faster, so observing almost an order of magnitude slowdown we can deduce that my guesses on numpy are not to be trusted. ↩
Sure I will. ↩
As a caveat to those trying to repeat this: you're not supposed to compile gcc sudoku_solver_module.c, it won't work and you'll spend hours trying to troubleshoot a wrong thing. ↩