Differentiating the discrete: Automatic Differentiation meets Integer Optimization
In continuous convex optimization, duality is often the theoretical foundation for computing the sensibility of the optimal value of a problem to one of its parameters. In the nonlinear domain, it is fairly standard to assume one can compute at any point of the domain the function $f(x)$ and gradient $\nabla f(x)$.
What about discrete optimization?
The first thought would be that differentiating
the resolution of a discrete problem does not make sense, the information it yields
since infinitesimal variations in the domain of the variables do not make sense.
However, three cases come to mind for which asking for gradients makes perfect sense:

In mixedinteger linear problems, some variables take continuous values. All linear expressions are differentiable, and every constraint coefficient, righthandside and objective coefficient can have an attached partial derivative.

Even in pureinteger problems, the objective value will be a continuous function of the coefficients, possibly locally smooth, for which one can get the partial derivative associated with each weight.

We might be interested in computing the derivative of some expression of the variables with respect to some parameters, without this expression being the objective.
For these points, some dualitybased techniques and reformulations can be used, sometimes very expensive when the input size grows. One common approach is to first solve the problem, then fixing the integer variables and resolving the continuous part of the problem to compute the dual values associated with each constraint, and the reduced cost coefficients. This leads to solving a NPhard problem, followed by a second solution from scratch of a linear optimization problem, still, it somehow works.
More than just solving the model and computing results, one major use case is embarking the result of an optimization problem into another more complete program. The tricks developed above cannot be integrated with an automated way of computing derivatives.
Automatic Differentiation
Automatic Differentiation is far from new, but has known a gain in attention in the last decade with its used in ML, increasing the usability of the available libraries. It consists in getting an augmented information out of a function.
If a function has a type signature f: a > b
, the goal is, without modifying
the function, to compute a derivative, which is also a function, which to every
point in the domain, yields a linear map from domain to codomain df: a > (a o b)
,
where a o b
denotes a linear map, regardless of underlying representation (matrix, function, …).
See the talk and paper^{1} for a typebased formalism of AD if you are ok with programming language formalism.
Automatic differentiation on a pureJulia solver
ConstraintSolver.jl is a recent
project by Wikunia. As the name indicates, it is a
constraint programming
solver, a more ComputerScienceflavoured approach to integer optimization.
As a Julia solver, it can leverage both multiple dispatch and the type system
to benefit from some features for free. One example of such
feature is automatic differentiation: if your function is generic enough
(not relying on a specific implementation of number types, such as Float64
),
gradients with respect to some parameters can be computed by calling the function
just once (forwardmode automatic differentiation).
Example problem: weighted independent set
Let us consider a classical problem in combinatorial optimization, given an undirected graph $G = (V, E)$, finding a subset of the vertices, such that no two vertices in the subset are connected by an edge, and that the total weight of the chosen vertices is maximized.
Optimization model of the weighted independent set
Formulated as an optimization problem, it looks as follows:
$$\begin{align} (\mathcal{P}): \max_{x} & \sum_{i \in V} w_i x_i \\\\ \text{s.t.} \\\\ & x_i + x_j \leq 1 \,\, \forall (i,j) \in E \\\\ & x \in \mathbb{B}^{V} \end{align} $$
Translated to English, this would be maximizing the weighted sum of picked vertices, which are decisions living in the $V$th dimensional binary space, such that for each edge, no two vertices can be chosen. The differentiable function here is the objective value of such optimization problem, and the parameters we differentiate with respect to are the weights attached to each vertex $w_i$. We will denote it $f(w) = \max_x (\mathcal{P}_w)$.
If a vertex $i$ is not chosen in a solution, there are two cases:
 the vertex has the same weight as at least one other, say $j$, such that swapping $i$ and $j$ in the selected subset does not change the optimal value. of $\mathcal{P}$. In that case, there is a kink in the function, a discontinuity of the derivative, which may not be computed correctly by automatic differentiation. This is related to the phenomenon of degeneracy in the simplex algorithm, multiple variables could be chosen equivalently to enter the base.
 there is no other vertex with the same weight, such that swapping the two maintains the same objective value. In that case, the derivative is $0$, small enough variations of the weight does not change the solution nor the objective.
If a vertex $i$ is chosen in a solution, then $x_i = 1$, and the corresponding partial derivative of the weight is $\frac{\partial f(w)}{\partial w_i} = 1$.
A Julia implementation
We will import a few packages, mostly MathOptInterface.jl (MOI), the foundation for constrained optimization, the solver itself, the Test standard lib, and ForwardDiff.jl for automatic differentiation.
using Test
import ConstraintSolver
const CS = ConstraintSolver
import MathOptInterface
const MOI = MathOptInterface
import ForwardDiff
Let us first write an implementation for the maxweight independent set problem. We will use a 4vertex graph, looking as such:
The optimal answer here is to pick vertices 1 and 4 (in orange).
@testset "Max independent set MOI" begin
matrix = [
0 1 1 0
1 0 1 0
1 1 0 1
0 0 1 0
]
model = CS.Optimizer()
x = [MOI.add_constrained_variable(model, MOI.ZeroOne()) for _ in 1:4]
for i in 1:4, j in 1:4
if matrix[i,j] == 1 && i < j
(z, _) = MOI.add_constrained_variable(model, MOI.GreaterThan(0.0))
MOI.add_constraint(model, z, MOI.Integer())
MOI.add_constraint(model, z, MOI.LessThan(1.0))
f = MOI.ScalarAffineFunction(
[
MOI.ScalarAffineTerm(1.0, x[i][1]),
MOI.ScalarAffineTerm(1.0, x[j][1]),
MOI.ScalarAffineTerm(1.0, z),
], 0.0
)
MOI.add_constraint(model, f, MOI.EqualTo(1.0))
end
end
weights = [0.2, 0.1, 0.2, 0.1]
terms = [MOI.ScalarAffineTerm(weights[i], x[i][1]) for i in eachindex(x)]
objective = MOI.ScalarAffineFunction(terms, 0.0)
MOI.set(model, MOI.ObjectiveFunction{typeof(objective)}(), objective)
MOI.set(model, MOI.ObjectiveSense(), MOI.MAX_SENSE)
MOI.optimize!(model)
# add some tests
end
Why the additional code with(z, _) = MOI.add_constrained_variable(model, MOI.GreaterThan(0.0))
?
ConstraintSolver.jl does not yet support constraints of the type a x + b y <= c
,
but linear equality constraints are fine, so we can derive equivalent formulations by adding a
slack variable z
.
For this problem, the tests could be on both the solution and objective value, as follows:
@test MOI.get(model, MOI.VariablePrimal(), x[4][1]) == 1
@test MOI.get(model, MOI.VariablePrimal(), x[1][1]) == 1
@test MOI.get(model, MOI.ObjectiveValue()) ≈ 0.3
An equivalent JuMP version would look look this:
matrix = [
0 1 1 0
1 0 1 0
1 1 0 1
0 0 1 0
]
m = Model(with_optimizer(CS.Optimizer))
x = @variable(m, x[1:4], Bin)
for i in 1:4, j in i+1:4
if matrix[i,j] == 1
zcomp = @variable(m)
JuMP.set_binary(zcomp)
@constraint(m, x[i] + x[j] + zcomp == 1)
end
end
w = [0.2, 0.1, 0.2, 0.1]
@objective(m, Max, dot(w, x))
optimize!(m)
Why are we not using JuMP, which is much more concise and closer to the mathematical formulation?
JuMP uses Float64
for all value types, which means we do not get the benefit of
generic types, while MathOptInterface
types are parameterized by the numeric type used.
To be fair, maintaining type genericity on a project as large as JuMP
is hard without making performance compromises. JuMP is not built of functions, but
of a model object which contains a mutable state of the problem being constructed,
and building an Algebraic Modelling Language without this incremental build of the
model has not proved successful till now. One day, we may get a powerful declarative
DSL for mathematical optimization, but it has not come yet.
Back to our problem, we now have a way to compute the optimal value and solution. Let us implement our function $f(w)$:
function weighted_stable_set(w)
matrix = [
0 1 1 0
1 0 1 0
1 1 0 1
0 0 1 0
]
model = CS.Optimizer(solution_type = Real)
x = [MOI.add_constrained_variable(model, MOI.ZeroOne()) for _ in 1:4]
for i in 1:4, j in 1:4
if matrix[i,j] == 1 && i < j
(z, _) = MOI.add_constrained_variable(model, MOI.GreaterThan(0.0))
MOI.add_constraint(model, z, MOI.Integer())
MOI.add_constraint(model, z, MOI.LessThan(1.0))
f = MOI.ScalarAffineFunction(
[
MOI.ScalarAffineTerm(1.0, x[i][1]),
MOI.ScalarAffineTerm(1.0, x[j][1]),
MOI.ScalarAffineTerm(1.0, z),
], 0.0
)
MOI.add_constraint(model, f, MOI.EqualTo(1.0))
end
end
terms = [MOI.ScalarAffineTerm(w[i], x[i][1]) for i in eachindex(x)]
objective = MOI.ScalarAffineFunction(terms, zero(eltype(w)))
MOI.set(model, MOI.ObjectiveFunction{typeof(objective)}(), objective)
MOI.set(model, MOI.ObjectiveSense(), MOI.MAX_SENSE)
MOI.optimize!(model)
return MOI.get(model, MOI.ObjectiveValue())
end
We can now compute the gradient in one function call with ForwardDiff:
@testset "Differentiating stable set" begin
weights = [0.2, 0.1, 0.2, 0.1]
∇w = ForwardDiff.gradient(weighted_stable_set, weights)
@test ∇w[1] ≈ 1
@test ∇w[4] ≈ 1
@test ∇w[2] ≈ ∇w[3] ≈ 0
end
To understand how this derivative computation can work with just few function calls (proportional to the size of the input), one must dig a bit deeper in Dual Numbers. I will shamelessly refer to my slides at the Lambda Lille meetup for an example implementation in Haskell.
Why not reversemode?
I mentioned that the cost of computing the value & derivatives is proportional to the size of the input, which can increase rapidly for realworld problems. This is specific to socalled forward mode automatic differentiation. We will not go over the inner details of forward versus reverse. As a rule of thumb, forwardmode has less overhead, and is better when the dimension of the output far exceeds the dimension of the input, while reversemode is better when the dimension of the input exceeds the one of the output.
Giving reverse with Zygote a shot
Getting back to our question, the answer is rather downtoearth,
the reversemode I tried simply did not work there.
Reversemode requires tracing the normal function call, building a
“tape”, this means that it needs a representation of the function
(as a graph or other).
I gave Zygote.jl
a try, which can be done by replacing ForwardDiff.gradient(f,x)
with
Zygote.gradient(f, x)
in the snippet above.
Building a representation of the function means Zygote must have a
representation of all operations performed. For the moment,
this is still restricted to a subset of the Julia language
(which is far more complex than commonly encountered mathematical functions
built as a single expression). This subset still excludes throwing and
handling exceptions, which is quite present in both ConstraintSolver.jl
and MathOptInterface.
I have not tried the other reverse tools for the sake of conciseness (and time), so feel free to check out Nabla.jl, ReverseDiff.jl and Tracker.jl.
How could this be improved?
A first solution could be to move the idiom of Julia from throw/try/catch
to handling errors as values, using something like the Result/Either
type
in Scala / Haskell / Rust and corresponding libraries.
Another alternative, currently happening is to keep pushing Zygote to support more features from Julia, going in the direction of supporting differentiation of any program, as dynamic as it gets.
One last option for the particular problem of exception handling would be
to be able to optout of input validation, with some @validate expr
,
with expr
potentially throwing or handling an error, and a @nocheck
or @nothrows
macro in front of the function call, considering the function
will remain on the happy path and not guaranteeing validity or error messages
otherwise. This works exactly like the @boundscheck
, @inbounds
pair for
index validation.
Conclusion, speculation, prospect
This post is already too long so we’ll stop there. The biggest highlights here are that:
 In discrete problems, we also have some continuous parts.
 Julia’s type system allows AD to work almost out of the box in most cases.
 With JuMP and MOI, solving optimization problems is just another algorithmic building block in your Julia program, spitting out results, and derivatives if you make them.
 I believe that’s why plugging in solvers developed in C/C++ is fine, but not always what we want. I would be ready to take a performance hit on the computation time of my algorithms to have some hackable, typegeneric MILP solver in pure Julia.^{2}
Special mentions
Thanks a lot to Wikunia, first for developing ConstraintSolver.jl, without which none of this would have been possible, and for the open discussion on the multiple issues I posted. Don’t hesitate to check out his blog, where the whole journey from 0 to a constraint solver is documented.

The simple essence of automatic differentiation, Conal Elliott, Proceedings of the ACM on Programming Languages (ICFP), 2018 ↩︎

I believe a pureJulia solver could be made as fast as a C/C++ solver, but developing solvers is an enormous amount of work and microoptimizations, tests on industrial cases. The new HiGHS solver however shows that one can get pretty good results by developing a linear solver from scratch with all modern techniques already baked in. ↩︎