Bundle Adjustment – Part 3: Simulation Results

In the previous posts, we looked at how to calculate the Jacobians for Bundle Adjustment (BA) and examined their relative magnitudes and some of their properties. In this post, we'll look at some practical details of implementing BA.

Let’s quickly recap the problem we are trying to solve. Assume each camera j is parameterized by a vector a_j and each 3D point i by a vector b_i. Then, Bundle Adjustment involves minimizing the following objective function:

\min_{a_j,b_i}\sum\limits_{i=1}^n\sum\limits_{j=1}^m d(Q(a_j, b_i), x_{ij})^2

Here, Q(a_j, b_i) is the predicted projection of point i on image j, and d(x, y) denotes the Euclidean distance between the image points represented by the vectors x and y. The minimization is performed with respect to the vectors a_j and b_i. The vector a_j consists of the rotation and translation parameters that transform a world point into the camera coordinate system, together with the intrinsic parameters of the camera. The intrinsic parameters consist of the focal length (in the x and y dimensions), the coordinates of the center of projection, and the skew factor (if any). The vector b_i consists of the 3D coordinates of point i.

Together, these parameters are combined as follows to project a 3D world point onto the image using the standard perspective projection equation.

\begin{bmatrix}wu\\wv\\w\end{bmatrix} = \begin{bmatrix}f_x & s_k &u_0 \\0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix}R & T \end{bmatrix}\begin{bmatrix}X \\ Y \\ Z \\ 1 \end{bmatrix}

Henceforth, we'll assume that our camera is a rectilinear camera with square pixels. Thus, f_x = f_y and s_k = 0.
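To make this concrete, here is a minimal MATLAB sketch of the projection equation under these assumptions; all values and variable names here are illustrative, not taken from the original code.

```matlab
% Project a 3D world point onto the image using the standard perspective
% projection equation, assuming square pixels (f_x = f_y) and zero skew.
f  = 1.0;                       % focal length
u0 = 0.0;  v0 = 0.0;            % center of projection
K  = [f 0 u0; 0 f v0; 0 0 1];   % intrinsic matrix

R = eye(3);                     % rotation (world to camera)
T = [0; 0; 60];                 % translation (world to camera)

Xw = [10; 0; -10];              % 3D world point
p  = K * [R, T] * [Xw; 1];      % homogeneous image point [wu; wv; w]
uv = p(1:2) / p(3)              % image coordinates (u, v)
```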

Bundle Adjustment methods typically employ the Levenberg–Marquardt (LM) algorithm to find the minimum of the objective function. The Levenberg–Marquardt algorithm solves the least squares curve fitting problem: given a set of m pairs of independent and dependent variables (x_i, y_i), find the parameters \boldsymbol{\beta} of the function f(x, \boldsymbol{\beta}) such that the sum of the squares of the deviations is minimized:

\hat{\boldsymbol{\beta}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}}\sum\limits_{i=1}^m [y_i - f(x_i, \boldsymbol{\beta})]^2

As shown in [1], the least squares solution to this problem is obtained by solving a series of linear equations expressed in matrix form:

\boldsymbol{J^T}\boldsymbol{J}\boldsymbol{\delta}= \boldsymbol{J^T}(\boldsymbol{y} - \boldsymbol{f(\beta)})

Here \boldsymbol{f} and \boldsymbol{y} are vectors with i^{th} component f(x_i,\beta) and y_i respectively.

Levenberg’s contribution is to add a damping factor \lambda to the equation above, so that we are solving:

(\boldsymbol{J^T}\boldsymbol{J} + \boldsymbol{\lambda I})\boldsymbol{\delta}= \boldsymbol{J^T}(\boldsymbol{y} - \boldsymbol{f(\beta)})

\lambda is adjusted at each iteration. If, at the end of an iteration, the objective function is lower than before, \lambda is decreased; otherwise, it is increased.

Marquardt provided the insight that we can scale each component of the gradient according to the curvature so that there is larger movement along the directions where the gradient is smaller. This avoids slow convergence in the direction of small gradient. Therefore, Marquardt replaced the identity matrix I, with a diagonal matrix consisting of the diagonal elements of \boldsymbol{J^T}\boldsymbol{J} resulting in the Levenberg–Marquardt algorithm:

(\boldsymbol{J^T}\boldsymbol{J} + \boldsymbol{\lambda \textrm{diag}(J^TJ)})\boldsymbol{\delta}= \boldsymbol{J^T}(\boldsymbol{y} - \boldsymbol{f(\beta)})
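In code, a single iteration of the damped update looks roughly like this. This is a sketch; J, r, beta, lambda, and the objective function are assumed to be supplied by the surrounding solver.

```matlab
% One damped Gauss-Newton step. J is the Jacobian at the current estimate
% beta, and r = y - f(beta) is the residual vector.
JtJ = J' * J;

% Levenberg:  H = JtJ + lambda * eye(size(JtJ));
% Marquardt:  scale the damping by the curvature (diagonal of J'J):
H = JtJ + lambda * diag(diag(JtJ));

delta   = H \ (J' * r);         % solve the damped normal equations
betaNew = beta + delta;

% Adjust lambda: lower it if the objective improved, raise it otherwise.
if objective(betaNew) < objective(beta)
    beta   = betaNew;
    lambda = lambda / 10;
else
    lambda = lambda * 10;
end
```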

We'll examine the effects of the Levenberg and Marquardt modifications when we look at our simulation results.

Let’s now examine the J matrix that arises when bundle adjustment is applied to images resulting from perspective projection. Let’s assume we have:

  • N 3D object points
  • M cameras
  • P parameters for each camera.

To keep things simple, we'll assume that each object point is visible in each camera (this assumption is not required, and we'll briefly consider the changes needed to relax it). The parameters for each camera are a combination of intrinsic parameters, such as the camera center and focal length, and extrinsic parameters, such as the camera position and orientation in the world coordinate system. In our simulations, we assume that pixels are square and there is no skew. Thus, the camera parameters consist of the axis-angle representation of the camera orientation (w_x, w_y, w_z), the camera-object translation vector (t_x, t_y, t_z), the focal length f, and the camera center (u_0, v_0).

Let us use the following notation:

  • \boldsymbol {x_{ij}}: image coordinates of point i in image j (2 \times 1)
  • \boldsymbol{a}: vector consisting of the camera parameters (9 \times 1)
  • \boldsymbol{b}: 3D object point vector (3 \times 1)
  • A_{ijk} is the matrix consisting of the partial derivatives of the image coordinates of point i in image j with respect to the parameters of camera k. Thus, A_{ijk} = \frac{\partial{x_{ij}}}{\partial{a_k}}. In this post, we assume that all cameras are independent, so changes in the parameters of one camera affect only the image taken by that camera and have no effect on the other images. Therefore, A_{ijk} = 0 \ \forall j \neq k. Each A_{ij} is a 2 \times 9 matrix laid out as follows:

A_{ij} = \begin{bmatrix} \frac{\partial{x_{ij}}}{\partial{w_{xj}}} & \frac{\partial{x_{ij}}}{\partial{w_{yj}}} & \frac{\partial{x_{ij}}}{\partial{w_{zj}}} & \frac{\partial{x_{ij}}}{\partial{t_{xj}}} & \frac{\partial{x_{ij}}}{\partial{t_{yj}}} & \frac{\partial{x_{ij}}}{\partial{t_{zj}}} & \frac{\partial{x_{ij}}}{\partial{f_{j}}} & \frac{\partial{x_{ij}}}{\partial{u_{0j}}} & \frac{\partial{x_{ij}}}{\partial{v_{0j}}} \\ \frac{\partial{y_{ij}}}{\partial{w_{xj}}} & \frac{\partial{y_{ij}}}{\partial{w_{yj}}} & \frac{\partial{y_{ij}}}{\partial{w_{zj}}} & \frac{\partial{y_{ij}}}{\partial{t_{xj}}} & \frac{\partial{y_{ij}}}{\partial{t_{yj}}} & \frac{\partial{y_{ij}}}{\partial{t_{zj}}} & \frac{\partial{y_{ij}}}{\partial{f_{j}}} & \frac{\partial{y_{ij}}}{\partial{u_{0j}}} & \frac{\partial{y_{ij}}}{\partial{v_{0j}}}\end{bmatrix}
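A convenient way to sanity-check an analytically derived A_{ij} block is to compare it against central finite differences. In the sketch below, project(a, b) is a hypothetical helper that returns the 2×1 image point for a 9×1 camera parameter vector a and a 3×1 object point b:

```matlab
function A = numericalAij(a, b)
% Approximate A_ij = d(x_ij)/d(a_j) by central differences.
% project(a, b) is a hypothetical helper returning the 2x1 image point.
    h = 1e-6;
    A = zeros(2, 9);
    for k = 1:9
        da = zeros(9, 1);
        da(k) = h;
        A(:, k) = (project(a + da, b) - project(a - da, b)) / (2 * h);
    end
end
```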

The various A matrices laid out in 3D look like this:

It’s easy to see that the matrix is highly sparse as changing the parameters of one camera only affects the image taken by that camera.

We can unroll the third dimension of the matrix and append the submatrices corresponding to different points column-wise, so that the camera parameter part of the Jacobian matrix looks like this:

Let's now consider the variation of the image coordinates with the 3D object points. B_{ijk} denotes the matrix consisting of the partial derivatives of the image coordinates of point i in image j with respect to the coordinates of object point k. Thus, B_{ijk} = \frac{\partial{x_{ij}}}{\partial{b_k}}. Since a change in the 3D position of one point doesn't affect the image coordinates of another point, B_{ijk} = 0 \ \forall k \neq i. Each B_{ij} is a 2 \times 3 matrix laid out as follows:

B_{ij} = \begin{bmatrix} \frac{\partial{x_{ij}}}{\partial{X_{i}}} & \frac{\partial{x_{ij}}}{\partial{Y_{i}}} & \frac{\partial{x_{ij}}}{\partial{Z_{i}}} \\ \frac{\partial{y_{ij}}}{\partial{X_{i}}} & \frac{\partial{y_{ij}}}{\partial{Y_{i}}} & \frac{\partial{y_{ij}}}{\partial{Z_{i}}}\end{bmatrix}


Appending the two components, the full Jacobian matrix looks like this:

The vector \boldsymbol{x} consists of the image coordinates laid out in a column vector.

\boldsymbol{x} = \lbrack x_1, y_1, x_2, y_2 \ldots x_{N\times M}, y_{N\times M} \rbrack^T


Simulations

Before I describe the process and results of my simulations, let’s note one implementation detail. In my implementation, I separated the x and y components of the Jacobians so that the Jacobian matrix looks like this:


 The \boldsymbol{x} vector also needs to be rearranged accordingly. It now looks like this:

\boldsymbol{x} = \lbrack x_1, x_2 \ldots x_{N\times M}, y_1, y_2 \ldots y_{N\times M}\rbrack^T
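The reordering itself is a one-liner; a sketch assuming the interleaved measurement vector from the previous section:

```matlab
% Rearrange [x1; y1; x2; y2; ...] into [x1; x2; ...; y1; y2; ...].
xy = reshape(x, 2, []);         % row 1 holds the x's, row 2 the y's
x  = [xy(1, :), xy(2, :)]';     % all x components first, then all y's
```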

In my simulations, I considered 8 points located at the corners of a cube being viewed by multiple cameras. The set-up looks like this (only showing two cameras):


I carried out two sets of simulations. In the first set, only the intrinsic and extrinsic parameters of the cameras were optimized. The positions of the 3D points were not part of the optimization. Thus, the Jacobian matrix consisted only of the A_{ij} submatrices. In the second set, all parameters, including the positions of the 3D points, were part of the optimization.

Let us now look at the Matlab code for the first set.
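In outline, the optimization loop looks like the following minimal sketch. Here residualsAndJacobian is a hypothetical helper that projects the (fixed) object points through the current camera parameters and assembles the residual vector together with the Jacobian built from the A_{ij} blocks described above:

```matlab
% LM loop for the first simulation set: only the 9 parameters of each of
% the M cameras are optimized; the 3D points pts are held fixed.
% cams: 9 x M camera parameters, xObs: observed (noisy) image coordinates.
lambda = 1e-3;
for iter = 1:maxIter
    [r, J] = residualsAndJacobian(cams, pts, xObs);
    H      = J' * J + lambda * diag(diag(J' * J));   % Marquardt damping
    delta  = H \ (J' * r);
    camsNew = cams + reshape(delta, 9, []);          % per-camera update
    rNew    = residualsAndJacobian(camsNew, pts, xObs);
    if rNew' * rNew < r' * r
        cams   = camsNew;                            % accept the step
        lambda = lambda / 10;
    else
        lambda = lambda * 10;                        % reject and re-damp
    end
end
```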


Simulation Results

Let's now look at the results of our simulations. The goal is to evaluate the relationship between the image noise level and the difference between the optimized camera parameters and the true camera parameters. We don't optimize over the 3D positions of the object points, which are assumed to be known with high accuracy. This scenario models the optimization carried out during camera calibration, where the positions of the object points (corners of the calibration pattern) are known through direct measurements. One difference between the camera calibration scenario and this simulation is that it is customary to use a planar calibration pattern during camera calibration, so the image coordinates in different images of the pattern are related to each other by a homography. In our simulation, the object points are non-coplanar; their image coordinates follow the epipolar constraint and are not related by a homography.

We expect that as noise is added to the image, the distance between the optimized camera parameters and the true parameters would increase. This is indeed borne out by our simulation, as shown in the plots below.

The pseudo-code for the simulation is shown below.
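Roughly, each run proceeds as follows (the exact noise levels and trial counts are placeholders):

```
for each noise level sigma:
    project the object points into each camera using the true parameters
    add zero-mean Gaussian noise with standard deviation sigma to the image points
    perturb the true camera parameters to form the initial guess
    run LM bundle adjustment over the camera parameters (3D points held fixed)
    record the RMS error between the optimized and true camera parameters
plot the RMS errors against sigma
```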

In our simulations, we use points located on the corners of 3 planes displaced from each other by a fixed distance, leading to 12 points in total. These points are viewed by 4 cameras. The view from one of the cameras is shown below.

No image space scaling or clipping is applied, and all points are assumed to be visible from all 4 poses. Since no image space scaling is applied, the image coordinates are numerically much smaller than usual (on the order of 0.1 instead of hundreds).

The number of independent equations is 2 \times 12 \times 4 = 96, while the number of parameters is 4 \times 9 = 36 (6 extrinsic and 3 intrinsic parameters for each camera).

Calculating the RMS Errors

Denoting the optimized value of parameter P by P_i^{BA} and original value by P_i^{orig}, the RMS error is given by:

\sqrt{\frac{\sum\limits_{i=1}^{\textrm{numCameras}} (P_i^{BA}-P_i^{orig})\cdot(P_i^{BA}-P_i^{orig})^T}{\textrm{numCameras}}}

For orientation, the rotation error matrix is first calculated (R_{err} = R_{orig}R_{BA}^T) and then converted into the axis-angle representation. The axis-angle vector is then used as a regular parameter vector in the expression for the RMS error above.
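A sketch of this computation, where Rorig and Rba are 3×3×numCameras arrays of the true and optimized rotations (illustrative names):

```matlab
% RMS orientation error: convert each error rotation to its axis-angle
% vector and treat it as a regular parameter vector in the RMS formula.
M = size(Rorig, 3);
sumSq = 0;
for j = 1:M
    Rerr  = Rorig(:, :, j) * Rba(:, :, j)';          % rotation error matrix
    theta = acos((trace(Rerr) - 1) / 2);             % rotation angle
    if abs(theta) < 1e-12
        w = zeros(3, 1);                             % negligible rotation
    else
        ax = [Rerr(3,2) - Rerr(2,3);
              Rerr(1,3) - Rerr(3,1);
              Rerr(2,1) - Rerr(1,2)] / (2 * sin(theta));
        w  = theta * ax;                             % axis-angle vector
    end
    sumSq = sumSq + w' * w;
end
rmsRotation = sqrt(sumSq / M)
```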

Simulation Results


As we can see, the error in the estimated intrinsic and extrinsic parameters increases with the standard deviation of the image noise. However, even in the presence of significant noise, bundle adjustment improves on the original estimates of the extrinsic and intrinsic parameters.

Constrained Optimization

In our formulation of Bundle Adjustment, we treat each camera as independent, and hence the optimization converges to different values of the extrinsic and intrinsic parameters for each camera. However, in the vast majority of use cases, a single camera is used to capture multiple images from different poses. The extrinsic parameters for these poses are different, but the intrinsic parameters are the same. Thus, it makes sense to modify the Jacobian matrix to add this constraint. The constraint is enforced by recognizing that the image coordinates in all images, not just those in a particular image, vary with a change in the intrinsic parameters.


In the figure above, we have broken an individual Jacobian entry into its extrinsic and intrinsic components, and also shown the point, image, and camera indices separately. The variation of the image coordinates with the object points (B_{ij}) doesn't change.
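A sketch of how the camera part of the constrained Jacobian can be assembled. With M cameras there are 6M extrinsic columns followed by a single block of 3 shared intrinsic columns (f, u_0, v_0); Aext and Aint are assumed to hold precomputed 2×6 extrinsic and 2×3 intrinsic blocks per observation, and the x and y rows are kept interleaved here for clarity:

```matlab
% Camera part of the Jacobian with shared intrinsics: derivatives with
% respect to (f, u0, v0) from ALL images land in the same 3 columns.
J = zeros(2 * N * M, 6 * M + 3);
row = 1;
for i = 1:N                     % object points
    for j = 1:M                 % cameras
        J(row:row+1, 6*(j-1)+1 : 6*j) = Aext(:, :, i, j);   % camera j only
        J(row:row+1, 6*M+1 : 6*M+3)   = Aint(:, :, i, j);   % shared block
        row = row + 2;
    end
end
```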

Adding this constraint leads to significant improvements in the minimum achieved by the optimization, as shown in the plots above. The improvement is particularly significant for the camera intrinsic parameters. The unconstrained optimization ceases to make improvements to the initial guess for the intrinsic parameters, even in the presence of a small amount of noise, whereas the constrained optimization continues to make improvements.

Reprojection Error

As shown in the graph above, the constrained optimization performs better than the unconstrained optimization in minimizing the reprojection error. Indeed, the reprojection error achieved by the constrained optimization is often lower than the “best case” reprojection error, which is calculated by using the image coordinates obtained by projecting the object points using the correct pose and the noisy image points. It is possible for the overall reprojection error to be lower for poses different from the “correct” poses depending on the geometry at hand.

Relative Values of the Orientation/Translation/Intrinsics Jacobians

In this section, I'll attempt to provide some insight into the relative values of the Jacobians of the orientation, translation, and intrinsic parameters. Consider a point X with coordinates (10, 0, -10) in the local object coordinate system. For simplicity, we'll assume that the object coordinate system is aligned with the camera coordinate system and that there is a translation of (0, 0, 60) between the two coordinate systems. Let's also assume that the camera under consideration is a normalized camera with focal length f = 1 and image center (u_0, v_0) = (0, 0). The image coordinates of X are given by:

u = f\frac{X+T_x}{Z+T_z}+u_0

v = f\frac{Y+T_y}{Z+T_z}+v_0

From the figure below,

\frac{dX}{d\theta}=R\cos(\theta)

For simplicity, let’s represent the rotation of the object coordinate system by Euler angles, with (\phi, \theta, \psi) denoting the angle of rotation about the (x,y,z) axes. Let’s consider the first row of Jacobian matrix of the camera parameters.

\begin{bmatrix}\frac{\partial{u}}{\partial{\bf{a}}}\end{bmatrix} = \begin{bmatrix} \frac{\partial{u}}{\partial{f}} & \frac{\partial{u}}{\partial{u_0}} & \frac{\partial{u}}{\partial{v_0}} & \frac{\partial{u}}{\partial{\phi}} & \frac{\partial{u}}{\partial{\theta}} & \frac{\partial{u}}{\partial{\psi}} & \frac{\partial{u}}{\partial{T_x}} & \frac{\partial{u}}{\partial{T_y}} & \frac{\partial{u}}{\partial{T_z}}\end{bmatrix}

Considering only the rotation about the y axis (the analysis of rotation about other axes is similar),

\begin{bmatrix}\frac{\partial{u}}{\partial{\bf{a}}}\end{bmatrix} = \begin{bmatrix} \frac{X+T_x}{Z+T_z} & 1 & 0 & \frac{fRCos(\theta)}{Z+T_z} & \frac{f}{Z+T_z} & 0 & -\frac{f(X+T_x)}{(Z+T_z)^2}\end{bmatrix}

= \begin{bmatrix} 0.2 & 1 & 0 & 0.2 & 0.02 & 0 & -0.004 \end{bmatrix}

Now let’s consider the first row of the Jacobian of the Object points. Denoting the coordinates of X in the camera coordinate system by X^\prime,

\begin{bmatrix} \frac{\partial{u}}{\partial{\bf{b}}} \end{bmatrix} = \begin{bmatrix} \frac{\partial{u}}{\partial{X^\prime}} & \frac{\partial{u}}{\partial{Y^\prime}} & \frac{\partial{u}}{\partial{Z^\prime}} \end{bmatrix} = \begin{bmatrix} \frac{f}{Z+T_z} & 0 & -\frac{f(X+T_x)}{(Z+T_z)^2} \end{bmatrix} = \begin{bmatrix} 0.02 & 0 & -0.004 \end{bmatrix}
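These values are easy to verify numerically with central differences on the projection equation for u (all numbers from the example above):

```matlab
% Numeric check of the worked example: X = (10, 0, -10), T = (0, 0, 60),
% normalized camera with f = 1 and u0 = v0 = 0.
f = 1; u0 = 0; Tx = 0; Tz = 60; X = 10; Z = -10;
u = @(f, Tx, Tz, X) f * (X + Tx) / (Z + Tz) + u0;

h = 1e-6;                                                 % difference step
dudf  = (u(f+h, Tx, Tz, X) - u(f-h, Tx, Tz, X)) / (2*h)   %  0.2
dudTx = (u(f, Tx+h, Tz, X) - u(f, Tx-h, Tz, X)) / (2*h)   %  0.02
dudTz = (u(f, Tx, Tz+h, X) - u(f, Tx, Tz-h, X)) / (2*h)   % -0.004
dudX  = (u(f, Tx, Tz, X+h) - u(f, Tx, Tz, X-h)) / (2*h)   %  0.02 (= dudTx)
```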

From these numbers, we can make the following observations:

  • The sensitivity of the image coordinates to the intrinsic parameters is much higher than to the rotation and translation parameters. Among the intrinsic parameters, the image center is the most important.
  • The sensitivity of the image coordinates to the rotation parameters depends upon the angle of rotation. The sensitivity to rotation is lowest at glancing angles (when \theta = 90^\circ, \cos(\theta) = 0).
  • The sensitivity of the image coordinates to the object coordinates is of the same order as that to the camera-object translation.

These observations are verified by the actual Jacobian values in our simulations.

Levenberg-Marquardt Optimization

Lastly, we'll consider the effect of the Levenberg and Marquardt modifications used in the Levenberg–Marquardt optimization. Recall that Levenberg's contribution was to introduce a damping factor that is increased or decreased after every iteration depending upon whether the objective function is higher or lower than in the previous iteration. Marquardt added a term that scales each component of the gradient so that there is larger movement along the directions where the gradient is smaller. The number of iterations the optimization takes to converge with these improvements is shown in the table below:

Optimization             Number of Iterations
No modifications         1200
Levenberg improvement    20
Marquardt improvement    9

From this table it is clear that the Levenberg–Marquardt improvements have a significant impact on the speed of convergence. The Marquardt improvement also helps the optimization avoid local minima that result from the interplay between the focal length and the z component of the translation vector. For some orientations, similar reductions in the objective function can be achieved by either decreasing/increasing the z component of the translation vector or decreasing/increasing the focal length. The Marquardt scaling discourages the optimization from favoring changes in focal length over changes in translation.

In the next post, we'll consider the second set of simulations, where we also optimize over the positions of the 3D object points.

References

1. Levenberg–Marquardt algorithm. Wikipedia. https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm. Accessed October 25, 2016.
