Note on POSIT

The purpose of this post is to provide my thoughts on what goes on during an iteration of POSIT. This post is not meant to be a detailed description of POSIT; please refer to [1] and [2] for a detailed and thorough description.

[Figure: posit_explanation — geometry of the POSIT problem]

The figure above demonstrates the geometry of the problem. The camera coordinate system is defined by X_C, Y_C, Z_C. The camera center is denoted by C and the image plane is located at a distance of f from the camera center. The center of the image is denoted by c and thus the vector Cc represents the optical axis of the camera. The object point P is known in the object coordinate system denoted by X_o, Y_o, Z_o. There is an unknown rotation and translation of the object coordinate system with respect to the camera coordinate system. The rotation matrix and translation vector representing this transformation are denoted by \bold{R} and \bold{T} respectively.

The vector P is known only in the object coordinate system, not in the camera coordinate system. This means we don’t know the coordinates of the vector P shown in the figure. Indeed, from the definition of the rotation matrix, the coordinates of P in the camera coordinate system are T_x + R_1\cdot P, T_y + R_2\cdot P, T_z + R_3\cdot P.

The point P projects to a point p on the image. The coordinates of p are known. Thus, the known quantities in POSIT are:

  • The coordinates of P in the object coordinate system
  • The camera intrinsic parameters: the focal length f and the image center c
  • Coordinates of the image point p corresponding to each object point P

From these known quantities, we wish to determine the transformation (R, T) between the camera and the object coordinate system.

First let’s consider the general equation for perspective projection. Following [1],

\begin{bmatrix}wx\\wy\\w\end{bmatrix} = \begin{bmatrix}f\bold{R}_1^T & fT_x \\ f\bold{R}_2^T & fT_y \\ \bold{R}_3^T & T_z \end{bmatrix}\begin{bmatrix}P \\ 1 \end{bmatrix}

The image coordinates x and y are given by:

x=\frac{f\bold{R}_1^T\cdot P + fT_x}{\bold{R}_3^T\cdot P + T_z}, \quad y=\frac{f\bold{R}_2^T\cdot P + fT_y}{\bold{R}_3^T\cdot P + T_z}

Note that P appears both in the numerator and denominator. In the denominator, it adds a contribution equal to the projection of P on the optical axis of the camera. Thus, each image coordinate is scaled in proportion to the distance of the corresponding 3D point from the camera. This is a standard feature of perspective projection. However, because P appears both in the numerator and denominator, we can’t write the equation above in a linear form and apply linear algebra techniques to solve for R and T.
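As a concrete check of the projection equations, here is a minimal sketch in Python/NumPy (function and variable names are my own) that projects object points through a known pose — the quantity POSIT must recover:

```python
import numpy as np

def project(P_obj, R, T, f):
    """Perspective projection of N x 3 object points.

    R (3x3), T (3,) and f are assumed known here purely to illustrate
    the projection model; POSIT's job is to estimate R and T.
    """
    P_cam = P_obj @ R.T + T                 # rows: R_i . P + T_i
    x = f * P_cam[:, 0] / P_cam[:, 2]       # P appears in numerator AND denominator
    y = f * P_cam[:, 1] / P_cam[:, 2]
    return np.stack([x, y], axis=1)
```

Note how the division by `P_cam[:, 2]` is what makes the model nonlinear in R and T.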

Dividing the numerator and denominator by T_z and denoting \frac{f}{T_z} by s, we obtain:

x=\frac{s\bold{R}_1^T\cdot P + sT_x}{\frac{\bold{R}_3^T\cdot P}{T_z} + 1}, y=\frac{s\bold{R}_2^T\cdot P + sT_y}{\frac{\bold{R}_3^T\cdot P}{T_z} + 1}

Again following the notation in [1], let’s denote \frac{\bold{R}_3^T\cdot P}{T_z} + 1 by w. When the depth variation across the object, \bold{R}_3^T\cdot P, is small compared to the object’s distance from the camera, T_z, we have w \approx 1. The w’s depend on the object point coordinates and the object–camera transformation, and are different for each point. Now, if somehow we knew the value of w for each object point, we could write the perspective projection equation in a linear form:

x=\frac{s\bold{R}_1^T\cdot P + sT_x}{w}, y=\frac{s\bold{R}_2^T\cdot P + sT_y}{w}

Multiplying by w on both sides and writing in matrix form,

\begin{bmatrix}wx & wy \end{bmatrix} = \begin{bmatrix}P^T & 1 \end{bmatrix}\begin{bmatrix}s\bold{R}_1 & s\bold{R}_2 \\ sT_x & sT_y \end{bmatrix}

Now we can solve for R and T using linear algebra techniques. It is important to understand that because we fixed the w’s, solving the linear equation above is not equivalent to solving the general perspective projection equation. Instead, the solution corresponds to finding the R and T such that the image coordinates of the scaled orthographic projection of the point P on the plane z=T_z are (wx, wy).
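Given fixed w_k, the linear system above can be solved with ordinary least squares. A sketch in NumPy (the function name and the way s is recovered from the row norms are my own choices, not prescribed by [1]):

```python
import numpy as np

def solve_pose_linear(P_obj, img_pts, w, f):
    """One linear pose solve given fixed w_k.

    P_obj: N x 3 object points, img_pts: N x 2 image points, w: length-N
    weights. Needs at least 4 non-coplanar points for a rank-4 system.
    """
    N = P_obj.shape[0]
    A = np.hstack([P_obj, np.ones((N, 1))])           # N x 4: rows (P_k^T, 1)
    B = w[:, None] * img_pts                          # N x 2: rows (w_k x_k, w_k y_k)
    M, *_ = np.linalg.lstsq(A, B, rcond=None)         # 4 x 2 unknown matrix
    sR1, sR2 = M[:3, 0], M[:3, 1]
    sTx, sTy = M[3, 0], M[3, 1]
    s = 0.5 * (np.linalg.norm(sR1) + np.linalg.norm(sR2))  # recover s = f / T_z
    R1, R2 = sR1 / s, sR2 / s
    R3 = np.cross(R1, R2)                             # complete the rotation matrix
    Tz = f / s
    return np.stack([R1, R2, R3]), np.array([sTx / s, sTy / s, Tz])
```

With noisy data R_1 and R_2 come out only approximately orthonormal; a proper implementation would re-orthogonalize.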

Let’s now look at iteration t of POSIT. From the previous iteration, we have an estimate of the rotation and translation between the object and the camera coordinate system; denote these by R^{(t-1)} and T^{(t-1)}. The object point P transformed by this transformation is shown as P^{\prime}_t in the figure above. Its coordinates in the camera coordinate system are T^{(t-1)}_x + R_1^{(t-1)}\cdot P, T^{(t-1)}_y + R_2^{(t-1)}\cdot P, T^{(t-1)}_z + R_3^{(t-1)}\cdot P. From this rotation and translation, we compute new values for the w_k using w^{(t)}_k = \frac{R_3^{(t-1)} \cdot P_k}{T^{(t-1)}_z}+1, where k is the object point index. Now consider equation 5 in [1]. This equation defines the objective function that is minimized in each iteration of POSIT. The objective function is a sum of terms d_k defined as:
\left\| \begin{bmatrix}P_k^T & 1 \end{bmatrix}\begin{bmatrix}s\bold{R}_1 & s\bold{R}_2 \\ sT_x & sT_y \end{bmatrix} - \begin{bmatrix}w^{(t)}_k x_k & w^{(t)}_k y_k \end{bmatrix} \right\|

The \begin{bmatrix}w^{(t)}_k x_k & w^{(t)}_k y_k \end{bmatrix} term in the equation above represents the scaled orthographic projection of the point where the line of sight through the image point p intersects a plane parallel to the image plane at depth T^{(t-1)}_z + R_3^{(t-1)}\cdot P (denoted by \Pi^{\prime\prime}_t). To see this, consider the line of sight through the image point p. A point on this line of sight can be represented as (cx, cy, cf) for some scalar c. Since p is the image of P under perspective projection, P lies on this line of sight, but we don’t know the corresponding c.

The point of intersection of this line of sight with the plane \Pi^{\prime\prime}_t is obtained by setting cf = T^{(t-1)}_z + R_3^{(t-1)}\cdot P. Thus the coordinates of this point of intersection (shown as P_L) are \left(x\frac{T^{(t-1)}_z + R_3^{(t-1)}\cdot P}{f},\; y\frac{T^{(t-1)}_z + R_3^{(t-1)}\cdot P}{f},\; T^{(t-1)}_z + R_3^{(t-1)}\cdot P\right). From the definition of perspective projection, the image coordinates (denoted by p^{\prime\prime}_t) of the scaled orthographic projection of this point on the plane at T^{(t-1)}_z are therefore \left(fx\frac{T^{(t-1)}_z + R_3^{(t-1)}\cdot P}{fT^{(t-1)}_z},\; fy\frac{T^{(t-1)}_z + R_3^{(t-1)}\cdot P}{fT^{(t-1)}_z}\right) = \left(w^{(t)}_k x,\; w^{(t)}_k y\right).
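This claim is easy to verify numerically. The snippet below (all numeric values are made-up test inputs, not from the post) intersects the line of sight with the plane at depth T_z + R_3\cdot P and checks that its scaled orthographic projection lands at (wx, wy):

```python
import numpy as np

f, Tz = 1.0, 10.0                          # illustrative focal length and depth
R3 = np.array([0.1, 0.2, 0.97])            # third rotation row (illustrative)
P = np.array([0.5, -0.3, 0.8])             # object point in object coordinates
x, y = 0.04, -0.02                         # image point p (illustrative)

Zp = Tz + R3 @ P                           # depth of the plane Pi''
P_L = np.array([x * Zp / f, y * Zp / f, Zp])  # line of sight (cx, cy, cf) at cf = Zp
sop = f * P_L[:2] / Tz                     # scaled orthographic projection at depth Tz
w = R3 @ P / Tz + 1.0
assert np.allclose(sop, w * np.array([x, y]))  # (wx, wy), as claimed
```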

As stated before, the first term in the definition of d_k corresponds to the scaled orthographic projection of P^{\prime}_t, denoted by p^{\prime}_t. Thus, at each iteration of POSIT, we compute the rotation and translation such that the distances between the scaled orthographic projections and (wx, wy) are minimized in a least-squares sense. This makes sense: when we have the correct rotation and translation, the points P, P_L, P^{\prime} (and thus the image points p, p^{\prime\prime}, p^{\prime}) coincide. Thus the vector p^{\prime}p^{\prime\prime} is a measure of how far we are from the correct pose.
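Putting the pieces together, each POSIT iteration is a linear least-squares pose solve followed by a refresh of the w_k from the new pose. A self-contained sketch of the loop (NumPy; the naming is mine, and the recovered R_1, R_2 are only approximately orthonormal during intermediate iterations):

```python
import numpy as np

def posit(P_obj, img_pts, f, n_iter=20):
    """Sketch of the POSIT loop: start from w_k = 1 (the scaled
    orthographic approximation), then refresh w_k from the latest pose.

    P_obj: N x 3 model points, img_pts: N x 2 image points, f: focal length.
    """
    N = P_obj.shape[0]
    A = np.hstack([P_obj, np.ones((N, 1))])       # N x 4 design matrix
    w = np.ones(N)                                # initial guess: w_k = 1
    for _ in range(n_iter):
        B = w[:, None] * img_pts                  # targets (w_k x_k, w_k y_k)
        M, *_ = np.linalg.lstsq(A, B, rcond=None)
        sR1, sR2 = M[:3, 0], M[:3, 1]
        s = 0.5 * (np.linalg.norm(sR1) + np.linalg.norm(sR2))
        R1, R2 = sR1 / s, sR2 / s
        R3 = np.cross(R1, R2)                     # complete the rotation
        Tz = f / s
        w = P_obj @ R3 / Tz + 1.0                 # w_k^{(t)} = R_3 . P_k / T_z + 1
    R = np.stack([R1, R2, R3])
    T = np.array([M[3, 0] / s, M[3, 1] / s, Tz])
    return R, T
```

On synthetic noise-free data with the object small relative to its distance, the w_k converge geometrically and the recovered pose matches the ground truth.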

[1] David P, DeMenthon D, Duraiswami R, Samet H. SoftPOSIT: Simultaneous Pose and Correspondence Determination. International Journal of Computer Vision. 2004;59(3):259–284. doi:10.1023/b:visi.0000025800.10423.1f
[2] DeMenthon DF, Davis LS. Model-based object pose in 25 lines of code. Int J Comput Vision. 1995;15(1-2):123–141. doi:10.1007/bf01450852
