Note on POSIT

The purpose of this post is to provide my thoughts on what goes on during an iteration of POSIT. This post is not meant to be a detailed description of POSIT; please refer to [1] and [2] for a detailed and thorough description.

[Figure: posit_explanation — geometry of the POSIT problem]

The figure above demonstrates the geometry of the problem. The camera coordinate system is defined by X_C, Y_C, Z_C. The camera center is denoted by C and the image plane is located at a distance of f from the camera center. The center of the image is denoted by c and thus the vector Cc represents the optical axis of the camera. The object point P is known in the object coordinate system denoted by X_o, Y_o, Z_o. There is an unknown rotation and translation of the object coordinate system with respect to the camera coordinate system. The rotation matrix and translation vector representing this transformation are denoted by \bold{R} and \bold{T} respectively.

The vector P is known only in the object coordinate system, not in the camera coordinate system. This means we don’t know the coordinates of the vector P shown in the figure. Indeed, from the definition of the rotation matrix, the coordinates of P in the camera coordinate system are T_x + R_1\cdot P, T_y + R_2\cdot P, T_z + R_3\cdot P.

The point P projects to a point p on the image. The coordinates of p are known. Thus, the known quantities in POSIT are:

  • The coordinates of P in the object coordinate system
  • The camera intrinsic parameters: the focal length f and the image center c
  • Coordinates of the image point p corresponding to each object point P

From these known quantities, we wish to determine the transformation (R, T) between the camera and the object coordinate system.

First let’s consider the general equation for perspective projection. Following [1],

\begin{bmatrix}wx\\wy\\w\end{bmatrix} = \begin{bmatrix}f\bold{R}_1^T & fT_x \\ f\bold{R}_2^T & fT_y \\ \bold{R}_3^T & T_z \end{bmatrix}\begin{bmatrix}P \\ 1 \end{bmatrix}

The image coordinates x and y are given by:

x=\frac{f\bold{R}_1^T\cdot P + fT_x}{\bold{R}_3^T\cdot P + T_z}, \quad y=\frac{f\bold{R}_2^T\cdot P + fT_y}{\bold{R}_3^T\cdot P + T_z}

Note that P appears both in the numerator and denominator. In the denominator, it adds a contribution equal to the projection of P on the optical axis of the camera. Thus, each image coordinate is scaled in proportion to the distance of the corresponding 3D point from the camera. This is a standard feature of perspective projection. However, because P appears both in the numerator and denominator, we can’t write the equation above in a linear form and apply linear algebra techniques to solve for R and T.
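As a concrete check of the projection equations, here is a minimal sketch in Python/NumPy (function and variable names are my own) that projects object points through a known pose — the quantity POSIT must recover:

```python
import numpy as np

def project(P_obj, R, T, f):
    """Perspective projection of N x 3 object points.

    R (3x3), T (3,) and f are assumed known here purely to illustrate
    the projection model; POSIT's job is to estimate R and T.
    """
    P_cam = P_obj @ R.T + T                 # rows: R_i . P + T_i
    x = f * P_cam[:, 0] / P_cam[:, 2]       # P appears in numerator AND denominator
    y = f * P_cam[:, 1] / P_cam[:, 2]
    return np.stack([x, y], axis=1)
```

Note how the division by `P_cam[:, 2]` is what makes the model nonlinear in R and T.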

Dividing the numerator and denominator by T_z and denoting \frac{f}{T_z} by s, we obtain:

x=\frac{s\bold{R}_1^T\cdot P + sT_x}{\frac{\bold{R}_3^T\cdot P}{T_z} + 1}, y=\frac{s\bold{R}_2^T\cdot P + sT_y}{\frac{\bold{R}_3^T\cdot P}{T_z} + 1}

Again following the notation in [1], let’s denote \frac{\bold{R}_3^T\cdot P}{T_z} + 1 by w. When the depth variation across the object, \bold{R}_3^T\cdot P, is small compared to the object’s distance from the camera, T_z, we have w \approx 1. The w’s depend on the object point coordinates and the object–camera transformation, and are different for each point. Now, if somehow we knew the value of w for each object point, we could write the perspective projection equation in a linear form:

x=\frac{s\bold{R}_1^T\cdot P + sT_x}{w}, y=\frac{s\bold{R}_2^T\cdot P + sT_y}{w}

Multiplying by w on both sides and writing in matrix form,

\begin{bmatrix}wx & wy \end{bmatrix} = \begin{bmatrix}P^T & 1 \end{bmatrix}\begin{bmatrix}s\bold{R}_1 & s\bold{R}_2 \\ sT_x & sT_y \end{bmatrix}

Now we can solve for R and T using linear algebra techniques. It is important to understand that because we fixed the w’s, solving the linear equation above is not equivalent to solving the general perspective projection equation. Instead, the solution corresponds to finding the R and T such that the image coordinates of the scaled orthographic projection of the point P on the plane z=T_z are (wx, wy).
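Given fixed w_k, the linear system above can be solved with ordinary least squares. A sketch in NumPy (the function name and the way s is recovered from the row norms are my own choices, not prescribed by [1]):

```python
import numpy as np

def solve_pose_linear(P_obj, img_pts, w, f):
    """One linear pose solve given fixed w_k.

    P_obj: N x 3 object points, img_pts: N x 2 image points, w: length-N
    weights. Needs at least 4 non-coplanar points for a rank-4 system.
    """
    N = P_obj.shape[0]
    A = np.hstack([P_obj, np.ones((N, 1))])           # N x 4: rows (P_k^T, 1)
    B = w[:, None] * img_pts                          # N x 2: rows (w_k x_k, w_k y_k)
    M, *_ = np.linalg.lstsq(A, B, rcond=None)         # 4 x 2 unknown matrix
    sR1, sR2 = M[:3, 0], M[:3, 1]
    sTx, sTy = M[3, 0], M[3, 1]
    s = 0.5 * (np.linalg.norm(sR1) + np.linalg.norm(sR2))  # recover s = f / T_z
    R1, R2 = sR1 / s, sR2 / s
    R3 = np.cross(R1, R2)                             # complete the rotation matrix
    Tz = f / s
    return np.stack([R1, R2, R3]), np.array([sTx / s, sTy / s, Tz])
```

With noisy data R_1 and R_2 come out only approximately orthonormal; a proper implementation would re-orthogonalize.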

Let’s now look at iteration t of POSIT. From the previous iteration, we have an estimate of the rotation and translation between the object and the camera coordinate system; denote these by R^{(t-1)} and T^{(t-1)}. The object point P transformed by this transformation is shown as P^{\prime}_t in the figure above. Its coordinates in the camera coordinate system are T^{(t-1)}_x + R_1^{(t-1)}\cdot P, T^{(t-1)}_y + R_2^{(t-1)}\cdot P, T^{(t-1)}_z + R_3^{(t-1)}\cdot P. From this rotation and translation, we compute new values for the w_k using w^{(t)}_k = \frac{R_3^{(t-1)} \cdot P_k}{T^{(t-1)}_z}+1, where k is the object point index. Now consider equation 5 in [1]. This equation defines the objective function that is minimized in each iteration of POSIT. The objective function is a sum of terms d_k defined as:
\left\| \begin{bmatrix}P_k^T & 1 \end{bmatrix}\begin{bmatrix}s\bold{R}_1 & s\bold{R}_2 \\ sT_x & sT_y \end{bmatrix} - \begin{bmatrix}w^{(t)}_k x_k & w^{(t)}_k y_k \end{bmatrix} \right\|

The \begin{bmatrix}w^{(t)}_k x_k & w^{(t)}_k y_k \end{bmatrix} term in the equation above represents the scaled orthographic projection of the point where the line of sight through the image point p intersects a plane parallel to the image plane at depth T^{(t-1)}_z + R_3^{(t-1)}\cdot P (denoted by \Pi^{\prime\prime}_t). To see this, consider the line of sight through the image point p. A point on this line of sight can be represented as (cx, cy, cf) for some scalar c. Since p is the image of P under perspective projection, P lies on this line of sight, but we don’t know the corresponding c.

The point of intersection of this line of sight with the plane \Pi^{\prime\prime}_t is obtained by setting cf = T^{(t-1)}_z + R_3^{(t-1)}\cdot P. Thus the coordinates of this point of intersection (shown as P_L) are \left(x\frac{T^{(t-1)}_z + R_3^{(t-1)}\cdot P}{f},\; y\frac{T^{(t-1)}_z + R_3^{(t-1)}\cdot P}{f},\; T^{(t-1)}_z + R_3^{(t-1)}\cdot P\right). From the definition of perspective projection, the image coordinates (denoted by p^{\prime\prime}_t) of the scaled orthographic projection of this point on the plane at T^{(t-1)}_z are therefore \left(fx\frac{T^{(t-1)}_z + R_3^{(t-1)}\cdot P}{fT^{(t-1)}_z},\; fy\frac{T^{(t-1)}_z + R_3^{(t-1)}\cdot P}{fT^{(t-1)}_z}\right) = \left(w^{(t)}_k x,\; w^{(t)}_k y\right).
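This claim is easy to verify numerically. The snippet below (all numeric values are made-up test inputs, not from the post) intersects the line of sight with the plane at depth T_z + R_3\cdot P and checks that its scaled orthographic projection lands at (wx, wy):

```python
import numpy as np

f, Tz = 1.0, 10.0                          # illustrative focal length and depth
R3 = np.array([0.1, 0.2, 0.97])            # third rotation row (illustrative)
P = np.array([0.5, -0.3, 0.8])             # object point in object coordinates
x, y = 0.04, -0.02                         # image point p (illustrative)

Zp = Tz + R3 @ P                           # depth of the plane Pi''
P_L = np.array([x * Zp / f, y * Zp / f, Zp])  # line of sight (cx, cy, cf) at cf = Zp
sop = f * P_L[:2] / Tz                     # scaled orthographic projection at depth Tz
w = R3 @ P / Tz + 1.0
assert np.allclose(sop, w * np.array([x, y]))  # (wx, wy), as claimed
```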

As stated before, the first term in the definition of d_k corresponds to the scaled orthographic projection of P^{\prime}_t, denoted by p^{\prime}_t. Thus, at each iteration of POSIT, we compute the rotation and translation such that the distances between the scaled orthographic projections and (wx, wy) are minimized in a least-squares sense. This makes sense: when we have the correct rotation and translation, the points P, P_L, P^{\prime} (and thus the image points p, p^{\prime\prime}, p^{\prime}) coincide. Thus the vector p^{\prime}p^{\prime\prime} is a measure of how far we are from the correct pose.
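Putting the pieces together, each POSIT iteration is a linear least-squares pose solve followed by a refresh of the w_k from the new pose. A self-contained sketch of the loop (NumPy; the naming is mine, and the recovered R_1, R_2 are only approximately orthonormal during intermediate iterations):

```python
import numpy as np

def posit(P_obj, img_pts, f, n_iter=20):
    """Sketch of the POSIT loop: start from w_k = 1 (the scaled
    orthographic approximation), then refresh w_k from the latest pose.

    P_obj: N x 3 model points, img_pts: N x 2 image points, f: focal length.
    """
    N = P_obj.shape[0]
    A = np.hstack([P_obj, np.ones((N, 1))])       # N x 4 design matrix
    w = np.ones(N)                                # initial guess: w_k = 1
    for _ in range(n_iter):
        B = w[:, None] * img_pts                  # targets (w_k x_k, w_k y_k)
        M, *_ = np.linalg.lstsq(A, B, rcond=None)
        sR1, sR2 = M[:3, 0], M[:3, 1]
        s = 0.5 * (np.linalg.norm(sR1) + np.linalg.norm(sR2))
        R1, R2 = sR1 / s, sR2 / s
        R3 = np.cross(R1, R2)                     # complete the rotation
        Tz = f / s
        w = P_obj @ R3 / Tz + 1.0                 # w_k^{(t)} = R_3 . P_k / T_z + 1
    R = np.stack([R1, R2, R3])
    T = np.array([M[3, 0] / s, M[3, 1] / s, Tz])
    return R, T
```

On synthetic noise-free data with the object small relative to its distance, the w_k converge geometrically and the recovered pose matches the ground truth.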

[1] David P, DeMenthon D, Duraiswami R, Samet H. SoftPOSIT: Simultaneous Pose and Correspondence Determination. International Journal of Computer Vision. 2004;59(3):259–284. doi:10.1023/b:visi.0000025800.10423.1f
[2] DeMenthon DF, Davis LS. Model-based object pose in 25 lines of code. Int J Comput Vision. 1995;15(1-2):123–141. doi:10.1007/bf01450852
