3D mesh from a collection of images

I recently stumbled upon the Visual Structure From Motion code from Changchang Wu. It’s a beautiful GUI frontend to a collection of utilities that include, among others, multicore bundle adjustment, SIFT on the GPU, and other functionality. Instructions on how to install it on Ubuntu are here. Its output on this dataset of high-res images of a statue is very similar to what is shown in this video. In addition to being easy to use, VisualSFM does not require camera calibration information, although you could configure it to do so. The only disadvantage that I can see is that it does not make it possible to enter known position and translation transformations between pairs of camera poses, which would be very useful for stereo imagery. In other words, due to using monocular camera data, the rotation information is accurate but the translation is not; at best the distances are scaled.  That said, it seems to work very well, and the meshes that it produces look great, particularly if you visually cover the entire object.

3D mesh from a collection of images

The Harris corner detector

My last computer vision assignment had a very cool component, so I thought I’d share it. We were supposed to implement Harris’ corner detector, which is a method for detecting corners in an image. By a “corner” we usually mean the point at which a pair of edges cross each other. For the purposes of this  method though, a corner is a pixel such that the intensities (of one of the RGB channels) within a small neighbourhood around it are significantly different than the intensities in the neighbourhood of a nearby pixel. A more detailed explanation of why the method works and how to implement it can be found here. These are the results of the algorithm, applied first to a simple image consisting of rectangles and then a real image (keep in mind that a threshold had to put on the number of corners marked on the image to avoid cluttering it):

Harris corner detection on synthetic image

Harris corner detection

The Harris corner detector

Affine spaces & projective geometry

What would happen if we took \mathbb {R}^{3} and pretended that we didn’t know where it’s origin was? One of the ramifications would be that points and vectors in that space would have to be considered different entities. When the origin O is fixed and we know it any point P in the space can be represented as the vector OP. This means that when we are working in an affine space (any point can be the origin), points are regarded as fixed locations in space and vectors are regarded as displacements without an origin — two different things.

Formally, an affine space consists of a vector space V and a set of points P such that:

  • if p_1,p_2 \in P then p_1-p_2 \in V
  • if p \in P and v \in V then v+p \in P

So, to specify an affine space we need a basis v_1,v_2 for V (suppose it’s the image plane) and a point p \in P to act as the “unknown” origin. v_1, v_2, p becomes the reference frame of the affine space. As usual, every vector in the affine space can be expressed as {c_1}{v_1}+{c_2}{v_2}+0p, while points are expressed as {a_1}{v_1}+{a_2}{v_2}+1p. In other words, vectors are of the form (c_1, c_2, 0) while points are of the form (a_1, a_2, 1).

Does this remind you of anything? Yup, homogeneous coordinates. I believe this is why my lecture notes mentioned that “(X,Y,Z) is interpreted as a line (vector, horizon) when Z=0 and as a point when Z=1.” It makes much more sense now then back when I first looked at that black magic of a sentence.

OK, so far so good. What really confuses me now, and confused some of my friends when they took the graphics course at UofT, is the question of why is it necessary to use affine spaces & homogeneous coordinates (and thus differentiate between points and vectors) in computer graphics? Why can’t we use our good old euclidean geometry? I believe the answer has to do with the types of transformations that euclidean geometry allows: only rigid-body transformations (i.e. identity, translations and rotations) are allowed and they preserve distances and angles.

Which classes of transformations are missing from that list and might be very useful in computer graphics? Linear transformations (e.g. scaling, reflection and shear), affine transformations (they preserve parallel lines) and projective transformations (artistic perspective). For instance, euclidean geometry says that two parallel lines never meet thus failing to describe perspective and the vanishing point on the horizon:

Projective geometry allows projective transformations such as the above. Euclidean geometry is a subset of affine and that in turn is a subset of projective, with respect to the classes of transformations each one includes. So, I presume that the answer to my question  is that in computer graphics people use projective geometry because it makes the end result much more realistic.

Some links to whet your appetite: #1, #2, #3, #4, #5, #6, #7

Affine spaces & projective geometry

Homogeneous coordinates

My first two lectures on computer graphics left me puzzled with some unanswered questions that have to do with basic projective geometry and its transformations. I’ll try to collect my thoughts and some links, and write them down primarily for the sake of summarizing to reduce my own confusion.

Basically, both in computer vision and in computer graphics the first geometric problem that we study is that of projecting points from a 3D scene to a 2D image plane (think of it as a photograph), using the pinhole camera model. The projection looks like this:

An important thing to notice on this picture is that any point that lies on one of the dotted lines will be projected to the same image location. This leads us to write (X,Y,Z)\sim(cX,cY,cZ) for any nonzero c, where (X,Y,Z) is the euclidean coordinate system that we are accustomed to. This “equality” holds in the sense of equivalence classes, not in the sense of euclidean geometry.

Using a similar triangles argument we can prove that the pixel (x,y) on the image is equal to (\frac{X}{Z}f, \frac{Y}{Z}f). However, (x,y) can also be viewed as part of the 3D scene like this (x,y,f)=(\frac{X}{Z}f, \frac{Y}{Z}f, f) \sim (\frac{X}{Z}, \frac{Y}{Z}, 1)\sim(X,Y,Z) provided that f,Z\neq 0. The correspondence between image pixels (x,y) and representatives of the equivalence classes of \sim is one-to-one. The equivalence class corresponding to an image pixel is called the homogeneous coordinates of that pixel. Many of the articles that I found online assumed f=1 for the sake of simplicity and mentioned that the mapping (x,y)\rightarrow(x,y,1) is “just a trick” to be followed blindly because {It Just Works}^\copyright

What happens when Z=0? In that case (X,Y,Z) does not correspond to any point in the euclidean space, but to a unique point-at-infinity along the direction (X,Y). The proof is in the first link.

Here are some of the links that might prove to be helpful for this topic: #1, #2

Homogeneous coordinates