DOI: 10.4324/9780415249126-W047-1
Version: v1, Published online: 1998

6. Computational models of vision: object recognition

Late or high-level visual processes use the representations of depth and surface orientation produced by early vision for tasks such as object recognition, locomotion and visually guided manipulation. Marr’s own account of late visual processing is rather sketchy. His concrete proposals concern the computational level of description, with little or no detail supplied at the algorithmic level. In general, computational models of high-level vision are not as well developed as accounts of early visual processes. The difficulty is due in part to the fact that later processing is hypothesis- (or goal-)driven, and hence cognitively penetrable. The input to these processes is not limited to information contained in the image. Object recognition, for example, makes use of specific knowledge about objects in the world. This knowledge is usually characterized as a catalogue of object types stored in long-term memory. It is worth noting that only at this rather late stage does the visual system do anything like identify what Gibson calls ‘affordances’, and in computational accounts such identification is typically treated as a process of categorization, in other words, as a psychological process (see Concepts §1).

Various types of computational models of object recognition have been proposed. According to the simplest models, recognizing an object currently in view involves comparing it with previously stored views of objects and selecting the one that most resembles it. A problem with this approach is that it fails to explain our ability to recognize objects from novel views that do not straightforwardly resemble any previously stored views.
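The stored-view model and its weakness can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: views are reduced to short lists of numeric features, resemblance is measured by Euclidean distance, and the labels and numbers are invented. The source does not commit to any particular similarity measure.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize_by_stored_views(view, stored_views):
    """Return (label, distance) of the stored view that most resembles `view`."""
    best_label, best_dist = None, math.inf
    for label, stored in stored_views:
        d = distance(view, stored)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist

# Hypothetical memory: two stored views of a cup, one of a book.
stored = [
    ("cup", [1.0, 0.0, 0.2]),
    ("cup", [0.9, 0.1, 0.3]),
    ("book", [0.0, 1.0, 0.8]),
]

familiar_cup_view = [0.95, 0.05, 0.25]  # close to the stored cup views
novel_cup_view = [0.1, 0.9, 0.1]        # a cup seen from an unfamiliar angle

print(recognize_by_stored_views(familiar_cup_view, stored)[0])  # cup
print(recognize_by_stored_views(novel_cup_view, stored)[0])     # book
```

In this toy case the novel view of the cup happens to lie closer to the stored view of the book, so the nearest-view rule misclassifies it. This is the failure on novel views that the text describes.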

More promising are accounts that treat object recognition as associating with the current view of the object a description of the object type, perhaps in addition to previously stored views of representative examples. Here again, different approaches are possible. ‘Invariant-property’ accounts assume that the possible retinal projections of an object typically share higher-level invariant properties that are preserved across the various transformations that the object may undergo. Such proposals face the same problem as Gibson’s account of higher-order invariants. For most object types it has proved impossible to find specifiable properties of the image that are common to all possible recognizable views.

The ‘decomposition’ approach to object recognition maintains that objects are identified on the basis of prior recognition of their component parts. An assumption of this approach is that the relevant part–whole relations are invariant and detectable in all possible views where the subject would recognize the object. The most developed proposal is Irving Biederman’s ‘recognition by components’ theory (1990), according to which a given view of an object can be represented as an arrangement of simple primitive volumes called ‘geons’ (for ‘geometric icons’). Geons can themselves be characterized in terms of viewpoint-invariant properties, and, proponents of the theory claim, are recognizable even in the presence of visual noise. In general, though, the decomposition approach to object recognition has proved to be fairly limited in its application. Many objects do not decompose in a natural way into easily characterizable parts; and for many of those that do, the decomposition is insufficient to specify the object in question.
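A rough sketch may help to show both how a part-based description could be matched against a catalogue and why the decomposition can fail to single out one object. The part names, relation names and catalogue entries below are hypothetical, chosen only for illustration; they are not taken from Biederman's inventory of geons, and the sketch ignores how the parts are detected in the image in the first place.

```python
def describe(parts, relations):
    """A structural description: a set of part types plus a set of
    (part, relation, part) triples. Multiplicity of parts is ignored."""
    return (frozenset(parts), frozenset(relations))

# Hypothetical catalogue of object types held in long-term memory.
CATALOGUE = {
    "mug": describe({"cylinder", "curved_tube"},
                    {("curved_tube", "side_of", "cylinder")}),
    "pail": describe({"cylinder", "curved_tube"},
                     {("curved_tube", "top_of", "cylinder")}),
    "briefcase": describe({"box", "curved_tube"},
                          {("curved_tube", "top_of", "box")}),
    "suitcase": describe({"box", "curved_tube"},
                         {("curved_tube", "top_of", "box")}),
}

def recognize_by_components(description, catalogue):
    """Return every object type whose stored description matches exactly."""
    return sorted(name for name, stored in catalogue.items()
                  if stored == description)

seen_mug = describe({"cylinder", "curved_tube"},
                    {("curved_tube", "side_of", "cylinder")})
seen_case = describe({"box", "curved_tube"},
                     {("curved_tube", "top_of", "box")})

print(recognize_by_components(seen_mug, CATALOGUE))   # ['mug']
print(recognize_by_components(seen_case, CATALOGUE))  # ['briefcase', 'suitcase']
```

The first query is resolved by the arrangement of parts alone. The second returns two candidates, because at this level of description the two object types decompose identically; something beyond the decomposition would be needed to tell them apart.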

A third strategy, known as the ‘alignment’ approach, suggests that the visual system detects the presence of transformations between the current view of an object and a stored model, and can ‘undo’ the transformation to achieve a correspondence between the two. For example, suppose that the current view of the object differs from the model stored in memory because the object has undergone a three-dimensional rotation and moved further away from the viewer. On the current proposal, the visual system first detects the nature of the transformations, and then performs them in reverse on the current view to bring it into ‘alignment’ with the stored model (assuming that the object is rigid). The main problem for this approach, as for the other proposals, is its limited applicability. It is only feasible for a small range of possible transformations that an object can undergo (for example, rotation and scaling) and then only for a limited range of objects. (Imagine detecting and ‘undoing’ the rotation of a crumpled piece of newspaper.)
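The detect-then-undo idea can be sketched under strong simplifying assumptions. The example below works in two dimensions rather than three, represents each point as a complex number, restricts the transformation to rotation, uniform scaling and translation, and assumes that the correspondence between points in the view and points in the stored model is already known. The least-squares estimate is one convenient choice for illustration, not a method proposed in the source.

```python
import cmath
import math

def estimate_similarity(model, view):
    """Least-squares fit of complex a (rotation and scale) and t (shift)
    such that view[i] is approximately a * model[i] + t."""
    n = len(model)
    m_mean = sum(model) / n
    v_mean = sum(view) / n
    num = sum((m - m_mean).conjugate() * (v - v_mean)
              for m, v in zip(model, view))
    den = sum(abs(m - m_mean) ** 2 for m in model)
    a = num / den
    t = v_mean - a * m_mean
    return a, t

def undo(view, a, t):
    """Apply the estimated transformation in reverse to the current view."""
    return [(v - t) / a for v in view]

def alignment_error(model, view):
    """Largest point-to-point gap left after aligning the view to the model."""
    a, t = estimate_similarity(model, view)
    aligned = undo(view, a, t)
    return max(abs(p - q) for p, q in zip(aligned, model))

# Stored model: corners of a rigid rectangle.
model = [complex(0, 0), complex(2, 0), complex(2, 1), complex(0, 1)]

# Current view: the same object rotated by 40 degrees, shrunk by half
# (as if further away) and shifted.
a_true = 0.5 * cmath.exp(1j * math.radians(40))
rigid_view = [a_true * p + complex(3, -1) for p in model]

# A non-rigid change: one corner is displaced, as when an object deforms.
deformed_view = list(rigid_view)
deformed_view[2] += complex(0.5, 0.5)

print(alignment_error(model, rigid_view) < 1e-9)     # True
print(alignment_error(model, deformed_view) > 0.1)   # True
```

For the rigid object the reversed transformation brings the view back onto the stored model almost exactly. Once the object deforms, no rotation and scaling can be undone to restore the match, which is the limitation the crumpled-newspaper example points to.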

‘Mixed’ approaches to object recognition attempt to extend the range of applicability of the decomposition and alignment approaches by combining elements of the two, positing separate identification systems that operate in parallel. While mixed accounts appear promising, they face the additional burden of explaining how the outputs of the two recognition systems are combined.

Citing this article:
Egan, Frances. Computational models of vision: object recognition. Vision, 1998, doi:10.4324/9780415249126-W047-1. Routledge Encyclopedia of Philosophy, Taylor and Francis,
Copyright © 1998-2024 Routledge.