With the development of the World-Wide Web there has been an explosion in the number of digital images available. To search such a large and varied image repository, efficient and effective techniques for retrieving images based on their semantic content need to be developed. Keyword annotation is the traditional image retrieval paradigm. However, given the size of the image repositories, keyword annotation becomes infeasible because of the amount of labor required. A second difficulty with textual annotations is that the rich content of images and the subjectivity of human perception make it hard for users to describe the images they want with keywords that match the wide range of images they may be seeking. To overcome these difficulties, content-based image retrieval (CBIR) was proposed in the early 1990s so that the user could provide a non-textual query. In particular, we consider the scenario where the user provides a query image Q that demonstrates, by example, the kind of images that the user would like retrieved from the image repository.
Most CBIR systems extract a signature for every image based on its pixel values and on aggregate information computed on larger blocks or segments of the image. A distance measure defined between signatures is then used to rank the images in the repository by their similarity to an image provided by the user. There has been significant work within the area of relevance feedback, much of which applies machine learning techniques to learn a weighting for the low-level features that are in turn used in computing the similarity between images. However, these techniques are still global in that all portions of the provided image(s) are used in ranking other images. Consider a user who wants images that contain mountains; the images that user finds desirable are likely to contain other objects (e.g., trees, lakes, clouds) as well. By learning a weighting for the low-level features, one can learn, for example, that texture, not color, is important. However, standard supervised learning does not provide any mechanism to learn which segments are the important ones.
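As an illustration of the global ranking scheme described above, the following is a minimal sketch and not the implementation of any particular CBIR system. It assumes each image has already been reduced to a fixed-length numeric signature and uses a weighted Euclidean distance; the feature names, file names, signature values, and weights are all hypothetical.

```python
import math

def weighted_distance(sig_a, sig_b, weights):
    """Weighted Euclidean distance between two equal-length signatures."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for a, b, w in zip(sig_a, sig_b, weights)))

def rank_repository(query_sig, repository, weights):
    """Return image names ordered from most to least similar to the query."""
    return sorted(repository,
                  key=lambda name: weighted_distance(query_sig,
                                                     repository[name],
                                                     weights))

# Hypothetical whole-image signatures: [mean color, texture energy, edge density].
repository = {
    "mountain.jpg": [0.2, 0.9, 0.7],
    "beach.jpg": [0.8, 0.1, 0.2],
    "forest.jpg": [0.3, 0.7, 0.6],
}
query = [0.25, 0.85, 0.7]

# Hypothetical weights of the kind relevance feedback might learn:
# texture and edges matter, color is nearly ignored.
weights = [0.1, 1.0, 1.0]

ranking = rank_repository(query, repository, weights)
print(ranking)
```

Because every signature summarizes the whole image, the weights can emphasize texture over color but cannot single out the mountain region itself, which is the limitation noted above.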
The multiple-instance (MI) learning model is an interesting generalization of the standard supervised learning model that has been extensively studied for the problem of drug discovery. As applied to CBIR, each example in the MI model is a collection (or bag) of segments, where each segment in turn can be represented using a combination of low-level features that capture color, color layout, texture and shape. Instead of computing a signature for the image as a whole, a signature is computed for each segment of the image. Then, along with learning a vector for weighting the low-level features of each segment, the MI model learns which segment(s) are important. Just as relevance feedback has been proposed as an alternative to having a user directly specify the scale factors for the low-level features, we propose a segmentation-based technique like Blobworld that uses MI learning, rather than user interaction, to determine which segments are important and to weight these segments according to their importance.
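The bag-of-segments representation can be sketched as follows. This is only an illustration under stated assumptions, not the method of the paper: it assumes an MI learner has already produced a target feature point and a feature-weight vector (both hard-coded and hypothetical here), and it scores a bag by its closest segment, which is one common MI convention rather than something the text above specifies. All file names and feature values are made up.

```python
import math

def segment_distance(segment, target, weights):
    """Weighted Euclidean distance from one segment signature to the target."""
    return math.sqrt(sum(w * (s - t) ** 2
                         for s, t, w in zip(segment, target, weights)))

def bag_distance(bag, target, weights):
    """Score a bag (image) by its closest segment (assumed MI convention)."""
    return min(segment_distance(seg, target, weights) for seg in bag)

def best_segment(bag, target, weights):
    """Index of the segment that best matches the target."""
    return min(range(len(bag)),
               key=lambda i: segment_distance(bag[i], target, weights))

# Each image is a bag of hypothetical per-segment signatures: [color, texture].
bags = {
    "alps.jpg": [[0.9, 0.1], [0.4, 0.88], [0.2, 0.5]],  # sky, mountain, lake
    "city.jpg": [[0.9, 0.1], [0.5, 0.3]],               # sky, buildings
}

# Hypothetical learned hypothesis: a "mountain-like" point in feature space,
# with weights saying texture matters and color does not.
target = [0.5, 0.9]
weights = [0.0, 1.0]

ranked = sorted(bags, key=lambda name: bag_distance(bags[name], target, weights))
important = best_segment(bags["alps.jpg"], target, weights)
print(ranked, important)
```

In this sketch the sky and lake segments of `alps.jpg` do not affect its score, so an image is ranked by the one region that matters instead of by a signature of the whole image.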
More detail can be found in "Localized Content-Based Image Retrieval," an invited paper presented at the ACM Workshop on Multimedia Image Retrieval, November 2005.