IDR Team Summary 2
Identify the mathematical and computational tools that are needed to bring recent insights from theoretical image science and rigorous methods of task-based assessment of image quality into routine use in all areas of imaging.
There is an emerging consensus in the biomedical-imaging community that image quality must be defined and quantified in terms of the performance of specific observers on specific tasks of medical or scientific interest. Generically, the tasks can be classification of the objects being imaged, estimation of object parameters, or a combination of both. The means by which the task is performed is called the observer, a term that can refer to a human, some ad hoc computer algorithm, the ideal Bayesian observer who gets the best possible task performance, and various linear approximations to the ideal observer. For any task, observer, imaging system, and class of objects, a scalar figure of merit (FOM) can be defined by averaging the observer performance, either analytically or numerically, over many statistically independent image realizations. The FOM can then be used to compare and optimize imaging systems for the chosen task, but there are many mathematical and statistical details that must be observed in order to get meaningful FOMs in studies of this kind.
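As a rough illustration of how such an FOM can be computed numerically, the sketch below estimates the area under the ROC curve (AUC) for a linear observer on a simple detection task by averaging over many independent image realizations. The signal shape, the white Gaussian noise model, the matched-filter template, and all function names are assumptions made for this example, not part of the team's assignment.

```python
import random

def simulate_image(signal, noise_sd, signal_present, rng):
    """One noisy image realization (list of pixels); white Gaussian noise is assumed."""
    return [(s if signal_present else 0.0) + rng.gauss(0.0, noise_sd)
            for s in signal]

def linear_observer(image, template):
    """Linear observer: the test statistic is the inner product of image and template."""
    return sum(g * w for g, w in zip(image, template))

def auc(scores_absent, scores_present):
    """Wilcoxon-Mann-Whitney estimate of the area under the ROC curve."""
    wins = 0.0
    for t1 in scores_present:
        for t0 in scores_absent:
            if t1 > t0:
                wins += 1.0
            elif t1 == t0:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(scores_present) * len(scores_absent))

def estimate_fom(signal, noise_sd, n_images=300, seed=0):
    """Monte Carlo FOM: AUC averaged over independent image realizations."""
    rng = random.Random(seed)
    template = signal  # matched filter (hypothetical choice of observer template)
    absent = [linear_observer(simulate_image(signal, noise_sd, False, rng), template)
              for _ in range(n_images)]
    present = [linear_observer(simulate_image(signal, noise_sd, True, rng), template)
               for _ in range(n_images)]
    return auc(absent, present)
```

Calling `estimate_fom([1.0] * 16, 1.0)` and `estimate_fom([0.1] * 16, 1.0)` compares a strong and a weak signal under the same noise; the same pattern could in principle compare two imaging systems on one task.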
First, real objects are functions of several continuous variables (hence vectors in an infinite-dimensional Hilbert space), but digital images are sets of discrete numbers, which can be organized as finite-dimensional vectors. Imaging systems that map functions to finite vectors are called continuous-to-discrete (CD) mappings; a great deal is known about their properties if the systems are linear and nonrandom, but little has been done for nonlinear systems or ones that have unknown or randomly varying properties.
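For the linear case, a CD mapping can be written compactly. The notation below follows one common convention and is only a sketch; the symbols $h_m$ (sensitivity function of the $m$-th measurement), $n_m$ (measurement noise), and $M$ (number of measurements) are introduced here for illustration.

```latex
% Linear continuous-to-discrete (CD) mapping: object function f -> data vector g
g_m = \int h_m(\mathbf{r})\, f(\mathbf{r})\, d\mathbf{r} + n_m,
\qquad m = 1, \dots, M,
% or, in operator form,
\mathbf{g} = \mathcal{H} f + \mathbf{n}.
```

Here $f(\mathbf{r})$ is the object, a function of continuous variables, and $\mathbf{g}$ is the finite vector of $M$ recorded values.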
Because the FOMs are statistical, some collection or ensemble of objects must be considered, and stochastic models of the object ensemble are needed. For objects regarded as functions, important statistical descriptors include the mean object, various single-point and multipoint probability density functions (PDFs), the auto-covariance function, and the characteristic functional (an infinite-dimensional counterpart of a characteristic function, from which all statistical properties of the object ensemble can be derived). Each of these descriptors has a finite-dimensional counterpart when the objects are modeled, for example, as a collection of voxels, but great care must be exercised in the discretization.
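These descriptors can be written out explicitly. The block below is a sketch for real-valued objects in one common convention (the factor of $2\pi$ in the exponent is a convention, and the symbols $\xi$, $\boldsymbol{\theta}$, and $\boldsymbol{\rho}$ are introduced here for illustration); angle brackets denote averages over the object ensemble.

```latex
% Mean object and auto-covariance function
\bar{f}(\mathbf{r}) = \langle f(\mathbf{r}) \rangle_f ,
\qquad
K_f(\mathbf{r}, \mathbf{r}') =
  \big\langle [f(\mathbf{r}) - \bar{f}(\mathbf{r})]\,
              [f(\mathbf{r}') - \bar{f}(\mathbf{r}')] \big\rangle_f .

% Characteristic functional of the object ensemble
\Psi_f(\xi) = \big\langle \exp[-2\pi i\,(\xi, f)] \big\rangle_f ,
\qquad
(\xi, f) = \int \xi(\mathbf{r})\, f(\mathbf{r})\, d\mathbf{r} .

% Finite-dimensional counterpart for a voxel vector \theta
\psi_{\boldsymbol{\theta}}(\boldsymbol{\rho}) =
  \big\langle \exp(-2\pi i\, \boldsymbol{\rho}^{T} \boldsymbol{\theta}) \big\rangle_{\boldsymbol{\theta}} .
```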
The object randomness leads to randomness in the images; therefore, it is important to understand how to transform the object statistics through the imaging system; again, nonlinear systems pose difficulties. In addition, there is always noise arising from the measurement process, for example Gaussian noise in the electronics or Poisson noise in photon-counting detectors. Statistical description of noise in raw image data for a specific object may be straightforward, but the resulting statistical descriptors must be averaged over the object ensemble in computing FOMs. Moreover, many imaging systems themselves must be described stochastically. When image processing or reconstruction algorithms are used, all statistical properties must be expressed after the processing, because that is the point where the observer performance is determined. Interesting algorithms are often nonlinear.
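A small numerical sketch of the averaging step: for a photon-counting pixel, the noise for a fixed object is Poisson, so its conditional variance equals its conditional mean, and by the law of total variance the variance over the whole ensemble is the average of the conditional means plus the variance of those means. The single-pixel setup, the uniform distribution of mean counts, and the function names below are assumptions chosen only to keep the example short.

```python
import math
import random
import statistics

def sample_poisson(mean, rng):
    """Draw one Poisson variate (Knuth's multiplication method; fine for small means)."""
    threshold = math.exp(-mean)
    k = 0
    p = 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def pixel_variance_over_ensemble(n_objects=20000, seed=1):
    """Compare the observed pixel variance with the total-variance prediction."""
    rng = random.Random(seed)
    # Object randomness: each object yields a different mean count at this pixel
    # (the uniform range 5..15 is a hypothetical object ensemble).
    means = [rng.uniform(5.0, 15.0) for _ in range(n_objects)]
    # Measurement noise: Poisson counts, given the object.
    counts = [sample_poisson(m, rng) for m in means]
    observed = statistics.pvariance(counts)
    # Average conditional (Poisson) variance plus variance of the conditional means.
    predicted = statistics.mean(means) + statistics.pvariance(means)
    return observed, predicted
```

The two returned values should agree to within sampling error, showing that object variability inflates the pixel variance beyond the pure Poisson level.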
Two further issues are how to quantify the uncertainty in estimates of FOMs and how to determine the statistical significance of differences in estimated FOMs. Approaches to these issues include bootstrap and jackknife resampling, Monte Carlo simulation, and theoretical analysis.
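As one possible illustration of the resampling approach, the sketch below computes a percentile-bootstrap interval for an AUC estimate from two sets of observer test statistics. The choice of AUC as the FOM, the percentile method, and the function names are assumptions for this example; other FOMs and bootstrap variants could be substituted.

```python
import random

def auc(scores_absent, scores_present):
    """Wilcoxon-Mann-Whitney estimate of the area under the ROC curve."""
    wins = 0.0
    for t1 in scores_present:
        for t0 in scores_absent:
            if t1 > t0:
                wins += 1.0
            elif t1 == t0:
                wins += 0.5
    return wins / (len(scores_present) * len(scores_absent))

def bootstrap_auc_interval(absent, present, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) for the AUC."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample each class with replacement, keeping the original sample sizes.
        resampled_absent = rng.choices(absent, k=len(absent))
        resampled_present = rng.choices(present, k=len(present))
        estimates.append(auc(resampled_absent, resampled_present))
    estimates.sort()
    # Percentile interval taken from the sorted bootstrap estimates.
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return auc(absent, present), lower, upper
```

If the intervals obtained for two systems on the same task overlap heavily, the difference in their estimated FOMs may not be statistically meaningful; a paired resampling of the difference itself would be a more direct test.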
All of this requires efficient and realistic simulation tools. The objects, systems, processing algorithms, and observers must all be included in a complete simulation, and the code must be validated.
What current imaging applications would benefit from applying the principles of task-based assessment of image quality? What are the current methods of image-quality assessment in each? What are the important tasks? Are human observers customarily used?
What new applications would open up in various fields if we used higher-dimensional images, such as video sequences, temporally resolved 3-D images, or spectral images?
For each application identified above, how should one model the imaging system? Are nonlinear models needed? Should the systems be described stochastically? What simulation code is now used?
Again for each application identified above, what is known about statistical descriptions of the objects being imaged and of the resulting images? What are the important noise sources? Are statistical models used currently in image analysis or pattern recognition for this field? Are databases of sample images readily available?
What new mathematical or computational tools might be needed for the applications identified? Are new image reconstruction algorithms, or new ways of applying and analyzing existing algorithms, needed? Is further work needed on noise characterization, especially in processed or reconstructed images? Are current simulation methods fast enough and sufficiently accurate?
Is it important to have assessment methods that use real rather than simulated data? What gold standards would be used for assessing task performance with real data? Would there be interest in methods for assessment with real data but with no reliable gold standard?
Barrett HH and Myers KJ. Foundations of Image Science; John Wiley and Sons; Hoboken, NJ, 2004.
Barrett HH and Myers KJ. Statistical characterization of radiological images: basic principles and recent progress. Proc SPIE 2007;6510:651002. Accessed online June 15, 2010.
Clarkson E, Kupinski MA, and Barrett HH. Transformation of characteristic functionals through imaging systems. Opt Express 2002;10(13):536-39. Accessed online June 15, 2010.
Kupinski MA, Hoppin JW, Clarkson E, and Barrett HH. Ideal-observer computation in medical imaging with use of Markov-chain Monte Carlo. J Opt Soc Am A 2003;20:430-8. Accessed online June 15, 2010.
Kupinski MA, Clarkson E, Hoppin JW, Chen L, and Barrett HH. Experimental determination of object statistics from noisy images. J Opt Soc Am A 2003;20:421-9. Accessed online June 15, 2010.
Because of the popularity of this topic, two groups explored this subject. Please be sure to review the second write-up, which immediately follows this one.
IDR TEAM MEMBERS—GROUP A
Alireza Entezari, University of Florida
Joyce E. Farrell, Stanford University
James A. Ferwerda, Rochester Institute of Technology
Alyssa A. Goodman, Harvard University
Farzad Kamalabadi, University of Illinois
Matthew A. Kupinski, University of Arizona
Zhi-Pei Liang, University of Illinois at Urbana-Champaign
Patrick J. Wolfe, Harvard University
Michael Glenn Easter, New York University
IDR TEAM SUMMARY—GROUP A
Michael Glenn Easter, NAKFI Science Writing Scholar, New York University
In November, the National Academies Keck Futures Initiative brought together top imaging scientists from across the country for an interdisciplinary conference on Imaging Science. IDR Team 2A was asked to consider the mathematical and computational tools that are needed to bring recent insights from theoretical image science and rigorous methods of task-based assessment of image quality into routine use in all areas of imaging.
Team 2A comprised researchers with a broad range of imaging knowledge, including expertise in consumer imaging (imaging for consumer products), optical imaging, and imaging in engineering, astronomy, and computer science. During two days at the conference, the team debated at length how best to identify the tools needed to bring task-based assessment of image quality into routine use.
These images could be of anything: a tumor, land that has been burned by a forest fire, or a prototype of a part for an automobile. Unfortunately, no image is perfect. It is likely there will always be errors, but the discussion of Team 2A aimed to illuminate these errors so that uncertainty in images could be minimized—and for good reason.
Imagine, for example, that you are a doctor. One day a patient is referred to you who is exhibiting signs of a brain tumor: impaired judgment, memory loss, and impaired senses of smell and vision. All of the signs are there. You run an MRI scan of the patient’s brain. Once the scans come back, you scrutinize them. What do you find?
Even when you know what you’re looking for in the image, what you see may ultimately depend on the accuracy and detail of the image. Inaccuracy and insufficient detail are imaging’s enemies; they make an image less true and therefore less useful than it needs to be. If you, as the doctor, can estimate and compensate for imaging errors more accurately, the image becomes more useful for the task of providing better patient care.
The task is a critical part of this scenario. Above, you were probably looking for a tumor, so the “task” of the image from the MRI scan was to show the presence or absence of a tumor. Any errors in that image are thus made more or less relevant depending on whether the error affects your ability to see a tumor in that image. This is the essence of task-based assessment of image quality.
In task-based assessment, the “quality” of the image is determined by its usefulness to the scientist, doctor, or other professional using it (the “observer”). This usefulness can be quantified, and it often needs to be if the observer wants to know how helpful the image is going to be or how good an imaging system is. This quality score for the image can be termed a “figure of merit” (FOM).
An FOM can be any measure of the image’s quality. In task-based assessment of image quality, however, the FOM should ideally represent the ability of the image to help the observer complete the “task,” whether the task is detecting a tumor on an MRI, measuring the power spectrum of microwave background in astronomy research, or classifying a forest as deciduous versus coniferous from a remote sensing image.
For an imaging system, the FOM represents performance ability, that is, how helpful the produced image is. System performance is affected by many factors, error perhaps being the most significant. With that in mind, the team began to decipher, discipline by discipline, how to identify distinct sources of error in imaging. Once these sources had been identified, their effects could be evaluated. As the team began to see the similarities between the sources of error in various fields, they began to reevaluate the textbook definition of the imaging process.
Traditionally, the imaging process includes: (1) the object, which is captured by the (2) imaging system, at which point (3) noise is introduced before the image is viewed by the (4) observer, who then can assign the image an FOM.
But this framework does not account for all the sources of uncertainty that affect the performance of a system and the ability of an image to aid in task completion. As the team discussed the many steps during the imaging process into which uncertainty could creep, a pattern emerged that prompted a new-and-improved flowchart of events in the imaging process:
The object is illuminated by a passive or active source, during which uncertainty exists in the illumination’s spectrum, intensity, direction, and time (as well as interactions between those variables).
The object itself, whether it is real, phantom, or simulated, has uncertainty in its physical and biological properties.
The emergent radiation that will be captured by the imaging system has spectral, temporal, and spatial variation that can introduce error. Emergent radiation from a number of sources around the object can distort the image at this step.
The imaging system has multiple sources of error and uncertainty, many of which are specific to the imaging modality and field of study, that include management of noise and instrument calibration.
The system generates data that must be processed into an output image for the observer. This often involves reconstruction algorithms, general restoration (including noise reduction), and specific processing geared toward the specific observer. These processing steps may introduce information loss, artifact generation, or other error.
The observer now views the image. The observer can be human (using visual and cognitive systems to interpret the image information), algorithmic, or a combination of the two, and different observers will have varying levels of experience or training—all additional sources of uncertainty that can affect the image’s usefulness and thus the assigned FOM.
By the time the image is judged and an FOM is determined, errors have accumulated at every step of its creation, and the final image carries the sum of those errors. How each of these errors affects an observer’s ability to use the image is specific to the task: a single image may be given different FOMs by different observers performing different tasks.
Here’s the catch: If an imaging system is used by multiple observers with multiple tasks, optimizing system performance using a task-based method may not help all observers (and thus all tasks) equally. Similarly, general strategies to improve imaging modalities will not necessarily be relevant across fields. Identifying which sources of uncertainty negatively affect the FOM for a given task is the critical step to improving system performance.
The errors introduced at each step build on one another and degrade the image, but an imperfect image may still allow the observer to complete the task. Perfect images would be ideal, but optimally performing images (images with perfect FOMs) might be the more prudent goal. If the next step in this process is to reduce errors, the logical question is not only which parts of the process can I improve, but also which improvements will raise the FOM? If each source of uncertainty and error is a knob on a large control panel, which knob(s) do I tweak to get what I need?
The team’s answer to this emerging question was a vision: a new, refined approach to imaging systems in which, depending on the object being imaged and what needs to be gained from the image, various settings could be adjusted to reduce the most relevant sources of uncertainty for that task, like a control panel with various knobs available for tweaking. The system could thus allow a balancing act, shifting the amount and type of errors to optimize the performance of the imaging system for each given task.
IDR TEAM MEMBERS—GROUP B
Ali Bilgin, University of Arizona
Mark A. Griswold, Case Western Reserve University
Hamid Jafarkhani, University of California, Irvine
Thrasyvoulos N. Pappas, Northwestern University
P. Jonathon Phillips, National Institute of Standards and Technology
Joshua W. Shaevitz, Princeton University
Remy Tumbar, Cornell University
Tom Vogt, University of South Carolina
Emily White, Texas A&M
IDR TEAM SUMMARY—GROUP B
Emily White, NAKFI Science Writing Scholar, Texas A&M
(modified by IDR team from original assignment)
How can task-based assessment be achieved? What approaches, if any, are already being used?
How do we define a task, and how do we define a figure of merit (FOM) for that task? What aspects of the imaging chain should be considered in assessing task performance?
Can models be used to assess image system performance? Should we use them as such; if so, how? Are current simulation methods fast enough and sufficiently accurate to aid in performance assessment? Is it important to have assessment methods that use real rather than simulated data?
How do we put task-based assessment into practice? What are the potential challenges involved in implementation of assessment approaches?
Ideally, task-based assessment of image-system performance would include all participants in the imaging “chain”: input (the object), system (the data generation), and observer (human or algorithm). All aspects of this chain would be statistically described, and predictive models would be used to test the performance of images, assigning to each system a task-based figure of merit (FOM). These FOMs, which would often be multidimensional to capture the maximum information about system performance, would then be used to compare performance across imaging modalities, thus discovering which systems perform best at a given task. These theoretical models of the imaging chain would also be used to simulate image output from theoretical imaging systems in order to decide which hypothetical systems are most prudent to build and use.
For both simulations and real-life testing of a system, again in an ideal scheme, standardized inputs would exist to maximally inform the FOM. Databases of such standardized input ensembles, and of the gold standard output ensembles, would exist for all conceivable tasks. For non-simulated tests, easily transportable calibration samples would be validated at multiple locations and then used to evaluate new systems and any modifications to existing systems. Observer ensembles would also be used in assessing task-based performance to account for variation in user decision making, especially with human observers.
In generating the FOM, ideally, error assessment would account for the fact that not all errors are equal, unlike (and in this case possibly superior to) a receiver operating characteristic (ROC) curve, which represents all task failures as equivalent points on a curve. Grievous errors (for example, missing a large tumor in a dangerous location) would be ranked as more serious within the error assessment process. Conversely, easily identifiable errors (such as a completely jumbled image that clearly does not resemble a typical image) would be ranked as less serious by the error assessment because they are easily recognizable and would likely not lead to serious adverse outcomes. These weighted performance outputs would be task specific.
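One simple way to express unequal error severity is an average cost over task outcomes, in which each combination of truth and decision carries its own weight. The sketch below is illustrative only: the class labels, the cost values, and the two hypothetical systems are invented for the example, and choosing real weights is exactly the open question the team raises.

```python
def expected_cost(outcomes, cost):
    """Average cost over (truth, decision) pairs; `cost` maps each pair to a weight."""
    return sum(cost[(truth, decision)] for truth, decision in outcomes) / len(outcomes)

def error_rate(outcomes):
    """Unweighted fraction of wrong decisions, for comparison."""
    return sum(1 for truth, decision in outcomes if truth != decision) / len(outcomes)

# Hypothetical weights: a missed tumor is ten times as costly as a false alarm.
COST = {
    ("tumor", "tumor"): 0.0,
    ("healthy", "healthy"): 0.0,
    ("tumor", "healthy"): 10.0,  # miss
    ("healthy", "tumor"): 1.0,   # false alarm
}

# Two hypothetical systems, each making 2 errors in 10 cases.
system_a = ([("tumor", "healthy")] * 2 + [("tumor", "tumor")] * 3
            + [("healthy", "healthy")] * 5)
system_b = ([("tumor", "tumor")] * 5 + [("healthy", "tumor")] * 2
            + [("healthy", "healthy")] * 3)
```

Both systems have the same unweighted error rate, but `expected_cost(system_a, COST)` is larger than `expected_cost(system_b, COST)` because system A's errors are misses.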
The scenario above is desirable but currently impractical. Challenges of course exist that will impede the development and implementation of such performance assessment approaches.
Generating standardized inputs
Theoretical models of imaging systems are currently not sufficient to inform performance assessment. Although current models might be helpful in the development of new systems, the statistical descriptions of system components are not currently complete enough for model simulations to be fully predictive of system output, necessitating assessment approaches that use real inputs and data. However, standardized input ensembles also do not exist, and we currently lack an understanding of how many and what variety of images would be needed to best inform assessment.
Even with standardized input ensembles and datasets, ensemble optimal performance does not guarantee optimal performance on individual images. FOMs may not describe system performance as it applies to extreme cases, and these cases might in fact be the most critical—the outliers that you truly need your system to perform well on. Therefore, it remains unclear how to use an FOM to optimize a system when average performance may not correlate with performance on critical inputs. Similarly, standard input ensembles need to take into account the varying impact of different errors. As discussed above, not all errors are equal, and input sets need to include sufficient variety to allow for detailed error analysis. Unanswered, unfortunately, is the question of how to decide which errors are high versus low impact, how to weight errors based on these impacts, and whether this procedure will induce even more bias into an already noisy system.
Most and perhaps all relevant fields also lack gold standard data, and for many systems it is impractical to generate such an ensemble because the ground truth is often unknown. For example, exploratory images, such as those from many imaging endeavors in astronomy, cannot be evaluated against a “correct” answer, because the input has not yet been characterized. One problem with such images is processing; for example, removing noise from these images can be detrimental if the noise is relevant, but it is often difficult or impossible to know whether noise in these images is in fact a real and interesting phenomenon. Defining performance without knowing ground truth may be a prohibitive consideration in approaching task-based performance assessment, although some fields have robust imaging systems despite unknown inputs.
Evaluating the imaging system
Once a standardized input ensemble has been established, the next step is to determine which aspects of the imaging system are practical to consider when assessing performance. For example, the observer is likely a highly influential part of the imaging chain with respect to task performance, and uniform, reproducible performance by human observers may be unlikely. In many current imaging applications, a human is still the best available observer given current technology. Especially for tasks like medical diagnoses, this is unlikely to change in the near future. Thus, in generating an FOM for use in optimizing a system, assessment should account for the costs of retraining the human observer. Systems should be optimized for the actual observer, not an abstract ideal observer, even in theoretical imaging models. Otherwise, an important aspect of the imaging chain is ignored.
For example, when an observer performs a task successfully or with errors, is it because of some component of the imaging device (as we would assume in optimization approaches that do not account for observer biases and preferences), or is it because of biases resulting from the observer’s tacit knowledge? Or could performance even be affected by some subtle issue in the particular way we defined the task? Improving image quality can actually impede human observers trained on noisy data. The ability to feed information back into the system during optimization is thus hampered by the inaccurate assumption that an observer is acting in a reproducible way.
The expert human observer, nonetheless, is a critical, beneficial part of imaging systems because of the ability to incorporate tacit knowledge and real-time, non-image information into the task performance. However, it is impractical to use humans in the optimization process: the cost of human participation and the number of humans needed are likely prohibitive. It is also unlikely, however, that we can accurately model a human observer, because we lack the ability to account for the tacit knowledge and non-image information that exists in the human brain. Lack of ability to model good/realistic observers may thus also limit our ability to assess new techniques.
Thus, because a task exists within a larger framework, we need to optimize both modality and interpretation. We need better models to aid in optimization, and models need to account for the human observer in feeding back information. Metrics must also account for the fact that tacit knowledge complicates task-based performance assessment. Observer performance depends on more than ideal image representation, and human observers have preferences and limitations—and these may vary between even highly similar tasks. As such, it is possible that FOM-based assessment is an unrealistic approach for multiple-observer systems.
Moreover, an important aspect of the imaging chain is the translation of system data into an image output—a model of the data and thus of the object. For example, the raw data output of a modern 3-D ultrasound system would be virtually impossible for a person to visualize, yet data modeling permits real-time 3-D computer reconstructions that are easily interpreted by human observers. However, not all image outputs are ideal; some imaging systems have shortcomings in data modeling, thus limiting the capability of the image to describe what we want to know about the object. Other potential pitfalls are failures of the system to produce consistent images, the presence of artifacts in object reconstruction, or the inability to map aspects of an image to an object’s physical characteristics. Having standardized inputs or accurate forward models (computer models of the image based on the object) is of little consequence in applications with poor (or poorly understood) data modeling for image outputs. It is thus important to understand and assess the data modeling within an imaging modality in order to understand its limitations and possible design improvements. Understanding data model limitations is especially important for ensuring consistent and reliable computer processing, which is often critical for high-throughput or large data volume applications.
Defining the task
Assuming the existence of standardized input ensembles and a well-defined understanding of the imaging chain, the question remains of how to define a task. Ideally, a task would map a scientific question to a system output. However, we can approach this question from the position that there exist two classes of imaging: a task-based class and an accuracy-based class. Because one cannot predict future tasks, and it may be useful to use one image for many tasks or have the option to use an image later in the performance of other tasks, maximizing object representation (improving the accuracy of mapping the object’s physical characteristics) may be prudent. Object representation could be considered a task (the task could be defined as “maximum information gathering”), but this measure of performance is not traditionally considered task based. Nonetheless, object representation is important and relevant for the longevity and broad usefulness of an image (e.g., corroborating information from multiple imaging modalities, which was mentioned by other groups). Perhaps also important to note is that there may be scientific questions that involve image use that do not have a clearly identifiable “task.” Task-based methods thus have relevant limitations.
An additional consideration in defining a given task is that one purpose of task-based assessment is to generate information that will aid in optimization. In this case, cost and usefulness must be considered (not just performance): For a system that performs many tasks, it may be more practical to simplify and broaden tasks for greater applicability. In other words, we could optimize a system for a range of tasks, even though this would likely lead to suboptimal performance on specific tasks. Alternatively, it may be possible to include various settings within a system that could be adjusted to optimize performance on individual tasks. The best approach to practical task-based optimization is thus unclear.
In defining an FOM, it is also prudent to consider cost issues. For example, small incremental improvements in system performance might be expensive to implement, so it is important to define what level of difference between two FOMs merits upgrades or system adjustments. These infrastructure considerations also should account for the observer (as discussed above): There is a cost to human learning, and retraining observers, particularly human observers, to perform tasks using an improved system output could be prohibitively costly. Also worth considering are cultural norms within the human observer community—in addition to the observer learning curve, the “newness” of an image or system could impede implementation because of observer rejection of optimized systems and outputs. These ideas should be considered in defining the FOM and in deciding how to evaluate significant differences between two FOMs.
For task-based optimization, other issues include practicality of modifying an existing system. Reduction of dimensionality (model parameterization) may be a prudent approach to optimization in order to simplify the process of system design or modification. Modeling would also be of great use for optimization procedures, but further noise characterization, as well as determination of how noise affects “object understanding,” is necessary before modeling can be practically used. Another approach to system optimization is subset analysis, the joint optimization of parts of the imaging chain or system. Adjusting components of systems in this way makes optimization tractable in otherwise highly complex systems, although this approach may not reach a global optimum even when one exists. To perform subset analysis, however, we would need to define subtasks, which also would require performance assessment. In attempting to optimize a complex system, a challenge will be to define tasks that address specific subsystems and maximize the chances that individual optimization will result in a global optimum.
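The risk that optimizing one part at a time stalls short of the global optimum can be seen in a toy example. The sketch below alternately optimizes two hypothetical system settings against a small, invented table of FOM values; the table, the two-setting system, and the function name are assumptions for illustration, not a model of any real system.

```python
def optimize_by_subsets(fom, start, max_sweeps=10):
    """Alternately optimize setting a (rows) and setting b (columns) of an FOM table."""
    a, b = start
    for _ in range(max_sweeps):
        previous = (a, b)
        # Optimize subsystem A while holding subsystem B fixed.
        a = max(range(len(fom)), key=lambda i: fom[i][b])
        # Optimize subsystem B while holding subsystem A fixed.
        b = max(range(len(fom[a])), key=lambda j: fom[a][j])
        if (a, b) == previous:
            break  # no single-subsystem change improves the FOM
    return (a, b), fom[a][b]

# Invented FOM values for 3 x 3 combinations of two settings;
# the global optimum (0.9) sits at a=2, b=2.
FOM_TABLE = [
    [0.6, 0.5, 0.4],
    [0.5, 0.3, 0.5],
    [0.4, 0.5, 0.9],
]
```

Starting from `(0, 0)`, the procedure stays at an FOM of 0.6 because no change to a single setting helps, whereas starting from `(1, 2)` it reaches the global optimum of 0.9. The outcome depends on the starting point and on how the subsystems interact.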
Future Directions and Recommendations
Despite the challenges of designing and implementing image metrics of system performance, initial steps toward task-based assessment of performance are prudent and achievable. Individual fields can begin to identify and gather or develop standard input ensembles for widespread use. Scientists can also identify or develop gold standards for imaging and create databases, also for widespread use. These input and output ensembles could be used to assess existing imaging systems via a “round robin” approach (imaging one input ensemble using many systems).
Careful consideration can be given to understanding and assessing the data modeling of an imaging modality. Such assessment should be made with due consideration to the class of objects being evaluated and the related task. For example, optimal imaging of microscopic transparent samples likely requires a different imaging modality than searching for and recognizing faces in an airport screening system. The study of the data model will reveal its limitations and thus help in establishing avenues of research for optimization. It is also worth considering whether the limitations of the data model are so prohibitive as to merit an alternative approach. For example, synthetic aperture radar captures data but no image, yet those data are sufficient and accurate enough for a representation of the object to be computed, recognized, detected, and classified (i.e., “algorithmically understood”).
For analysis of system performance, one can devise a means of failure analysis: methods of ranking and weighting performance errors based on impact. We can begin to address potential methods of incorporating tacit knowledge into modeling and assessment. In doing this, we should also consider how we might adapt our approaches in the future to address higher-dimensional images and advanced imaging methods and also how to improve statistical descriptions of the imaging chain (including the observer) to achieve adequate modeling. It is also worthwhile to consider whether some simple models might have the capacity to adequately inform performance assessment.
One practical and achievable goal is to design observer-based systems: Imaging systems could include settings for personalized optimization guided by real-time feedback including personalized error scores. Different imaging protocols could be optimized for each observer using real-time calculation of different views, displays, and contrasts with adjustable parameters. These personalized imaging systems would thus rely on both observer-based and task-based assessment, perhaps more effectively addressing system non-idealities.
These approaches, of course, need to be examined by a wide range of diverse communities. Nonlinear systems, which have a potentially high impact on imaging capabilities, may benefit most from some initial steps toward task-based assessment because of their complexity and the current lack of sufficient methods for assessing them. Relevant areas include compressed sensing, deblurring or deconvolution, nonlocal means filtering and estimation, and spatiotemporal methods.