Benchmarks and Test-Beds for Cognitive Architectures

Hanks, Pollack, and Cohen discuss possible applications of benchmarks and test beds to general cognitive architectures. A benchmark is a standard task that is representative of problems occurring frequently in real domains. Its advantage is that it makes comparative analysis of performance possible: architectures are applied to the same task, and the results of each are measured against the others. Benchmarks have two drawbacks. They encourage a focus on the benchmark problem instead of the real-world task, and they may be unconsciously biased by their designers. For this reason, benchmarks should come from people who have no investment in the results of the systems applied to them. A further problem, from the standpoint of AI, is that there are really no standard tasks for AI problems. Nevertheless, several problems have been proposed as AI benchmarks because they recur so often. These include the Yale Shooting Problem and Sussman's anomaly.

Test beds are the environments in which the standard tasks may be implemented. In addition to the environment itself, a test bed provides a method for data collection, the ability to control environmental parameters, and scenario generation techniques. The purpose of a test bed is to provide metrics for evaluation (objective comparison) and to give the experimenter fine-grained control in testing agents.
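The ingredients listed above (an environment, controllable parameters, scenario generation, and data collection) can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not a test bed described by Hanks, Pollack, and Cohen: the corridor world, the class name CorridorTestBed, and the two agents are all hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical agent interface: given (position, goal), return a move of -1 or +1.
Agent = Callable[[int, int], int]


@dataclass
class CorridorTestBed:
    """Toy test bed: an agent starts at cell 0 of a corridor and must reach a goal cell."""
    length: int = 10      # environmental parameter: corridor size
    noise: float = 0.0    # environmental parameter: chance that a move is ignored
    seed: int = 0         # fixes the scenarios so every agent sees the same ones
    log: List[Dict] = field(default_factory=list)  # collected data

    def scenarios(self, n: int) -> List[int]:
        """Scenario generation: n reproducible goal positions."""
        rng = random.Random(self.seed)
        return [rng.randrange(1, self.length) for _ in range(n)]

    def run(self, name: str, agent: Agent, n: int = 20, max_steps: int = 100) -> None:
        """Run one agent on every scenario and record the outcome of each trial."""
        rng = random.Random(self.seed + 1)
        for goal in self.scenarios(n):
            pos, steps = 0, 0
            while pos != goal and steps < max_steps:
                move = agent(pos, goal)
                if rng.random() >= self.noise:  # the move takes effect
                    pos = max(0, min(self.length - 1, pos + move))
                steps += 1
            self.log.append(
                {"agent": name, "goal": goal, "steps": steps, "solved": pos == goal}
            )

    def mean_steps(self, name: str) -> float:
        """Evaluation metric: average number of steps taken by one agent."""
        rows = [r for r in self.log if r["agent"] == name]
        return sum(r["steps"] for r in rows) / len(rows)


def direct_agent(pos: int, goal: int) -> int:
    """Always step toward the goal."""
    return 1 if goal > pos else -1


def make_random_agent(seed: int) -> Agent:
    """Baseline that ignores the goal and steps left or right at random."""
    rng = random.Random(seed)
    return lambda pos, goal: rng.choice((-1, 1))


bed = CorridorTestBed(length=10, noise=0.0, seed=42)
bed.run("direct", direct_agent)
bed.run("random", make_random_agent(7))
print(bed.mean_steps("direct"), bed.mean_steps("random"))
```

Because both agents are run on the same generated scenarios, their logged results can be compared directly; changing length or noise and rerunning is the kind of fine-grained control the paragraph above refers to.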

The use of test beds -- especially in small (highly abstracted) environments -- is somewhat controversial. There is a tension between bottom-up and top-down approaches to agent design. The former, which is somewhat reductionist, seeks to create agents by defining capabilities independently of one another. Test beds provide the means for developing these agents in a piecemeal fashion, which results in an incremental theory of behavior. The top-down approach is more engineering-oriented: agents are built and their performance is then tested. For such an approach, test beds offer only partial utility, since abstracting away environmental considerations may make the agent appear more capable than it actually is (i.e., the results may not be as general as they appear).

In both of these approaches, a small problem is used as an exemplar for very large problems. Yet there are issues in using small problems to predict or validate behavior on larger ones. The most significant are scalability and generality. With respect to scalability, maintaining rationality as more knowledge and more capabilities are added may become impossible; efficiency decreases as the scale of the problem increases. With respect to generality, there may be interactions among capabilities, and between individual capabilities and the environment, that were not considered in the smaller problem. In that case the system does not generalize to larger problems.
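The scalability concern can be made concrete with a small experiment that sweeps the problem size and records the cost of solving each instance. The sketch below is a hypothetical illustration, not an experiment from the source: it uses uninformed breadth-first search on a toy bit-flipping puzzle, and the function names are invented for the example.

```python
from collections import deque
from typing import Dict, Iterable


def expansions_to_goal(n_bits: int) -> int:
    """Count the states breadth-first search expands before reaching the goal.

    Toy puzzle (an assumption for illustration): the state is n_bits bits,
    all initially 0; each action flips one bit; the goal is all bits set to 1.
    """
    start, goal = 0, (1 << n_bits) - 1
    seen = {start}
    queue = deque([start])
    expanded = 0
    while queue:
        state = queue.popleft()
        expanded += 1
        if state == goal:
            return expanded
        for bit in range(n_bits):
            nxt = state ^ (1 << bit)  # flip one bit
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    raise ValueError("goal unreachable")


def scaling_profile(sizes: Iterable[int]) -> Dict[int, int]:
    """Map each problem size to the search effort it required."""
    return {n: expansions_to_goal(n) for n in sizes}


profile = scaling_profile([2, 4, 8])
print(profile)
```

In this puzzle the goal is the single state farthest from the start, so the search expands every one of the 2^n states before finding it. A method that looks cheap at n = 2 is therefore a poor predictor of its cost at larger n, which is the kind of extrapolation failure described above.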

One way to avoid these issues is to experiment on full-scale systems. Controlled experimentation on such systems has been considered very difficult, or even impossible. However, the systematic evaluation of large-scale systems will be necessary as cognitive architectures move out of the research laboratory and into real-world applications.

