Human and Organizational Factors in AI Risk Management: Proceedings of a Workshop (2025)

2 Evaluation, Testing, and Oversight
Pages 10-21



From page 10...
... New York City's Local Law 144, the first legislation in the country requiring bias audits of automated hiring tools,2 was enacted, according to Givens, to check whether hiring tools have a disparate impact based on race, ethnicity, or sex. When advocates questioned why the law did not also require audits for discrimination based on other legally protected characteristics such as disability, the response was, according to Givens, that ...
From page 11...
... PANEL 1: SCOPING EVALUATIONS William Isaac, Google DeepMind and planning committee member, moderated the workshop's first panel. He began by asking the panelists how evaluations should be assessed and what trade-offs or challenges are associated with different evaluation methods.
From page 12...
... Laura Weidinger, Google DeepMind, described an empirical review of current AI evaluation practices, which revealed that evaluations predominantly focus on model-centric assessments. These evaluations typically examine technical artifacts in isolation rather than conduct comprehensive safety assessments.
From page 13...
... He inquired about the impact of factors such as evaluation demand and how organizational or public stakeholders are thinking about scoping risk. Isaac inquired about scoping methods, identification methods for risk, and validation methods regarding assumptions and ...
From page 14...
... Weidinger emphasized the importance of using multiple assessments spanning a wide range of evaluation approaches to validate findings. Agreeing with Bommasani's point about cost as a barrier, Weidinger suggested that it may be possible to find less expensive proxies for expensive assessment approaches, allowing multiple evaluations to be conducted more easily.
From page 15...
... Holstein encouraged collaborating with domain experts at early stages, such as defining what "success" would look like for an AI tool and how it might be measured, prior to tool development. Margaret Mitchell, Hugging Face, emphasized how natural language generation -- technology that generates information based on a context-aware knowledge base -- provides a valuable perspective for designing evaluations.
From page 16...
... To alleviate this concern, he advocated for randomized controlled trials observed over a period of years as a method to assess the accuracy of prediction. Heidari asked panelists to opine on the place of benchmarking in automated quantitative evaluations as well as how it should be complemented with other forms of evaluation.
From page 17...
... Building evaluations while incorporating human expertise to understand worker tasks, complex interactions between humans and AI systems, and the context in which they occur takes time and effort. Ahmad suggested public red teaming challenges to allow for crowdsourced testing, which is less resource-intensive than domain expert red teaming.
From page 18...
... Diane Staheli, White House Office of Science and Technology Policy, stated that President Biden and Vice President Harris made significant progress in advancing safe, secure, and trustworthy AI. She highlighted the Advancing Governance, Innovation, and Risk Management for Agency Use of Artificial Intelligence memorandum as a recent milestone.3 Staheli indicated a need for nonprocedural evaluation standards and presented common challenges, including a need for entities to own the process and steer the work, a need to maintain system knowledge over time to sustain evaluation quality, a need for open-source reference implementations, and the ability to determine what data are useful versus what are not.
From page 19...
... Jacob Metcalf, Data & Society, pointed to ongoing work at Data & Society with an approach called "impact assessment," which empowers communities to name the conditions of an assessment. Speaking on other ongoing work, Metcalf noted his recent publications assessing New York City Local Law 144, which prohibits organizations from using AI in employment decision making unless a bias audit is conducted and the resulting report is published.4,5 These publications highlight the strengths and weaknesses of the pioneering legislation, aiming to inform future approaches that bolster protections for individuals interacting with AI.
From page 20...
... He argued that the design of auditing institutions can be more impactful than additional evaluation efforts. Ho referenced his past work to categorize elements of audits, such as disclosure, scope of audit, independence, selection, funding, and accreditation, that can be standardized through governance.6 Staheli noted that intelligible public disclosure enables contestability, which, in turn, bolsters the trustworthiness of a system.
From page 21...
... Staheli reaffirmed the need for multidisciplinary teams of experts, including AI, human factors, and domain experts, as well as end users to ensure an AI tool is fit for purpose. Ho argued that no one-size-fits-all solution exists, as both public and private solutions have limitations and pitfalls.

