Suggested Citation:"Advances in Multimodal Artificial Intelligence to Enhance Environmental and Biomedical Data Integration: Proceedings of a Workshop - in Brief." National Academies of Sciences, Engineering, and Medicine. 2023. Advances in Multimodal Artificial Intelligence to Enhance Environmental and Biomedical Data Integration: Proceedings of a Workshop–in Brief. Washington, DC: The National Academies Press. doi: 10.17226/27202.

Proceedings of a Workshop—in Brief

Advances in Multimodal Artificial Intelligence to Enhance Environmental and Biomedical Data Integration

The convergence of artificial intelligence (AI), biotechnology, and biomedical big data holds promise to transform understanding of human health and disease. Driven by the increasing availability and ability to generate, collect, and analyze environmental and biomedical data along with advanced computing power, AI and machine learning (ML) applications are rapidly developing in research and health. To explore opportunities for leveraging emerging developments in AI and ML to advance multimodal data integration, the National Academies of Sciences, Engineering, and Medicine hosted a workshop titled Advances in Multimodal Artificial Intelligence to Enhance Environmental and Biomedical Data Integration on June 14-15, 2023. This workshop focused on recent developments in AI and other data-driven approaches to integrate biomedical and environmental health data; the exploration of promising applications in human health and disease; and the ethical, social, and policy implications and challenges of health data collection and integration.

The workshop convened experts from different sectors and disciplines including environmental health, biomedicine, data science, engineering, and policy. Speakers highlighted how integrating environmental data, biomedical data, and health information (e.g., multi-omics, environmental health data, geospatial information, electronic health records, data from wearables) can help provide a holistic understanding of complex health challenges. The use of AI/ML to integrate disparate data sources is promising, but scientific, analytical, and technical obstacles persist. Experts also discussed the ethical, social, and policy implications of AI/ML coupled with health data. For example, a biased training data set can lead to biased algorithms, inaccurate interpretations, and decision-making that may have unintended consequences. As Andrea Baccarelli (Columbia University) stated in his welcoming remarks, “everything from online searches to ChatGPT seems to be powered by AI today. And clearly it is ubiquitous, it has become very much part of our lives.”

The workshop was organized under the purview of the National Academies’ Standing Committee on the Use of Emerging Science for Environmental Health Decisions (ESEHD) and sponsored by the National Institute of Environmental Health Sciences (NIEHS). This Proceedings of a Workshop—in Brief provides the rapporteurs’ high-level summary of the discussions at the 2-day workshop, consisting of five sessions and a keynote address. Additional materials, including recordings from the workshop, are available online.¹ This proceedings highlights potential opportunities for action but should not be viewed as consensus conclusions or recommendations of the National Academies.

CONVERGENCE OF ARTIFICIAL INTELLIGENCE, ENVIRONMENTAL HEALTH, AND BIOMEDICINE

The first session set the stage for the workshop by examining the current state of research in environmental health and biomedical sciences and how AI may be used to advance scientific knowledge that benefits all. Patrick Breysse (Johns Hopkins University) laid out the current challenges in fully defining the environment and understanding how various environmental factors affect health. How can the field of environmental health access new data streams and utilize them effectively? Conversely, in systems that are collecting vast amounts of data, environmental health data are often neglected. Inclusion of data about environmental factors in current monitoring systems is crucial, he said. Breysse described a few examples of the types of data being collected for various purposes and how those data may be used to provide a broader picture of the environmental determinants of health.

Breysse discussed how satellite remote sensing offers significant opportunities for environmental health research. It can be used to track wildfires, assess water quality by detecting algal blooms, assess land use and vector ecology to understand infectious disease parameters, and estimate soil moisture. These data have implications for climate change, extreme weather events, emergency management, and response conditions. Furthermore, Breysse stated that advancements in sensor technologies can assess various environmental factors and exposures by using portable or wearable sensors. Wearables, the focus of another recent ESEHD workshop², allow individuals to contribute physiological and environmental health data that may help advance precision environmental health.

Breysse also noted that there are challenges in using the wealth of data available, including data integration and analysis for risk assessment, identification of priorities, and addressing environmental justice issues as disadvantaged communities often face greater exposures and health risks. “We need to think about how we use data, how do we address the quality issues associated with the data we collect, and how do we make decisions about it,” stated Breysse. Moving forward, he stressed the need for training and incorporating data science with environmental health.

Lucila Ohno-Machado (Yale University) focused on the potential role of AI and ML in advancing precision health and considerations for addressing health inequities. A personalized approach and predictive modeling in health could involve harmonizing and integrating multiple data modalities such as human genomes, the microbiome, environmental health data, and electronic health records (EHRs). Ohno-Machado stated, “Most importantly, we also not only characterize, but use that information to help mitigate inequities in health and healthcare.”

Ohno-Machado stated that building reliable AI-based models requires access to large and representative data sets and repositories, but she also emphasized the importance of protecting the privacy of individuals and institutions. She highlighted the National Institutes of Health (NIH) All of Us Program,³ whose goal is to build a large and diverse health data set. The diversity of research cohorts, particularly the inclusion of segments of the population that were previously left out of research studies, is important.⁴ For example, her work with the Center for Admixture Science and Technology Genomics for Everyone (CAST)⁵ aims to utilize genomics, environmental, and socioeconomic data to better understand health and disease in admixed populations.

Ohno-Machado ended her presentation with, “Our vision is that no one will be left behind, and we will increasingly replace concepts that are in use today for race and ethnicity with a combination of genetic, environmental, and social determinants of health, because each individual is different.”

__________________

¹ https://www.nationalacademies.org/event/06-14-2023/advances-in-multimodal-artificial-intelligence-to-enhance-environmental-and-biomedical-data-integration-a-workshop (accessed July 5, 2023).

² https://www.nationalacademies.org/event/06-01-2023/developing-wearable-technologies-to-advance-understanding-of-precision-environmental-health-a-workshop (accessed July 5, 2023).

³ https://allofus.nih.gov/ (accessed July 5, 2023).

⁴ Lin M, D.S. Park, N.A. Zaitlen, B.M. Henn, and C.R. Gignoux. Admixed Populations Improve Power for Variant Discovery and Portability in Genome-Wide Association Studies. Front Genet. 2021 May 24;12:673167. doi: 10.3389/fgene.2021.673167.

⁵ https://admixgenomics.ucsd.edu/ (accessed July 5, 2023).

Marzyeh Ghassemi (Massachusetts Institute of Technology) focused on designing machine learning processes for equitable health systems. She stated, “We all have a focus and a goal of creating actionable insights in human health, but to get there we really need to understand how we build models to perform well, how we decide which data is most appropriate to be used in model training, what kind of healthcare we want to represent, and what kind of behaviors we want to encourage when we couple machine learning models with human usage.” ML models learn from human-generated data and design decisions, but the current medical data landscape poses issues. Randomized controlled trial (RCT) data are often sparse and narrowly scoped, limiting their applicability to diverse populations. Ghassemi also noted that ML in health lags behind other subfields in terms of code and data sharing, as well as leveraging multiple datasets, leading to substandard reproducibility.⁶

Ghassemi described a study of how biased AI models can have serious impact when deployed in health by considering the underdiagnosis rate in different subpopulations.⁷ The study found that an AI algorithm trained on four chest x-ray data sets to classify pathology had higher underdiagnosis rates for female patients, young patients, Black patients, and patients on Medicaid insurance. Underdiagnosis by models can lead to a higher rate of no treatment for patients who need to be treated. She offered that auditing models is one solution, though not a simple one for large language models trained on human-generated data that carry human biases. AI models can also be improved by explicitly including fairness constraints and considering intent in the design in a manner that ensures fairness across intersectional groups.
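The subgroup comparison described above can be made concrete with a small audit. The sketch below is illustrative only and is not the cited study's code. It assumes each record carries a hypothetical subgroup label, a ground-truth flag for whether disease is present, and the model's prediction, and it defines the underdiagnosis rate as the share of patients with disease whom the model labeled healthy. The records are invented toy data.

```python
from collections import defaultdict

def underdiagnosis_rates(records):
    """Return {subgroup: fraction of diseased patients predicted healthy}.

    Each record is a (subgroup, has_disease, predicted_disease) tuple.
    This field layout is a hypothetical choice for the illustration.
    """
    missed = defaultdict(int)    # diseased patients the model called healthy
    diseased = defaultdict(int)  # all diseased patients in the subgroup
    for subgroup, has_disease, predicted_disease in records:
        if has_disease:
            diseased[subgroup] += 1
            if not predicted_disease:
                missed[subgroup] += 1
    return {group: missed[group] / diseased[group] for group in diseased}

# Toy records, invented for illustration only.
toy_records = [
    ("group_a", True, True),
    ("group_a", True, False),
    ("group_a", False, False),
    ("group_b", True, False),
    ("group_b", True, False),
    ("group_b", True, False),
    ("group_b", True, True),
]

rates = underdiagnosis_rates(toy_records)
# Gap between the worst-served and best-served subgroup.
gap = max(rates.values()) - min(rates.values())
```

An audit of this kind only reports the disparity. Deciding what gap is acceptable, and how to reduce it, remains a design and policy question of the sort the speakers raised.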

Ghassemi suggested that the lessons learned from other spaces in deploying technology safely, such as aviation, could be considered in the safe integration of AI in health. However, she said, there are some variables that will be unique to health that would make model evaluation and understanding the biases embedded in the data crucial.

During the panel discussion, Breysse, Ohno-Machado, and Ghassemi stressed the importance of understanding and mitigating algorithmic bias in developing various environmental and biomedical applications that integrate heterogeneous data, such as cumulative impact assessment and polygenic risk scores. They acknowledged the complexity of the problem given the huge volume and modalities of data needed and that there may not be a simple solution for all situations. For example, Ghassemi stated that in some cases integrating social determinants of health data improved the specificity of ML models in specific tasks for specific groups. In other instances, removing data such as self-reported race and social determinants may be appropriate for the task. In addition, the panelists noted that, beyond AI and data integration, a further issue is how humans interpret, use, and make decisions based on those data.

LEVERAGING AI/ML FOR ENVIRONMENTAL HEALTH AND BIOMEDICAL DATA INTEGRATION

Chirag Patel (Harvard University) discussed how the layering of modalities of data, including exposome measures, may improve predictive modeling in health studies. For example, adding metabolomic signatures of diet, or internal exposome data, to self-reported diet improved prediction of dietary behavior and the associated cardiometabolic-cardiovascular disease risk compared with traditional self-reports alone.⁸ Another study assessing environmental exposures for type 2 diabetes discovered significant and novel associations. The computed polyexposure score, a weighted sum of multiple exposures, performed better in reclassification of type 2 diabetes risk than the polygenic score, or the weighted sum of an individual’s genetic variants associated with a phenotype.⁹ Patel posited that the addition of exposures data to genetics and clinical measures can improve disease risk predictions.

__________________

⁶ McDermott M., S. Wang, N. Marinsek, R. Ranganath, L. Foschini, and M. Ghassemi. 2021. Reproducibility in machine learning for health research: Still a ways to go. Sci. Transl. Med. DOI: 10.1126/scitranslmed.abb1655.

⁷ Seyyed-Kalantari, L., H. Zhang, M. McDermott, I. Chen, and M. Ghassemi. 2021. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med 27, 2176–2182. https://doi.org/10.1038/s41591-021-01595-0.

⁸ Shah RV, Steffen LM, Nayor M, Reis JP, Jacobs DR, Allen NB, Lloyd-Jones D, Meyer K, Cole J, Piaggi P, Vasan RS, Clish CB, Murthy VL. Dietary metabolic signatures and cardiometabolic risk. Eur Heart J. 2023 Feb 14;44(7):557-569. doi: 10.1093/eurheartj/ehac446.
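Both scores in Patel's summary are weighted sums, so the arithmetic can be shown in a few lines. The sketch below is a minimal illustration, not the cited study's method. The exposure names, variant names, weights, and the individual's values are all hypothetical; in practice the weights would be estimated from cohort data.

```python
def weighted_score(values, weights):
    """Weighted sum over the features named in `weights`."""
    return sum(weights[name] * values[name] for name in weights)

# Hypothetical weights, invented for illustration only.
exposure_weights = {"smoking": 0.8, "sedentary_hours": 0.3, "air_pollution": 0.5}
variant_weights = {"variant_1": 0.2, "variant_2": -0.1}

# One hypothetical individual: exposure levels and risk-allele counts.
person = {
    "smoking": 1,
    "sedentary_hours": 2,
    "air_pollution": 1,
    "variant_1": 2,
    "variant_2": 1,
}

polyexposure_score = weighted_score(person, exposure_weights)
polygenic_score = weighted_score(person, variant_weights)
```

Because the two scores share one form, they can be compared directly or combined as inputs to a single risk model, which is the kind of layering of modalities Patel described.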

Patel outlined a few key challenges in using multimodal approaches for data integration, beginning with study design. To illustrate the complexity of the problem, Patel described a review of the nutritional epidemiology literature that examined published studies of cancer risk associated with 50 common ingredients found in a cookbook. Meta-analysis led to different and inconsistent relative risk estimates when looking at multiple parameters.¹⁰ The challenge, therefore, is ensuring that the inclusion of multiple modalities and parameters in modeling is robust. Integrating data across diverse scales such as geography, time, tissue, and experimental models is complicated and remains an active area of research.

Aidong Zhang (University of Virginia) focused on how multimodal ML approaches can improve data integration by addressing one of the most common problems seen when combining data from different modalities: missing data. Current solutions for handling missing data include imputation, but with advanced ML models, Zhang’s research looks beyond imputation to the fusion of different modalities with incomplete data.

Her approach involves constructing and representing multimodal data with incompleteness as a graph model for further processing. The data can then be mapped into an embedded representation via a graph neural network. Each data sample and its missing parts are treated as a different pattern and represented as a unique hypernode, avoiding the need to impute missing data. Links between hypernodes, called hyperedges, are established based on the similarity between different modalities. This enables coordination of the multimodal data into a heterogeneous hypernode graph, effectively managing the incompleteness without requiring imputation of missing data.¹¹
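The graph construction described above can be illustrated with a simplified sketch. This is not the published implementation: it records each sample's pattern of observed modalities and links samples that share an observed modality by nearest-neighbor similarity, with no imputation. Pairwise links stand in for hyperedges, the graph neural network step is omitted, and the sample identifiers, modality names, and feature vectors are hypothetical.

```python
import math

def missing_pattern(sample):
    """Return the set of modalities actually observed for one sample."""
    return frozenset(m for m, vec in sample.items() if vec is not None)

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_edges(samples, k=1):
    """Link each sample to its k nearest neighbours within every modality
    it has observed. Samples missing a modality are skipped for that
    modality rather than imputed."""
    edges = set()
    modalities = {m for sample in samples.values() for m in sample}
    for modality in sorted(modalities):
        observed = [sid for sid, s in samples.items()
                    if s.get(modality) is not None]
        for sid in observed:
            others = [o for o in observed if o != sid]
            others.sort(key=lambda o: euclidean(samples[sid][modality],
                                                samples[o][modality]))
            for neighbour in others[:k]:
                # Store each link once, regardless of direction.
                edges.add((modality, *sorted((sid, neighbour))))
    return edges

# Toy samples with incomplete modalities, invented for illustration only.
samples = {
    "p1": {"imaging": [0.0, 1.0], "omics": [1.0, 0.0]},
    "p2": {"imaging": [0.1, 0.9], "omics": None},
    "p3": {"imaging": None, "omics": [0.9, 0.1]},
}

patterns = {sid: missing_pattern(s) for sid, s in samples.items()}
edges = build_edges(samples, k=1)
```

In this toy graph, p2 and p3 share no observed modality, yet both connect to p1, so information can still flow between them through the graph.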

Heidi Hanson (Oak Ridge National Laboratory) highlighted a few studies at Oak Ridge National Laboratory (ORNL) to illustrate the utility of integrating data sets to advance understanding of health and disease and how to start thinking about study design to facilitate integration. One such study, conducted in partnership with the National Cancer Institute, is Modeling Outcomes Using Surveillance Data and Scalable Artificial Intelligence for Cancer (MOSSAIC).¹² MOSSAIC utilizes data such as imaging, multi-omics, and electronic health records from the Surveillance, Epidemiology, and End Results (SEER) registries with the goal of modeling cancer outcomes and the impact of diagnostics and treatments.

Hanson described one approach for integrating different data types: building foundational models that are pre-trained and can then be adapted to many types of tasks, in contrast to training one model for one specific task. The team at ORNL is currently working on developing foundational models that can be used on data such as imaging, exposome, biological, clinical text, and surveys and that combine them in ways to address the relevant questions. She also emphasized the importance of ensuring that models are reproducible, replicable, and usable for real-world applications, as well as the importance of interdisciplinary team science. Hanson stated, “when we’re talking about multimodal data, we’re talking about omics, [imaging, environmental exposures, electronic health data].”

In the panel discussion, Patel, Zhang, and Hanson expanded on developing the necessary components in multimodal data integration. Zhang noted that, in addition to missing data, the heterogeneity of the data streams also needs to be handled. Patel added that data access and data standards would help facilitate integration and evaluation. Hanson discussed the potential for increased risks in privacy and de-identification when combining data sets and emerging research in federated learning to preserve privacy. Similar to the first panel, the three speakers also expressed that while the workshop focuses on methodologies and technologies, human interpretation of the AI-based model or generated information could also be considered.

__________________

⁹ Akhtari FS, Lloyd D, Burkholder A, Tong X, House JS, Lee EY, Buse J, Schurman SH, Fargo DC, Schmitt CP, Hall J, Motsinger-Reif AA. Questionnaire-Based Polyexposure Assessment Outperforms Polygenic Scores for Classification of Type 2 Diabetes in a Multiancestry Cohort. Diabetes Care. 2023 May 1;46(5):929-937. doi: 10.2337/dc22-0295.

¹⁰ Schoenfeld JD, Ioannidis JP. Is everything we eat associated with cancer? A systematic cookbook review. Am J Clin Nutr. 2013 Jan;97(1):127-34. doi: 10.3945/ajcn.112.047142.

¹¹ Jiayi Chen and Aidong Zhang. 2020. HGMF: Heterogeneous Graph-based Fusion for Multimodal Data with Incompleteness. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ‘20). Association for Computing Machinery, New York, NY, USA, 1295–1305. https://doi.org/10.1145/3394486.3403182.

¹² https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/mossaic (accessed July 5, 2023).
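Federated learning, which Hanson mentioned as an emerging approach to preserving privacy, keeps records at each site and shares only model parameters. The sketch below assumes the widely used federated averaging scheme, which the workshop did not specify. Each hypothetical site takes one gradient step of a least-squares model on its own data, and a coordinator averages the resulting weights in proportion to site size. The site data are invented.

```python
def local_update(weights, data, learning_rate=0.1):
    """One gradient-descent step of least-squares regression on one site's
    records. `data` is a list of (features, target) pairs that never
    leaves the site."""
    n = len(data)
    grad = [0.0] * len(weights)
    for features, target in data:
        error = sum(w * x for w, x in zip(weights, features)) - target
        for i, x in enumerate(features):
            grad[i] += 2 * error * x / n
    return [w - learning_rate * g for w, g in zip(weights, grad)]

def federated_average(site_weights, site_sizes):
    """Average the sites' weight vectors, weighted by number of records."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
            for i in range(dim)]

def federated_round(global_weights, sites):
    """One round: every site updates locally, then only weights are pooled."""
    updates = [local_update(global_weights, data) for data in sites]
    sizes = [len(data) for data in sites]
    return federated_average(updates, sizes)

# Two hypothetical sites whose records follow target = 2 * feature.
sites = [
    [([1.0], 2.0), ([2.0], 4.0)],
    [([3.0], 6.0)],
]

weights = [0.0]
for _ in range(50):
    weights = federated_round(weights, sites)
```

Sharing weights instead of records reduces exposure but does not by itself guarantee privacy, which is one reason the speakers described this as an area of ongoing research.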

EMERGING METHODS FOR DATA INTEGRATION

Joyce Ho (Emory University) presented a case study on integrating novel data streams, such as social media and community forums, geographic data, and mobility data, that can provide information about social determinants of health on a hyperlocal scale. She said her work is motivated by health equity. By using these publicly available data streams, Ho’s strategy aims to capture neighborhood-level measures without additional burdens on patients or healthcare providers. Her research utilizes AI-based methods to develop novel measures of a patient’s built environment and its impact on health outcomes to enhance the traditional Census-based deprivation index.¹³

To extract relevant information from unstructured data on social media platforms, Ho’s team developed an unsupervised ML model that extracts key words and learns representations in which similar words are clustered together and differentiated from dissimilar words. Human-guided annotations that refine categories of interest help improve the model and yield better fine-grained metrics. To integrate the different data streams, they also developed a deep learning model, trained on real-world data, that learns non-linear relationships for entity matching and data integration.
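The idea of clustering similar words in a representation space can be illustrated with a toy example. This is a sketch under stated assumptions, not Ho's model: the two-dimensional word vectors are made up, whereas a real system would learn them from text, and the greedy threshold rule is one simple clustering choice among many.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cluster_words(vectors, threshold=0.8):
    """Greedy clustering: a word joins the first cluster whose seed word is
    at least `threshold` similar; otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of words; the first is the seed
    for word, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters

# Made-up vectors, for illustration only.
toy_vectors = {
    "park": [0.9, 0.1],
    "playground": [0.8, 0.2],
    "pharmacy": [0.1, 0.9],
    "clinic": [0.2, 0.8],
}

clusters = cluster_words(toy_vectors)
```

Human-guided annotation of the kind described above could then rename, merge, or split such clusters into categories of interest.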

Thomas Hartung (Johns Hopkins University) discussed the opportunities for advancing new methods in environmental health research, particularly evidence integration for toxicology, enabled by the increase in data, computing power, and AI algorithms. He highlighted existing tools in the pipeline that may help facilitate data integration at various stages. For example, with BioBricks, a single line of code can import entire public databases for use as training data. To date, approximately 50 BioBricks have been constructed and will soon be publicly available.¹⁴

Hartung also discussed how natural language processing has increased the capability to extract and mine data from published scientific literature at an unprecedented rate and volume. The current challenge is the “reading” of multimodal data such as tables and graphs. BioGPT, which is specifically trained on scientific literature, is one such tool.¹⁵ Another example is Sysrev, a semi-automated systematic review and data extraction platform.¹⁶ Hartung’s team is currently working on refining Sysrev to recognize entities such as genes and enzymes, as well as causal relationships between entities, so that toxicological data can be imported from the literature in a meaningful way. He stated that the ultimate aim in toxicology data integration is to develop predictive algorithms for exposure assessment that improve on existing methods such as animal testing.¹⁷

To close out the session, Ho and Hartung discussed privacy and security related issues, and public trust in AI models of scientific and health information. The panelists reiterated that there is ongoing research in federated learning and other privacy-preserving technologies. In speaking about barriers in data integration, Ho stated that data standards and harmonization are needed to aid data sharing, and Hartung added that encouraging data sharing through transparency is important. The panelists also addressed participant questions regarding the use of ChatGPT and other large language models for scientific research. For example, Hartung cited an analysis that claims ChatGPT can already draft a scientific paper reasonably well and that in five years the quality will be equal to that of skilled science writers. Ho cautioned that reproducibility is one of the challenges in utilizing large language models such as ChatGPT. Slight changes in prompt wording can generate varying answers, so thinking about the purpose and task is important.

__________________

¹³ Zhang J., S. Lin, Y. Wu, J. Zhang, A. Morris, S.A. Patel, and J.C. Ho. 2022. Abstract 15011: Deriving and Validating Novel Neighborhood Data for Investigation of Adverse Outcomes in Patients Hospitalized for Heart Failure: A Feasibility Study. Circulation. https://doi.org/10.1161/circ.146.suppl_1.15011.

¹⁴ https://docs.biobricks.ai/index.html (accessed July 5, 2023).

¹⁵ https://github.com/microsoft/BioGPT (accessed July 5, 2023).

¹⁶ Bozada T Jr, Borden J, Workman J, Del Cid M, Malinowski J, Luechtefeld T. Sysrev: A FAIR Platform for Data Curation and Systematic Evidence Review. Front Artif Intell. 2021 Aug 5;4:685298. doi: 10.3389/frai.2021.685298.

¹⁷ Maertens A, Golden E, Luechtefeld TH, Hoffmann S, Tsaioun K, Hartung T. Probabilistic risk assessment - the keystone for the future of toxicology. ALTEX. 2022;39(1):3-29. doi: 10.14573/altex.2201081.

EXAMPLES OF PROMISING OPPORTUNITIES FOR MULTIMODAL AI IN HEALTH

Richard Woychik (National Institute of Environmental Health Sciences) discussed the potential of multimodal AI to understand health and disease and improve health outcomes. Emerging research in this area is driven by the availability and ability to generate multi-omics data, such as genomics, proteomics, transcriptomics, and more recently exposomics data, alongside AI-based tools to be able to integrate these various data sets. In particular, Woychik said AI’s analytical capabilities could be promising for “elucidating relationships between environmental exposures and human health with a degree of resolution that I predict that we have not seen up to this point.” He stressed the importance of collaborations and fostering discussions on innovative approaches and technologies to begin to unravel the complexities of an integrative health model. Woychik also highlighted key AI and data science initiatives at NIEHS, including developing standardized language and ontologies of environmental health data¹⁸ and collaborating with other NIH Institutes and Centers to integrate data resources across the broader, biomedical enterprise.

Eric Topol (Scripps Research Institute) focused on the emergence of multimodal AI in health and medicine in his keynote address.¹⁹ He said, “the idea is taking everything known in medicine through publications, along with the different inputs of images, electronic health records, sensors, biologic data, all of those layers of data that you could take medicine to a new plateau.”

One area where he sees the impact of AI is in improving accuracy and reducing diagnostic errors; in particular, the ability of AI to interpret medical scans is “remarkable and far beyond human capability,” said Topol. A 2015 National Academy of Medicine report stated that most people will experience at least one diagnostic error in their lifetime.²⁰ Using deep learning trained and evaluated on screening mammograms for breast cancer classification, researchers found that the model was as accurate as experienced radiologists.²¹ Similarly, a recent study found that a model combining multimodal inputs (radiographs, clinical lab results, and clinical notes) to aid clinical diagnosis of pulmonary disease was more accurate than previous models that utilized only a single data stream. The multimodal model also improved the prediction of adverse outcomes.²²

Topol briefly outlined a few concerns for consideration including data bias, data security, large language models generating false information known as hallucinations, carbon emissions, validity, and safety issues in healthcare. Despite the various challenges, Topol stated that he is most excited about the promise of AI to “bring back the humanity in medicine.”

In a fireside chat, Topol and Woychik addressed challenges and opportunities related to data repositories given the intensive data requirements of multimodal AI models. Topol highlighted a few international initiatives similar to NIH’s All of Us,²³ including the UK Biobank²⁴ and the Human Phenotype Project.²⁵ He also noted efforts in Japan and other countries focusing on a framework for standardized data collection, organization, and accessibility. Woychik stated that while there are new efforts in establishing a global biodata coalition, it is important to proactively consider the design of data repositories in terms of what the world community needs, how to create them, and how to sustainably fund them. The panelists also discussed the types of data that would help advance understanding of health. Woychik emphasized that exposomic data, which capture the totality of exposures and biomarkers of their effects, are critical. Topol added that wastewater surveillance data are important for monitoring infectious disease. Both panelists agreed that dietary exposures and metabolic profiling are an exciting and emerging area of research.

__________________

¹⁸ https://www.niehs.nih.gov/research/programs/ehlc/resources/index.cfm (accessed August 7, 2023).

¹⁹ Acosta, J.N., G.J. Falcone, P. Rajpurkar, and E. Topol. Multimodal biomedical AI. Nat Med 28, 1773–1784 (2022). https://doi.org/10.1038/s41591-022-01981-2.

²⁰ National Academies of Sciences, Engineering, and Medicine. 2015. Improving Diagnosis in Health Care. Washington, DC: The National Academies Press. https://doi.org/10.17226/21794.

²¹ Wu N, Phang J, Park J, Shen Y, Huang Z, Zorin M, et al. Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening. IEEE Trans Med Imaging. 2020 Apr;39(4):1184-1194. doi: 10.1109/TMI.2019.2945514.

²² Zhou, HY., Yu, Y., Wang, C. et al. A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics. Nat. Biomed. Eng 7, 743–755 (2023). https://doi.org/10.1038/s41551-023-01045-x.

²³ https://allofus.nih.gov/ (accessed July 5, 2023).

²⁴ https://www.ukbiobank.ac.uk/ (accessed July 5, 2023).

²⁵ https://humanphenotypeproject.org/home (accessed July 5, 2023).

PERSPECTIVES ON AI AND DATA GOVERNANCE AND INFRASTRUCTURE

Susan Gregurick (NIH) opened the session by highlighting various AI and data science initiatives at NIH, such as the Helping to End Addiction Long-term Initiative (HEAL)²⁶, and the Medical Imaging and Data Resource Center.²⁷ The infrastructure used to aggregate and harness data for AI is an ongoing effort at NIH. Through partnerships with cloud service providers, NIH maintains 206 petabytes of data available to researchers as part of the Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.²⁸ The Artificial Intelligence/Machine Learning Consortium to Advance Health Equity and Researcher Diversity (AIM-AHEAD) Program focuses on increasing diversity in the workforce and data.²⁹ The program aims to address health disparities and inequities using AI by partnering with communities and providing funding, training, and infrastructure.

Gregurick said that to achieve data integration, common understandings of what research questions are being asked, what kinds of data are collected, and common data elements are crucial. She described how a current area of asthma research involves the construction of environmental and social determinants of health to be able to answer meaningful questions.

The Centers for Disease Control and Prevention’s (CDC) new Office of Public Health Data, Surveillance, and Technology³⁰ aims to modernize public health infrastructure and supports AI initiatives. Jorge Calzada (CDC) described a few AI-based applications for health surveillance utilized by the CDC. Computer vision models are leveraged to enhance screening for tuberculosis³¹ and process satellite images to identify and locate large cooling towers that may be potential sources of outbreak for Legionnaires’ disease. The goal of these tools is to help state and local partners respond faster to outbreaks.

Calzada said he believes “about 70% of healthcare organizations still transmit data via fax machines” and that data wrangling places a significant burden on local epidemiologists’ time. That burden could be offloaded onto machines, he offered. Missing data elements are also an important consideration in understanding health equity. He stated, “the brittle, inelastic data pipelines that come into the CDC is the biggest area of need for artificial intelligence, for multimodal artificial intelligence.” AI-based systems would likely benefit from being built for scalability, safety, and ethics starting at the design phase. Calzada also noted that one simple infrastructure consideration the CDC is examining is the ability to host solutions for state and local partners that may lack the resources.

Janet Haven (Data and Society) then turned to a discussion on AI governance issues and ongoing efforts to establish rules and regulations for AI systems. Recognizing AI as a democracy issue has raised concerns about discrimination and misinformation spread through algorithmic systems, even as the potential societal benefits of AI are acknowledged. Technology governance is not new, but AI poses some novel questions and challenges. Haven stated that the Biden Administration’s plan to develop a national AI strategy presents an opportunity for society to establish core values, equity, access to opportunity, and a rights-based framework to guide American AI policy. Building a research and development ecosystem that prioritizes the social impacts of AI alongside technical advancements is important. She said, “Without understanding the impact on people and on the environment, we really cannot govern AI justly or sustainably.”

__________________

²⁶ https://heal.nih.gov/ (accessed July 11, 2023).

²⁷ https://www.midrc.org/ (accessed July 11, 2023).

²⁸ https://datascience.nih.gov/strides (accessed July 11, 2023).

²⁹ https://datascience.nih.gov/artificial-intelligence/aim-ahead (accessed July 11, 2023).

³⁰ https://www.cdc.gov/ophdst/index.html (accessed August 7, 2023).

³¹ https://github.com/scotthlee/hamlet (accessed July 11, 2023).

The European Parliament recently passed a draft of the law known as the EU AI Act, which uses a risk-based framework to assess AI systems and applications.³² Haven noted that the proposed law does not regulate a technology per se but rather the use of the technology. High-risk applications are those that may harm people’s health, safety, or fundamental rights, or the environment.³³ In the US, the Office of Science and Technology Policy released the Blueprint for an AI Bill of Rights in October 2022, which is a rights-based framework for governing AI.³⁴ Haven pointed out that the blueprint is policy, not enforceable law. The blueprint includes five core protections: safe and effective systems; data privacy; protection against algorithmic discrimination; notice and explanation, which means knowing when automated systems are used and understanding their impact; and, lastly, human alternatives, consideration, and fallback.³⁵ In February 2023, President Biden issued an executive order³⁶ that includes a directive for federal agencies to address protections against algorithmic discrimination as an emerging civil rights risk.³⁷

Suzanne Dorsey (Maryland Department of the Environment) presented work on the Chesapeake Bay Restoration³⁸ as a focal point for equity, environmental health, and human health, and discussed opportunities for AI to advance environmental restoration. She described the restoration efforts as being grounded in a rigorous quantification and verification system with quantifiable outcomes and metrics. The model has evolved to an integrative perspective that is continuously updated with real-time assessment of the environment and outcomes.

Dorsey suggested that now is the time to utilize AI to support environmental and public health outcomes and inform decision-making. Enhanced and expanded environmental real-time monitoring coupled with a feedback loop is needed, she stated. “Can AI identify vulnerability, gaps, and opportunity to inform environmental actions?” said Dorsey.

The panelists also discussed the challenges of integration due to existing data silos. For example, the CDC’s centers and offices have different data models, funding, and capabilities. Calzada offered a few approaches to connect disparate data sources, such as a centralized analytics platform or semantic integration, in which the data are represented as an ontology and connected through a graph database structure that can be queried. Gregurick echoed that NIH faces similar problems with its 27 institutes and centers. One approach is creating federated data infrastructure and harmonizing common data elements.
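To make the idea of semantic integration more concrete, the following is a minimal sketch, not a description of any CDC or NIH system. It assumes that records from separate sources have already been mapped to shared concepts and stored as (subject, predicate, object) triples, which is one common way to represent a graph. Every identifier, predicate name, and record below is hypothetical and chosen only for illustration; the cooling tower scenario echoes the surveillance example mentioned earlier.

```python
from collections import defaultdict

# Hypothetical triples drawn from two imagined sources: a case registry
# and a facility inventory. None of these records are real.
TRIPLES = [
    ("case:001", "diagnosed_with", "condition:legionellosis"),
    ("case:001", "resides_in", "county:A"),
    ("case:002", "diagnosed_with", "condition:legionellosis"),
    ("case:002", "resides_in", "county:B"),
    ("tower:17", "is_a", "facility:cooling_tower"),
    ("tower:17", "located_in", "county:A"),
]


def build_index(triples):
    """Index subjects by (predicate, object) so the graph can be queried."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        index[(predicate, obj)].add(subject)
    return index


def towers_near_cases(triples, condition):
    """For each case with `condition`, list cooling towers in the same county."""
    index = build_index(triples)
    county_of = {s: o for s, p, o in triples if p == "resides_in"}
    all_towers = index[("is_a", "facility:cooling_tower")]
    result = {}
    for case in sorted(index[("diagnosed_with", condition)]):
        county = county_of.get(case)
        # Join the two sources through the shared county concept.
        result[case] = sorted(index[("located_in", county)] & all_towers)
    return result


links = towers_near_cases(TRIPLES, "condition:legionellosis")
print(links)
```

The join works only because both sources refer to the same county identifiers, which is the role a shared ontology or set of common data elements plays. A production system would more likely use a dedicated graph database and query language than in-memory Python sets.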

TECHNOLOGIES AND TOOLS TO ADVANCE ENVIRONMENTAL HEALTH AND BIOMEDICAL RESEARCH

The recognition that massive amounts of data are needed to train AI models was a common theme woven throughout the workshop discussions. The final session of the workshop explored technologies and tools to generate or capture different types of health data.

Lorenzo Hankla (Department of Defense) highlighted work within the Chemical and Biological Defense Program³⁹ that aims to protect warfighters from hazardous threats and environmental exposures. He is particularly interested in wearables and noted the increased investment in wearables due to the COVID-19 pandemic. The goal is to be able to design and deploy inexpensive wearables coupled with an AI-based model to process and analyze data to inform decision-making. A key challenge for Hankla involves building the capability to transmit data from wearables and sensors to an analytics platform and ultimately deriving meaningful information at an individual and population level. This is a particularly complicated issue due to personnel working in a variety of locations that may lack access to technical infrastructures for data transmission.

__________________

³² https://artificialintelligenceact.eu/ (accessed July 17, 2023).

³³ Ibid.

³⁴ https://www.whitehouse.gov/ostp/ai-bill-of-rights/ (accessed July 17, 2023).

³⁵ Ibid.

³⁶ https://www.whitehouse.gov/briefing-room/presidential-actions/2023/02/16/executive-order-on-further-advancing-racial-equity-and-support-for-underserved-communities-through-the-federal-government/ (accessed August 7, 2023).

³⁷ https://www.whitehouse.gov/briefing-room/statements-releases/2023/02/16/fact-sheet-president-biden-signs-executive-order-to-strengthen-racial-equity-and-support-for-underserved-communities-across-the-federal-government/ (accessed July 17, 2023).

³⁸ https://mde.maryland.gov/programs/Marylander/Pages/action_plan_for_restoration.aspx (accessed July 17, 2023).

³⁹ https://www.jpeocbrnd.osd.mil/ (accessed August 7, 2023).

Akane Sano (Rice University) described her work in designing a personalized feedback system that combines the capture of multimodal data through wearables and sensors, an interpretation module, and a feedback module that can provide actionable information for the user. For example, Sano’s team is currently conducting a study to evaluate sleep patterns and predict health and well-being for shift workers such as doctors and nurses.⁴⁰ Participants’ physical activity, sleep, and heart rate are monitored with wearables and an ML model provides well-being predictions and suggestions for sleep improvement based on cognitive behavioral therapy for insomnia. Medical doctors also review the data and intervention suggestions.

Sano focused on incorporating three concepts in her approach: designing a fair and equitable ML model for diverse groups, creating robust and interpretable models with limited labeled data, and deploying multimodal models. In addition to diversifying the data sets, generalizable bias mitigation methods may help ensure that models are accurate and applicable for different groups of people.⁴¹ Labeled data are expensive and time-consuming to obtain, especially given the large amounts of continuous data from wearables. Sano described contrastive learning, a technique that leverages unlabeled data by training the model to differentiate between similar and dissimilar data.⁴² Generating multimodal inputs and deploying models in the real world can be difficult when factoring in user burden. To reduce the number and size of devices, cost, and energy consumption, Sano designed a More to Less framework for a model that requires fewer inputs while still preserving performance.⁴³
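As a rough illustration of how contrastive learning can learn from unlabeled data, the sketch below computes a generic InfoNCE-style loss over pairs of embeddings. It is an assumption-laden toy and does not reproduce the LEAVES method cited above: the function names, the temperature value, and the two-dimensional embeddings are all hypothetical, and vectors are assumed to be non-zero. Each anchor is paired with one "similar" view (for example, an augmented copy of the same sensor window), and the other views in the batch serve as "dissimilar" examples.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss.

    positives[i] is the similar view of anchors[i]; every other row of
    `positives` is treated as a dissimilar example for that anchor.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical embeddings: in `aligned`, each anchor sits close to its own
# similar view; in `shuffled`, the pairing is deliberately swapped.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce(anchors, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

The loss is low when each anchor is closer to its own similar view than to the other views, so minimizing it pushes an encoder toward that arrangement without requiring any labels. In practice the embeddings would come from a trainable network and the loss would be minimized with a deep learning framework.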

Gengchen Mai (University of Georgia) delved into more detail on AI foundation models, particularly for geospatial and health tasks. Examples of foundation models, which are large, task-agnostic models that are pre-trained and can be adapted, include OpenAI’s ChatGPT⁴⁴ and DALL-E.⁴⁵ He evaluated various foundation models on established geospatial semantic tasks, that is, how well they recognize large-scale toponyms (place names) and location descriptions compared with fully supervised task-specific models. Foundation models consistently outperformed fully supervised models for both toponym and description recognition.⁴⁶ Applying foundation models to health geography tasks, Mai utilized ChatGPT to do time-series forecasting on health records, in this case forecasting dementia statistics by US county given historical dementia statistics. Similar to the geospatial tasks, ChatGPT outperformed fully supervised models.⁴⁷

Mai also enumerated a few unique challenges and limitations of foundation models for geospatial tasks, including handling geo-coordinates and performing implicit spatial reasoning grounded in the real world. This is relevant for environmental health research that utilizes geospatial data because geospatial data are themselves multimodal, encompassing geospatial vector data, remote sensing images, StreetView images, geo-tagged data, and geographic knowledge graphs. Mai proposed that multimodal foundation models for geospatial data are needed.

Nicholas Skaff (CDC) provided an overview of the Environmental Public Health Tracking Program⁴⁸ and the wealth of data available through the program that may be used to train AI models. The CDC program is a comprehensive tracking initiative aimed at connecting environmental information with health data, covering various environmental hazards, human exposures, health effects, and population characteristics. To ensure consistency and harmonization of data across different locations, the program created Nationally Consistent Data and Measures (NCDMs) for data such as hospitalizations, emergency department visits, birth defects, drinking water quality, and radon testing, among other health information. The Interactive Data Explorer contains more than 700 different environmental and health data measures and incorporates visualization tools.⁴⁹

__________________

⁴⁰ Yu, H., A. Itoh, R. Sakamoto, M. Shimaoka, and A. Sano. 2021. Forecasting health and wellbeing for shift workers using job-role based deep neural network. https://doi.org/10.48550/arXiv.2106.12081.

⁴¹ Zanna, K., K. Sridhar, H. Yu, and A. Sano. 2022. Bias Reducing Multitask Learning on Mental Health Prediction. Affective Computing & Intelligent Interaction (ACII 2022). https://doi.org/10.48550/arXiv.2208.0362.

⁴² Yu, H., H. Yang, and A. Sano. 2022. LEAVES: Learning Views for Time-Series Data in Contrastive Learning. https://doi.org/10.48550/arXiv.2210.07340.

⁴³ Yang, H., H. Yang, K. Sridhar, T. Vaessen, I. Myin-Germeys, and A. Sano. 2022. More to Less (M2L): Enhanced Health Recognition in the Wild with Reduced Modality of Wearable Sensors [EMBC 2022]. https://github.com/comp-well-org/More2Less.

⁴⁴ https://openai.com/chatgpt (accessed August 7, 2023).

⁴⁵ https://openai.com/dall-e-2 (accessed August 7, 2023).

⁴⁶ Mai, G., C. Cundy, K. Choi, Y. Hu, N. Lao, and S. Ermon. 2022. Towards a foundation model for geospatial artificial intelligence. In Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL ‘22). Association for Computing Machinery. https://doi.org/10.1145/3557915.3561043.

⁴⁷ Mai, G., W. Huang, J. Sun, S. Song, D. Mishra, N. Liu, S. Gao et al. 2023. On the opportunities and challenges of foundation models for geospatial artificial intelligence. arXiv preprint arXiv:2304.06798.

⁴⁸ https://www.cdc.gov/nceh/tracking/index.html (accessed July 11, 2023).

REFLECTIONS

Carmen Marsit (Emory University), the chair of the planning committee, provided a brief summary of the presentations and discussions over the course of the workshop. His closing remarks included examples of key themes that emerged as panelists and participants considered the promises of AI and multimodal data integration alongside the associated challenges. Marsit stated, “We have seen in the last 2 days that AI is not only an opportunity for environmental health but is actively being utilized and holds incredible potential. Of course, the continued expansion of AI in environmental health and biomedical research will not be without challenges and bumps in the road, but we are now at the point where we can proactively work to improve data, develop clear and just governance structures, work towards reducing biases, ensure broad and equitable distribution, and training a diverse new workforce in these tools and technologies.”

__________________

⁴⁹ https://ephtracking.cdc.gov/DataExplorer/ (accessed July 11, 2023).

DISCLAIMER This Proceedings of a Workshop—in Brief was prepared by Lyly Luhachack and Natalie Armstrong as a factual summary of what occurred at the workshop. The statements made are those of the rapporteur(s) or individual workshop participants and do not necessarily represent the views of all workshop participants; the planning committee; or the National Academies of Sciences, Engineering, and Medicine.

WORKSHOP ORGANIZING COMMITTEE This workshop was organized by the following experts: Carmen Marsit (Workshop Chair), Emory University; Yao-Yi Chiang, University of Minnesota; Christopher Duncan, National Institute of Environmental Health Sciences; Anindita Dutta, University of Chicago; Megan Latshaw, Johns Hopkins University; Gwen Ottinger, Drexel University.

ABOUT THE STANDING COMMITTEE ON THE USE OF EMERGING SCIENCE FOR ENVIRONMENTAL HEALTH DECISIONS The National Academies’ Standing Committee on the Use of Emerging Science for Environmental Health Decisions (ESEHD) examines and discusses issues on the use of new science, tools, and research methodologies for environmental health decisions. The ESEHD is organized under the auspices of the Board on Life Sciences and the Board on Environmental Studies and Toxicology of the National Academies of Sciences, Engineering, and Medicine, and sponsored by the National Institute of Environmental Health Sciences.

REVIEWERS To ensure that it meets institutional standards for quality and objectivity, this Proceedings of a Workshop—in Brief was reviewed by Anindita Dutta, University of Chicago, and Gengchen Mai, University of Georgia. We also thank staff member Daniel Talmage for reading and providing helpful comments on this manuscript.

SPONSOR This workshop was supported by the National Institute of Environmental Health Sciences (Contract No. HHSN263201800029I/Task Order No. HHSN26300003).

SUGGESTED CITATION National Academies of Sciences, Engineering, and Medicine. 2023. Advances in Multimodal Artificial Intelligence to Enhance Environmental and Biomedical Data Integration: Proceedings of a Workshop—in Brief. Washington, DC: The National Academies Press. http://doi.org/10.17226/27202.

Division on Earth and Life Studies