8
Office of Data and Informatics
INTRODUCTION
The Material Measurement Laboratory (MML) established the Office of Data and Informatics (ODI) in 2014, shortly before the last National Academies assessment that same year. As a service-oriented organization, it focuses on creating a modern environment for data and informatics for the MML, as well as for its customers. The mission of the ODI is “to build the infrastructure for next-generation data science tools and the management of complex data sets needed to support scientific innovation and advance open data concepts.”1 The ODI serves several functions. It supports national needs, such as the Materials Genome Initiative (MGI) and biological and chemical data integration; modernizes current NIST reference data services for use in state-of-the-art computing paradigms (e.g., virtual computing, parallel analysis, interoperability, and the semantic web); and develops next-generation NIST reference data services. The ODI also facilitates MML’s adherence to the government’s open-data policy by providing guidance and assistance in the best practices for archiving and annotating research and data outputs. It also builds, concentrates, integrates, and coordinates capabilities needed to meet data challenges and leverage data-driven research opportunities (including Big Data and data.gov), particularly those that relate to the biological, chemical, and materials science communities within MML and, as the ODI grows, for all of NIST.2
The 2014 National Academies assessment recommended that the MML provide resources to the ODI as rapidly as possible. The MML management did an outstanding job adopting this recommendation and established data science and data management capabilities as one of the five goals for the MML. The ODI started with 5 dedicated staff members and limited dedicated funds; it now has 22 staff members and a $3 million budget. It is organized into two groups: the Data Services Group and the Data Sciences Group. The ODI is currently focused on four interrelated functional areas: modernization and curation of standard reference data (SRD); research data preservation and dissemination; in-house consultation services on informatics and analytics methods and tools; and open data/open science community engagement.
___________________
1 NIST Material Measurement Laboratory, “Material Measurement Laboratory Strategic Plan,” https://mmlstrategy.nist.gov, accessed September 25, 2017.
2 NIST Material Measurement Laboratory, “Office of Data Informatics,” https://www.nist.gov/mml/odi, accessed October 11, 2017.
ASSESSMENT OF TECHNICAL PROGRAMS
Accomplishments
The National Institute of Standards and Technology (NIST) SRD program has been a successful example of data generation, curation, and distribution in the chemical, biological, and materials sciences. The ODI has contributed to SRD delivery by modernizing the existing NIST reference data services, providing community consultation and support, and participating in the global open data science community. The ODI has provided guidance and assistance to the MML for archiving research data in order to meet government-mandated rules and regulations on open data. While some tools are MML-specific, others will be available to the broader scientific community. For example, the ODI has already completed 10 percent of the SRD modernization (with another 30 percent expected by the end of 2017) to improve user interfaces for web-based SRDs. These improvements are intended to make web-based SRDs work on all computer platforms, implement more consistent functionality, and add application programming interfaces (APIs). The ODI has also initiated internal reviews and solicited customer feedback on the entire SRD and special databases portfolio to set priorities for updating the remaining SRDs and eliminating obsolete products or products that can no longer be properly supported.
Additionally, the ODI has engaged the Department of Commerce (DOC) Data Service Team for an SRD impact study, which is due in September 2018. It has also brought in Socrata to help build user interfaces and APIs, and it is making those available to research staff. This will enable the staff to make NIST reference data sets easier to search and download, both internally and externally, via the Internet. It has funded seven SRD enhancement projects for science groups with one-year seed money, with delivery expected in the summer of 2017.
The ODI developed a NIST-wide tool with the Office of Information Systems Management (OISM), called the Management of Institutional Data Assets (MIDAS), to simplify compliance. It captures key metadata and offers exports in standard formats, such as the digital object identifier (DOI), uniform resource locator (URL), and enterprise data inventory record.
MIDAS supports automatic preservation to a NIST data repository that runs on NIST’s Amazon cloud enclave. The ODI is also leading the effort in developing the NIST Data Science Portal for the dissemination of NIST’s public data. Last, the ODI has been a representative for NIST in the International Chemical Identifier (InChI) Trust. The ODI established a consulting group for in-house consultation services on informatics and analytics methods and tools. In the areas of open data and open science community engagement, the ODI is an active contributor to many standards bodies and consortia and is starting to assume a leadership position in the scientific data community.
Opportunities and Challenges
Considering that the ODI was established only 3 years ago, it has made impressive progress. Since its inception, the ODI’s technical activities have focused more on data than on informatics. That is, its initial focus has been on implementing tools for producing curated, discoverable data sets and ensuring their preservation. Now that the ODI has succeeded in improving data access, it is time to work on tools that make the data more useful and easier to analyze. The MML needs to prepare a specific 3-year plan (roadmap) for the ODI to develop information and analytics tools. A roadmap will show management the importance of increasing the ODI’s resources to enable informatics work and will allow the ODI to solicit feedback from its customers on its plans, and a 3-year plan will enable the MML to show how the ODI will ramp up. In years 2 and 3, the ODI can obtain feedback from its customers that will enable it to do longer-term planning.
More data sets and increasingly large ones are available in many science and engineering fields. Although data science has been around for quite some time, the tools and processes to learn from such scientific research data sets leave much to be desired. For example, there is a great need to enhance and optimize the usability, discoverability, and interoperability of data. These needs create enormous opportunities for the ODI to establish NIST as a leader in data science and machine learning in biology, chemistry, and materials science. For example, one of the main projects in which the ODI is engaged is the Materials Genome Initiative (MGI) with the Materials Science and Engineering Division (MSED). Other projects and activities within the MML are also producing large amounts of data from which useful information could be extracted for learning. The ODI is in a great position to develop data informatics tools for manipulating complex data, as well as for mining that data to produce knowledge.
There is a lot of data and informatics activity in the majority of divisions within the MML (MSED, MMSD, CSD, BMD, and the ACMD). Right now, it appears that the MSED and MMSD have the most intimate engagement with the ODI through the MGI. It is a challenge to coordinate activities among different divisions within the MML to avoid duplicate efforts and different formats for the same information. On the other hand, coordinating and engaging with other data science efforts within the MML represents an opportunity for the ODI to learn the best practices and tools used by different divisions in handling data problems, and to share what they learn with all MML divisions.
The ODI is modernizing the delivery and e-commerce of SRD products to allow more automated access for customers. The experiences and knowledge learned through the modernization and curation of the SRD could be used to update other NIST products, such as the NIST Chemistry WebBook, which has been heavily utilized but still lacks an API that would make it easier to search and download, both programmatically and via a web browser.
In response to the federal policy mandate on open data, the ODI is helping to develop data management plans (DMPs) for the MML and building an infrastructure for data preservation and curation. The ODI could become directly involved in other major NIST programs, such as the Manufacturing Extension Partnership (MEP), to engage the internal and external customers for data challenges, including data plans, data preservation, and curation.
The ODI has made important progress in the modernization of SRDs for easy access and delivery. As soon as SRD projects are under control, the ODI could start developing a repertoire of analytics tools.
PORTFOLIO OF SCIENTIFIC EXPERTISE
Accomplishments
The ODI has assembled a strong team with complementary expertise in scientific data management, materials science, informatics, and analytics. In particular, the director of the ODI is a recognized authority in scientific data management for astronomy, which is a poster child for state-of-the-art scientific data management. He brought two members of his team from the Virtual Observatory at Johns Hopkins.
Opportunities and Challenges
The ODI currently has only two people available to develop and support the data informatics effort, while there are extensive needs for such efforts in the MML. The completion of SRD modernization will free up some staff for this activity. However, even more staff members will be needed to devote their efforts to informatics.
While the data infrastructure is being built, the logical next step is to fully utilize the infrastructure for data-driven science. Therefore, the ODI needs to ramp up its effort to build the data informatics and analytics capabilities to handle increasingly complex and data-driven research challenges. There are many data and informatics activities among the different divisions. To address these research challenges, the ODI also needs to take advantage of the informatics, analytics, and statistics expertise in these other groups at NIST.
ADEQUACY OF FACILITIES, EQUIPMENT, AND HUMAN RESOURCES
Opportunities and Challenges
Data preservation is useless unless there is enough associated metadata to enable interpretation of the data years later. The same issue applies to publishing data to satisfy the open data mandate. However, it is tedious to capture and record that metadata manually. To avoid that manual effort and to free up scientists to do science, it would be beneficial to have an automated system that captures the metadata from instruments and associates it with the data. Therefore, to facilitate data preservation and meet the open data mandate required for government agencies, the ODI needs to acquire or develop a Laboratory Information Management System (LIMS). Such a system could pull metadata along with data from instruments, which would make such data easy to transform into publishable data.
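The idea of automatically associating instrument metadata with data can be illustrated with a minimal sketch. It is not a description of any NIST system or of an actual LIMS product. The function name `capture_metadata`, the field names, and the ".meta.json" sidecar convention are all hypothetical, and the instrument settings are assumed to arrive as a simple dictionary.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def capture_metadata(data_path, instrument_settings, operator):
    """Write a JSON sidecar file describing one instrument output file.

    All field names here are illustrative assumptions, not a standard schema.
    """
    data_path = Path(data_path)
    payload = data_path.read_bytes()
    record = {
        "file_name": data_path.name,
        "size_bytes": len(payload),
        # The checksum ties the metadata record to this exact file content.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "operator": operator,
        # Settings reported by the instrument (hypothetical structure).
        "instrument": dict(instrument_settings),
    }
    # Store the record next to the data file so the two travel together.
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar
```

In such a scheme, the acquisition software would call the function once per output file, so that the checksum, timestamp, and instrument settings are recorded at acquisition time rather than reconstructed by hand later.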
The ODI also needs to be provided with sufficient office space for staff, and with the support necessary to modernize its computing facility.
DISSEMINATION OF OUTPUTS
Accomplishments
The ODI is on the verge of delivering some of its first major products. For example, it has developed initial capabilities for data preservation and dissemination in the Open Access to Research (OAR) project. As part of the NIST MGI effort, it is operating the Material Measurement Laboratory Repository Server,3 which is a repository for data sharing. It is also developing the NIST Materials Resource Registry, a federated network of catalogs containing information about materials science resources, which will be delivered in the summer of 2017. Furthermore, in collaboration with the Information Technology Laboratory (ITL), it has deployed and released several versions of the Materials Data Curation System (MDCS) for capturing, sharing, and structuring materials data.
Opportunities and Challenges
The ODI needs to invest time in standards activities and consortia to establish NIST as a leader in scientific data management.
___________________
3 For the repository, go to “Material Measurement Laboratory Repository Server,” updated October 29, 2013, at http://materialsdata.nist.gov/.