National Academies Press: OpenBook

Designing the Archive for SHRP 2 Reliability and Reliability-Related Data (2014)

Chapter: Chapter 9 - Notes on Operations and Maintenance of the Archive

Suggested Citation:"Chapter 9 - Notes on Operations and Maintenance of the Archive." Transportation Research Board. 2014. Designing the Archive for SHRP 2 Reliability and Reliability-Related Data. Washington, DC: The National Academies Press. doi: 10.17226/22281.

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

The design and operation of the Archive system depend not only on the requirements driven by the users but also on financial, technical, and policy-related constraints. Although the L13 report attempted to shed light on those issues (e.g., life-cycle costs, archiving approaches), key strategic questions needed to be revisited and discussed for the L13A project because the requirements had evolved since the inception of the project.

In the L13A project, the team addressed various issues crucial to the design, operations, and maintenance of the Archive system by developing white papers, which were discussed among the members of TETG and the SHRP 2 team. The papers were structured to present solution alternatives and to obtain stakeholders’ feedback. They reflected only the project team’s perspectives at the time of development and were designed to trigger internal discussions at the management level.

The white papers were the basis for some of the key conclusions on the design and operation of the Archive. However, some of the final decisions made on the basis of the papers did not exactly follow the suggestions provided in the papers because of the evolving nature of the project.

This chapter summarizes the project team’s assessment of various issues and the final conclusions made on the basis of the papers. It also includes concerns that were raised in the white papers to draw the SHRP 2 management team’s attention to the Archive’s key operations and maintenance risks. The topics that the team investigated in the white papers are as follows:

1. Inclusion of user-submitted data in the Archive;
2. Operations and maintenance; and
3. Data ownership and personally identifiable information.
9.1 Inclusion of User-Submitted Data

One of the outcomes of the June 4, 2012, L13A workshop was the subject matter expert (SME) panel’s suggestion to add a feature that allows Archive users to submit their processed/transformed data sets and objects, derived from the original archived data, back into the Archive. SHRP 2 staff believed that feeding back user-generated products was aligned with the SHRP 2 strategic goals, so they were very interested in this idea. The project team investigated the implications of implementing this feature in the Archive system. The results are provided below.

9.1.1 What Are the Submission Scenarios?

The team proposes three user-submission scenarios. Note that any artifact submitted via any scenario is grouped as a user-submitted artifact.

9.1.1.1 Scenario 1

All users can upload flat files only:

• Users could submit only flat files (file size restriction would apply). The system would treat the submitted object as a binary large object (BLOB).
• The metadata requirements would be minimal. As a result, the submission process would be quick and short.
• The administrator would need to validate the submitted file to make sure that it was not corrupted or infected, but no preprocessing step would be required.
• Users would be able to submit their objects under the community pages.

9.1.1.2 Scenario 2

All users can upload any files with no file type limit:

• The ingestion process would be similar to the one for submitting the SHRP 2 Reliability digital objects. Like any other Archive objects, user-submitted objects would need to be validated and preprocessed by the administrator and/or the user.

• Users would be able to submit any digital object that is accepted by the Archive system.
• Users would be able to submit their digital objects from the project pages, the data set pages, and the community pages.
• If the submitted object is a sensor data object, the user would be able to submit two types of data sets: the original file (in .csv format) that includes data extracted from various sensors/segments or a set of sensor-level/segment-level data sets (in .csv or .xls format) in which each set represents data collected from a single sensor/segment.

9.1.1.3 Scenario 3

Trusted users can upload any files with no file type limit:

• In terms of file upload constraints, this scenario is similar to Scenario 2. The only difference is that only a trusted group of users (in addition to PIs) could upload artifacts.

At the time of writing the white paper, this scenario was not discussed as an option. It was added later after in-depth discussion with the SHRP 2 and FHWA teams.

9.1.2 Comparison

Table 9.1 compares the three scenarios in terms of major functionality provided by the Archive system. This functionality includes list search, map search, full download, subset download, visualization, and collaboration. Based on the table, Scenarios 2 and 3 would be able to support all of the functionality that is envisioned for SHRP 2 Reliability data objects.

Table 9.2 compares the three scenarios based on various elements that are important to the development and operation of the system. These factors are categorized under five groups: strategic alignment, cost, technology, administration, and risk to project and system.

9.1.3 Conclusion

In general, the project team concluded that adding the user-submitted data feature was technically feasible. From the project team’s point of view, Scenarios 2 and 3 were more appealing because

• They provide all of the envisioned functionality for the SHRP 2 archived data (see Table 9.1).
• They use the submission system/procedure that the PIs use to submit SHRP 2 objects. Therefore, coding efforts would be minimal.

As a result, the team added a new artifact category, “user-submitted,” to the system. A feature was also implemented to enable users to report artifacts as “inappropriate.” The goal was to help the administrator identify irrelevant artifacts.

The issue of PII was the biggest hurdle, and it prevented Scenario 2 from being made available (see Section 9.3 for more information). At the moment, the cost of employing a thorough monitoring process to prevent users from submitting PII data is too high for SHRP 2. Therefore, per the SHRP 2 team’s request, the project team implemented only Scenario 3, in which only a trusted group of users, namely SHRP 2 contractors, can upload artifacts for the time being.

Lastly, the team believes the adverse implications of needing excessive storage space to host user-submitted data are not significant when compared with the benefits, as long as users submit valuable artifacts to the Archive. As a result, the team proposes an interim solution in which the operating entity creates a small group of trusted members. This group can leverage the already-developed ingestion functionality to submit external Reliability-related artifacts into the Archive.

9.2 Key Issues Associated with Operations and Maintenance

The core objective of this section is to review the various alternatives, as well as their implications, for the operations and maintenance (O&M) of the Archive system. This section addresses the following questions:

• Who is going to operate and maintain the Archive?
• What are system O&M requirements?
• What are hosting options for the SHRP 2 L13A system O&M phase?
• What are the O&M costs?

Table 9.1. Supported Functionality for User-Submission Scenarios

| Feature | Scenario 1 (Users may submit flat files only.) | Scenario 2 (All users may upload any files with no file type limit.) | Scenario 3 (Selected users may upload any files with no file type limit.) |
|---|---|---|---|
| List search | Yes | Yes | Yes |
| Map search | Yes (a) | Yes (b) | Yes (b) |
| Full download | Yes | Yes | Yes |
| Subset download | No | Yes (b) | Yes (b) |
| Visualization | No | Yes (b) | Yes (b) |
| Collaboration | Yes | Yes | Yes |

(a) No sensor location.
(b) Preprocessing of the data set is needed to leverage the feature.
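The submission scenarios of Section 9.1.1 and the feature support in Table 9.1 can be restated as a small lookup. The Python below is an illustrative sketch only and is not taken from the Archive's code: the names `FEATURE_SUPPORT`, `supports`, and `may_upload` are hypothetical, the blank Scenario 1 cells in Table 9.1 are read as "not supported," and the file size restriction mentioned for Scenario 1 is not modeled because the report gives no limit.

```python
# Illustrative only: scenario numbers follow Section 9.1.1; all names are hypothetical.

# Table 9.1 as data. Each tuple gives support under Scenarios 1, 2, and 3:
#   "yes"        = supported
#   "limited"    = supported, but without sensor location (footnote a)
#   "preprocess" = supported once the data set is preprocessed (footnote b)
#   "no"         = not supported (assumed reading of a blank cell)
FEATURE_SUPPORT = {
    "list search":     ("yes", "yes", "yes"),
    "map search":      ("limited", "preprocess", "preprocess"),
    "full download":   ("yes", "yes", "yes"),
    "subset download": ("no", "preprocess", "preprocess"),
    "visualization":   ("no", "preprocess", "preprocess"),
    "collaboration":   ("yes", "yes", "yes"),
}


def supports(feature: str, scenario: int) -> bool:
    """Return True if the feature is available in any form under the scenario."""
    if scenario not in (1, 2, 3):
        raise ValueError(f"unknown scenario: {scenario}")
    return FEATURE_SUPPORT[feature][scenario - 1] != "no"


def may_upload(scenario: int, *, trusted: bool, flat_file: bool) -> bool:
    """Apply the upload rule of each scenario (file size limits are not modeled)."""
    if scenario == 1:   # all users, flat files only
        return flat_file
    if scenario == 2:   # all users, any file type
        return True
    if scenario == 3:   # trusted users only, any file type
        return trusted
    raise ValueError(f"unknown scenario: {scenario}")


if __name__ == "__main__":
    for feature in FEATURE_SUPPORT:
        print(feature, [supports(feature, s) for s in (1, 2, 3)])
```

Keeping the table as data rather than as branching logic makes it easy to compare against the printed table if the scenario definitions change.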

9.2.1 Who Is Going to Operate and Maintain the Archive?

Answering this question is beyond the scope of the L13A project, which is only concerned with archiving data from Reliability-related research and development projects. Future operation and maintenance of the Archive is an implementation issue for others to determine, a topic that has already received substantial discussion.

9.2.2 What Are the System and Operations Requirements?

The Archive is currently designed for 99% availability, using routine backup and recovery systems and processes. This guarantees the availability of web pages. In case of disaster recovery, accessing the artifacts (especially data sets) may take longer. All options for the continued operations and maintenance of the system assume the same availability requirements and an operational methodology that sustains the system over the O&M term. The operational methodology includes quarterly updates to the application software and the supporting database software running the Archive, as well as bug fixes for the existing functionality, if issues are found. The following requirements have been used to design the options described below for the L13A O&M phase.

9.2.2.1 Availability and Outage Tolerance Requirements

• Annual availability
  - 99%
• Outage tolerance
  - Application outages are acceptable, but data need to be recoverable, and the annual availability needs to be met.
  - No outage will be greater than 72 h. (This would only occur with a major system failure; the new system would need to reindex the database.)

Table 9.2. Effect of User-Submission Scenarios on Development and Operations

| Category | Type | Scenario 1 | Scenario 2 | Scenario 3 | Note |
|---|---|---|---|---|---|
| SHRP 2 strategic alignment | Alignment with SHRP 2 strategic goals | Medium | High | Medium | Some features are not supported in Scenario 1. |
| Cost | Direct cost of hardware | na | na | na | Team will use cloud-computing model. |
| Cost | Cost of software development | Low | Low | Low | Project team will leverage the existing data ingestion feature for Scenarios 2 and 3. |
| Cost | Cost of Internet and web services (i.e., cloud) | Medium | High | Medium | Scenario 2 cost is higher because it requires more storage space. |
| Cost | Recurrent/operation and maintenance costs | Medium | High | Medium | Scenario 2 requires ample administration time to review the artifacts submitted by the regular users. |
| Cost | Charges for provision of backup services and equipment | Medium | High | Medium | Scenario 2 requires larger storage for backed up files. |
| Technology | Back-end coding effort | Low | Low | Low | See “Cost of software development.” |
| Technology | Alignment with user requirements | Medium | High | High | |
| Technology | UI development/implementation effort | Low | Low | Low | |
| Technology | Metadata entry effort | Low | High | High | |
| Technology | Required database size | Medium | High | Medium | |
| Administration | Administration staff effort/cost for processing and validation of artifacts | Low | High | Medium | Scenario 2 requires more administration time to review the artifacts submitted by the users. |
| Risk to project and system | Risk to the project success (budget/schedule risks) | Low | Low | Low | |
| Risk to project and system | Adverse effect of technology evolution on the system operation | Low | Low | Low | |

Note: na = not applicable.
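The availability and outage requirements in Section 9.2.2.1 interact: a 99% annual availability target leaves roughly 87.6 h of outage per year, so a single outage at the 72 h limit would consume most of that allowance. The Python below is a minimal sketch of that arithmetic, assuming a 365-day year; the function names and the idea of checking a list of outage durations are illustrative assumptions, not part of the Archive design.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 h in a non-leap year (assumption)


def downtime_budget_hours(availability: float) -> float:
    """Hours of outage per year allowed by an annual availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR


def meets_requirements(outages_h, availability=0.99, max_outage_h=72.0) -> bool:
    """Check one year's outages (in hours) against both requirements."""
    outages = list(outages_h)
    longest = max(outages, default=0.0)   # no single outage may exceed the limit
    total = sum(outages)                  # all outages together must fit the budget
    return longest <= max_outage_h and total <= downtime_budget_hours(availability)


if __name__ == "__main__":
    print(round(downtime_budget_hours(0.99), 1))  # annual outage budget in hours
    print(meets_requirements([72, 10]))           # 82 h total, within the budget
    print(meets_requirements([72, 20]))           # 92 h total, over the budget
```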

9.2.2.2 Backup, Storage, and Maintenance Requirements

• Backup strategy
  - Hot backup: required;
  - Recovery testing: reconstruction of a working archive from backup artifact (annual);
  - Backup location: data center or cloud; and
  - Backup frequency: daily.
• Software maintenance
  - Server patches and updates
    * CentOS, MySQL database (quarterly);
  - Software patches and updates
    * WordPress, PHP, Highcharts (quarterly);
  - Application bugs
    * Defects that impede archive functionality: ingest, search, or download (within 4 weeks);
    * Diagnose, patch, and release;
• Disk space
  - Data storage up to 2 TB.

9.2.3 What Are the Hosting Options for the Archive?

There are four options for the long-term O&M of the SHRP 2 L13A Archive system and its artifacts:

• Option 1: server-based with server backup (using existing data center);
• Option 2: server-based with server backup (hosted at a highly available data center);
• Option 3: server-based with cloud backup (hybrid); and
• Option 4: cloud only (Amazon EC2).

Each hosting option provides an effective strategy, but each option balances risks and costs differently. In the server-only model (Option 1), the costs are lower; the risks are a single point of failure and the timeliness of recovery in the event of a catastrophic issue. Those risks can be mitigated by transitioning the server-based system to a highly available data center (Option 2). In the server/cloud-based hybrid (Option 3) or cloud-based model (Option 4), the costs are marginally higher, but the data and applications risks are mitigated through cloud-based server models. Also, the potential risks regarding meeting the IT requirements of the system operator can be avoided in the cloud-based approach. Each option is described in more detail below. Option 4 is the recommended option.

For all options, training is required as part of the transition. Training would include 5 days of training material preparation, 2 days of inside training for system administrative staff, and travel to support training activities. The cost for transition training support is $2,500.

9.2.3.1 Option 1: Server-Based with Server Backup (Existing Data Center)

The current server would continue to be the primary server running the SHRP 2 Archive. Additional details are provided below.

• Changes to the current design
  - Purchasing a second server to support backup and hot recovery;
• Benefits
  - Uses equipment already paid for, with only a marginal cost for a secondary server;
  - Redeployment is unnecessary;
• Limitations
  - Power and network are a single point of failure;
  - Same facility supports delivery and backup of application and data;
  - Equipment will need to be replaced every 3 years;
• Cost
  - New equipment
    * Additional disk space for existing server,
    * Backup server,
    * Backup system for backup server,
    * Installation,
    * Total: $6,500;
  - Staff support for two server-based systems (8 h/week): $75,000/year
    * Review and maintenance
      + Weekly deployment review,
      + Patches,
      + Functionality bug fixes;
    * Backup and recovery
      + Annual recovery verification,
      + Full backup (every 3 months) (1 to 2 TB),
      + Incremental backups
        - User-related tables (daily) (500 GB),
        - New artifacts (on upload) (10 GB), triggered by administration process.

9.2.3.2 Option 2: Server-Based with Server Backup (Hosted at Highly Available Data Center)

The current server would continue to be the primary server running the SHRP 2 Archive. Additional details are provided below.

• Changes to the current design
  - Purchasing a second server to support backup and hot recovery;
  - Moving server location to highly available (HA) data center;

• Benefits
  - Uses equipment already paid for, with only a marginal cost for a secondary server;
  - Redundant power and network;
  - Space is available for servers at HA data center;
  - Could add warranties to extend the expected lifetime of server up to 7 years (approximately $140/year);
  - Spare systems are available on site;
  - The systems are configured with operating system (OS) installations on a dedicated redundant array of independent disks (RAID)-1 pair and data storage on a separate RAID-10 array;
  - Support staff are available 24/7;
• Limitations
  - Equipment will need to be replaced every 3 years (unless warranty extension is used);
  - One-time additional cost to install system in the data center;
• Cost
  - New equipment
    * Additional disk space for existing server,
    * Backup server,
    * Backup system for backup server,
    * Installation,
    * Total: $6,500;
  - Transition to HA data center
    * Cost to move and reinstall: $2,000;
  - Staff support for two server-based systems (8 h/week): $75,000/year
    * Review and maintenance
      + Weekly deployment review,
      + Patches,
      + Functionality bug fixes;
    * Backup and recovery
      + Annual recovery verification,
      + Full backup (every 3 months) (1 to 2 TB),
      + Incremental backups
        - User-related tables (daily) (500 GB),
        - New artifacts (on upload) (10 GB), triggered by administration process.

9.2.3.3 Option 3: Server-Based with Cloud Backup (Hybrid)

The current server would continue to be the primary server running the SHRP 2 Archive. Additional details are provided below.

• Changes to the current design
  - Using Amazon S3 for backup and hot recovery;
• Benefits
  - Uses existing equipment for primary server functions;
  - Uses on-demand Amazon service to back up data once a day;
  - Provides off-site risk mitigation with data stored in a secondary location;
  - Minimizes cost by limiting the number of backups a day;
  - Pays only for the hours during which the backup runs or the cloud instance functions as the primary server during recovery;
• Limitations
  - In the event of a catastrophic failure, the maximum data loss is 24 h of data;
  - Marginal cost of an Amazon S3 backup/secondary server is higher than that of a secondary server;
  - Single server point of failure: the same facility supports delivery of data, and only the data (not the application or server) are backed up in the cloud;
• Cost
  - On-demand, large instance (Amazon S3): $8,000/year;
  - Transition to cloud database backup
    * Cost to move and reinstall: $1,000;
  - Staff support for two systems (8 h/week): $75,000/year
    * Review and maintenance
      + Weekly deployment review,
      + Patches,
      + Functionality bug fixes;
    * Backup and recovery
      + Annual recovery verification,
      + Full backup (every 3 months) (1 to 2 TB),
      + Incremental backups
        - User-related tables (daily) (500 GB),
        - New artifacts (on upload) (10 GB), triggered by administration process.

9.2.3.4 Option 4: Cloud Only (Amazon EC2)

The current server would be decommissioned and Amazon EC2/S3 would be the primary server running the SHRP 2 Archive. Additional details are provided below.

• Changes to the current design
  - Using Amazon EC2 for primary and backup application server and data;
• Benefits
  - No equipment to support or replace;
  - Data and applications are stored in a redundant system;
• Limitations
  - Cost is higher than server version and slightly higher than the cloud data backup version;
• Cost
  - For heavy reserve, large instance (Amazon EC2/S3): $8,500/year;
  - Transition to cloud hosting and database backup
    * Cost to move and reinstall: $2,000;

  - Staff support for two cloud-based systems (8 h/week): $75,000/year
    * Review and maintenance
      + Weekly deployment review,
      + Patches,
      + Functionality bug fixes;
    * Backup and recovery
      + Annual recovery verification,
      + Full backup (every 3 months) (1 to 2 TB),
      + Incremental backups
        - User-related tables (daily) (500 GB),
        - New artifacts (on upload) (10 GB), triggered by administration process.

9.2.3.5 Hosting Options Summary

Each system design described in this section provides an O&M solution for the Archive. The options include a range of physical and cloud-based machines with different configurations for the servers and the supporting database infrastructure. Embedded in each option’s technical details are variations of risk for system availability (uptime) and recovery strategy (see Figure 9.1). Given the program’s need for a large scalable Archive system, the current uncertainty about which agency will support the Archive in the long term, and the desire for redundancy of data and uptime support, having a flexible and scalable system is important. Therefore, the recommended option is Option 4. Option 4 provides the greatest flexibility in server maintenance, data transfer, and risk management. Using cloud services through a scalable system like Amazon lowers the O&M risks to the L13A system and provides redundant safety for the Archive. Amazon maintains the physical equipment and supporting infrastructure, and the contract selected with Amazon can be adjusted if the service needs to be supported in a different way in later years. The L13 report also suggested a hosting approach similar to Option 4.

9.2.4 What Are Operations and Maintenance Costs?

9.2.4.1 Annual Costs

A summary of the four options is provided in Table 9.3 and a summary of their O&M costs is provided in Table 9.4. The cost elements are described in more detail below.

9.2.4.2 Cost Elements

Running the Archive during the O&M phase incurs both one-time and ongoing costs.
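The roll-up in Table 9.4 is a sum of the one-time items plus the annual items multiplied by the number of support years. The Python sketch below reproduces that arithmetic from the Table 9.4 figures; the dictionary and function names are hypothetical and chosen only for this illustration.

```python
# Dollar figures are copied from Table 9.4; the names are hypothetical.
ONE_TIME = {
    # item:                (Opt 1, Opt 2, Opt 3, Opt 4)
    "disk space":          (500, 500, 0, 0),
    "backup server":       (3000, 3000, 0, 0),
    "backup system":       (2000, 2000, 0, 0),
    "installation":        (1000, 1000, 0, 0),
    "server transition":   (0, 2000, 1000, 2000),
    "training transition": (2500, 2500, 2500, 2500),
}
ANNUAL = {
    "staff support":         (75000, 75000, 75000, 75000),
    "server warranty":       (0, 140, 0, 0),
    "cloud, on demand":      (0, 0, 8000, 0),
    "cloud, heavy reserved": (0, 0, 0, 8500),
}


def column_sum(table: dict, option: int) -> int:
    """Sum one option's column (options are numbered 1 to 4)."""
    return sum(row[option - 1] for row in table.values())


def total_cost(option: int, support_years: int = 1) -> int:
    """One-time costs plus annual costs over the given number of support years."""
    return column_sum(ONE_TIME, option) + support_years * column_sum(ANNUAL, option)


if __name__ == "__main__":
    for option in (1, 2, 3, 4):
        print(option, column_sum(ONE_TIME, option),
              column_sum(ANNUAL, option), total_cost(option))
```

With `support_years=1`, as in Table 9.4, the totals match the table. Larger values simply repeat the annual costs and do not model equipment replacement, which the option descriptions place at roughly every 3 years for physical servers.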
The one-time costs include equipment (servers, hard drive disk space), installation of equipment, and the management of the system transition (program management, transition costs for installation/transfer of equipment, and training). The following definitions describe the one-time and annual costs.

[Figure 9.1. L13A options uptime risks. The figure arranges the four options along a risk scale running from higher to lower: (1) Server Based: application server (existing data center), server backup; (2) Server Based (HA): application server (high availability data center), server backup; (3) Hybrid: application server (existing data center), Amazon data backup; (4) Cloud Based: application server (cloud), database server (cloud).]

9.2.4.2.1 One-Time Costs

• Disk space: cost of external hard drive to back up the code, artifacts, and other files;
• Backup server: cost of the mirror server that is used when the original server fails to operate;
• Backup system: cost for the equipment used to run backups of the applications server and database files for the backup server;
• Installation: costs to install and configure new supporting physical computer equipment;
• Server transition: costs to transition existing L13A Archive to redundant facilities, whether the facilities are at a physical location or provided by a cloud service like Amazon;

• Training transition: costs to train the administrator(s)/operator(s) of the L13A Archive on the revised design for O&M and on the process of administering the Archive during the O&M phase; and
• Project management: costs to manage the transition to the O&M phase.

9.2.4.2.2 Annual Support Costs

• Server warranty: cost to purchase a warranty for physical servers that guarantees availability of parts and timely service by the equipment manufacturer;
• Annual cloud, on demand: cost to provide on-demand cloud server and database computing units;
• Annual cloud, heavy reserved: cost to provide reserved cloud server and database computing units; and
• Support years: number of years used to calculate annual cost values (Amazon hosting costs).

Table 9.3. Archive System Options Summary

| Option | Type | Primary System | Backup | Location |
|---|---|---|---|---|
| 1 | Server-based | Server | Server | Existing data center |
| 2 | Server-based | Server | Server | High availability data center |
| 3 | Hybrid | Server | Cloud (data only) | Existing/HA center/Amazon |
| 4 | Cloud-based | Cloud | Cloud | Amazon |

Table 9.4. Archive System Operations and Maintenance Costs Summary

| Item | Option 1 | Option 2 | Option 3 | Option 4 |
|---|---|---|---|---|
| Disk space | $500 | $500 | $0 | $0 |
| Backup server | $3,000 | $3,000 | $0 | $0 |
| Backup system | $2,000 | $2,000 | $0 | $0 |
| Installation costs | $1,000 | $1,000 | $0 | $0 |
| Server transition costs | $0 | $2,000 | $1,000 | $2,000 |
| Training transition | $2,500 | $2,500 | $2,500 | $2,500 |
| One-time costs | $9,000 | $11,000 | $3,500 | $4,500 |
| Annual staff support | $75,000 | $75,000 | $75,000 | $75,000 |
| Server warranty | $0 | $140 | $0 | $0 |
| Annual cloud, on demand | $0 | $0 | $8,000 | $0 |
| Annual cloud, heavy reserved | $0 | $0 | $0 | $8,500 |
| Support years | 1 | 1 | 1 | 1 |
| Annual costs | $75,000 | $75,140 | $83,000 | $83,500 |
| Total costs | $84,000 | $86,140 | $86,500 | $88,000 |

9.3 Managing Issues with Non-SHRP 2 Data

It was stated earlier that data in the SHRP 2 Archive come only from SHRP 2 Reliability-related projects. The Archive is currently configured to house a static data set; in the future, though, it could be easily and quickly reconfigured for use by others to add new Reliability-related data.

A major concern is that the SHRP 2 Archive not contain data that can be used to personally identify individuals. PII data are simply not allowed in the Archive, and numerous steps have been taken to enforce this:

1. Nearly all the travel time data come from loop detectors. Travel time from loop data is calculated from data pertaining to many vehicles that pass over a loop in a time slice such as a 5-min period. Thus it is not possible to identify individual vehicles from the loop data in the Archive.
2. For traffic detection technology that can be used to identify origins and destinations, in accordance with standard practices, derived trip lengths have been truncated at both ends so origins and destinations cannot be identified.
3. Standard practices have been employed so that no personal identifiers are associated with the data in the Archive (e.g., personal or machine identifiers have been removed from the record for an individual driver).

In addition, it is important to bear in mind that all the data were generated under contracts of the National Academy of Sciences. Under the contracts, all subject data, including the wide range of types in the SHRP 2 Archive, are owned by the National Academy, and the Academy may authorize others to publish any of the data. SHRP 2 contractors furnishing data to the Archive are fully cognizant of these provisions and, to the best that can be determined, removed all PII and proprietary data from their deliverables so as not to inhibit compliance with the Academies’ contract provisions.

To further assure the absence of PII data in the SHRP 2 Archive, a national laboratory has been conducting an independent investigation of the data in the SHRP 2 Archive to make sure there is no PII data in preparation for the Archive’s implementation.

This section provides a review of industry practices for data rights protections, author attribution, and options for protecting PII that might be inadvertently added by users of the Archive in the event that the Archive is opened up to non-SHRP contractors in the future. To enable such users to submit Reliability-related artifacts, the project team offers options to manage data licenses and address PII data protections (in case the user artifact upload feature becomes available).
The proposed options in this section have not been implemented in the Archive and are raised only to help those concerned with these issues make informed decisions on the topic should a decision be made to turn the SHRP 2 Archive into a dynamic repository in the future.

9.3.1 Open Data

SHRP 2 recognized the benefits of an open system by requiring an open data structure for the outputs of SHRP 2. Specifically, the intent was to design the Archive to be a collective and open data source to foster future research. While open systems encourage collaboration and access, the contribution of user-generated data, and therefore open data sets, adds complexity. Uncontrolled data require guiding principles for data ownership rights, data use requirements, and protection of private data. These principles will need to be effectively managed as a set of requirements that are followed by any contributors to, and users of, the data after the Archive has been fully populated with SHRP 2 Reliability-related data, as originally intended. These issues were raised during the January 2013 stakeholder meeting at the Transportation Research Board annual meeting and are further analyzed here.

SHRP 2 is not the first to provide an open archive to researchers. Fortunately, the benefits of later adoption are significant, as SHRP 2 can learn from the existing open models and their implementation of data rights management. In the last 5 years, many new open data sites have been created by the public sector. As data rights provisions are legal terms, the team examined data rights protections used in open data implementations that follow the same or similar legal structure as would apply to SHRP 2 data.

9.3.2 Open Data Licensing Options

The project team proposes two licensing options.

9.3.2.1 Option 1: Creative Commons License

One of the common content licensing tools used by a large number of sites is Creative Commons (CC). CC licensing provides a structure that is simplified, describes legal terms in plain language, and offers machine-readable licenses that can tell automated programs, including search engines, the license terms. Individuals can then include or exclude data with specific license types from their queries.

9.3.2.2 Types of Creative Commons Licenses

There are seven versions of CC licenses. The first is for “no known copyright” works, called a public domain license. The license graphic that accompanies a public domain artifact is shown in Figure 9.2.

[Figure 9.2. Creative Commons public domain license.]

If the artifact is not in the public domain, six other licenses are available, with four variables that can be selected. The licenses shown in Figure 9.3 are from the CC website at http://creativecommons.org/licenses/. The four CC license variables are the following:

• Attribution ensures that authors of the artifact are mentioned appropriately in derivative works for commercial or noncommercial use.
• Share-alike requires users of the artifact to license any derivative works under the same license terms.
• No-derivatives allows use of the artifact as is, but does not allow derivatives of the work to be created.

101 • Noncommercial allows noncommercial use of the artifact, but does not allow commercial use. The main advantages of CC licensing are clarity of license terms, ease of use, and machine-enabled rights tracking. CC offers a clear path to users in the license selection process and tools to see the specific legal terms for more sophisticated legal reviews (see http://wiki.creativecommons.org/Before_ Licensing). 9.3.2.3 Option 2: Open Knowledge Foundation—Open Databases Another data rights management system, specifically designed for databases and the content of databases, is supported by the Open Knowledge Foundation (OKF) Project (http://open datacommons.org). The OKF is a nonprofit organization based in Great Britain that supports open data projects around the globe. It discourages limitations on data, as its mission is to foster transparency and openness through the opening of data. To support the licensing of databases and data, the OKF provides three license types (http://opendatacommons.org/ licenses/): • Public Domain Dedication and License (PDDL)—public domain for data/databases; • Attribution license—attribution for data/databases; and • Open database license—attribution share-alike for data/ databases. Unlike the CC licenses, OKF licenses do not provide any limitations for commercial use or nonderivatives. The foun- dation provides a narrative that illustrates the differences in database versus data license needs for databases that might be controlled by the author and data that might be controlled under different license terms. The narrative is located at http:// opendatacommons.org/faq/licenses/#db-versus-contents. The narrative describes how to treat the different databases in terms of homogenous databases and nonhomogenous data- bases (see http://opendatacommons.org/faq/licenses/#db- versus-contents, where the license descriptions below were obtained). When the user controls the database and its content, Figure 9.3. Creative Commons license types.

102 the OKF calls the database homogenous and uses the following rights permissions: • Share-alike. Use Open Data Commons Open Database License (ODbL) plus Database Contents License (DbCL) or some other suitable contents license of your choosing. • Public domain. Use PDDL (it covers both the database and contents). When the owner of the database and the content of the data- base are different, the OKF calls the database nonhomogenous and uses the following rights: • Share-alike. Use ODbL for database qua database, plus whatever license you wish/can for contents. • Public domain. Use PDDL for database qua database, plus whatever license you wish/can for contents. Note that the CC licenses could be used in conjunction with the OKF licenses in the latter cases to appropriately license the content. 9.3.2.4 Managing Data Rights Whether Creative Commons, Open Knowledge Foundation, or an alternative licensing form is used, to ensure appropriate treatment of databases and contents of databases and the appropriate digital rights management, a business process should be in place to request that the submitting individual supply the data rights requirements of submitted databases as well as database contents. This can be done through user- based license selection and business processes written into the Archive that capture the input and the proposed license in an administrative review before posting the data, database, or other artifact to the system. To manage data rights of an open archive, many sites use open data portal software back ends that provide mecha- nisms for titling and licensing data sets. To ensure appropri- ate data rights attribution, the business process for upload and data management can be managed to ensure that users select the license to submit; an administrator is able to review the submission before the data are available to the public. This forms-based process ensures that the resulting metadata contains the license terms. 
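
The forms-based process described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Submission` class, the `ALLOWED_LICENSES` table, and the status names are hypothetical and do not describe the Archive or any particular open data portal product. The sketch only shows the three ideas in the text: the submitter must choose a license for the database and for its contents, an administrator reviews the submission before it is public, and the published metadata carries the license terms.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical short identifiers mapped to license names (assumption: a real
# deployment would list whichever licenses the Archive decides to accept).
ALLOWED_LICENSES = {
    "CC-BY": "Creative Commons Attribution",
    "CC-BY-SA": "Creative Commons Attribution-ShareAlike",
    "PDDL": "Open Data Commons Public Domain Dedication and License",
    "ODbL": "Open Data Commons Open Database License",
    "DbCL": "Open Data Commons Database Contents License",
}


class Status(Enum):
    PENDING_REVIEW = "pending_review"
    PUBLISHED = "published"
    REJECTED = "rejected"


@dataclass
class Submission:
    title: str
    submitter: str
    database_license: str   # license for the database as a whole
    contents_license: str   # license for the database contents
    status: Status = Status.PENDING_REVIEW

    def __post_init__(self):
        # The upload form refuses a missing or unknown license selection.
        for chosen in (self.database_license, self.contents_license):
            if chosen not in ALLOWED_LICENSES:
                raise ValueError(f"unknown license: {chosen!r}")

    def review(self, approved: bool) -> None:
        """Record the administrator's decision on a pending submission."""
        if self.status is not Status.PENDING_REVIEW:
            raise RuntimeError("submission has already been reviewed")
        self.status = Status.PUBLISHED if approved else Status.REJECTED

    def public_metadata(self) -> dict:
        """Return catalog metadata, including license terms, once published."""
        if self.status is not Status.PUBLISHED:
            raise PermissionError("submission is not published")
        return {
            "title": self.title,
            "submitter": self.submitter,
            "database_license": ALLOWED_LICENSES[self.database_license],
            "contents_license": ALLOWED_LICENSES[self.contents_license],
        }
```

In this sketch a submission created with `Submission("Travel times", "user1", "ODbL", "DbCL")` stays in `PENDING_REVIEW` until `review(approved=True)` is called, and `public_metadata()` raises an error before that point, so unreviewed data cannot reach the public catalog.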
9.3.3 Personally Identifiable Information

As stated above, user-contributed data sets are not permitted now but could be permitted after completion of this project. The goal of allowing users to contribute artifacts and data sets to the SHRP 2 Archive is to expand the amount of data available to researchers for future innovations and discovery. With user-contributed data, there is a risk that users could upload data that contain personally identifiable information (PII). The goal here is to raise awareness regarding this issue and its potential solutions.

The Recommendations for Standardized Implementation of Digital Privacy Controls (U.S. Federal Chief Information Officers Council 2012) expands on a strategy document, Digital Government: Building a 21st Century Platform to Better Serve the American People (White House 2012). The two documents refer to the public-sector role in data protection in the following way: “as good stewards of data security and privacy, the federal government must ensure that there are safeguards to prevent the improper collection, retention, use or disclosure of sensitive data such as PII.” A more formal definition of PII is provided in the April 2010 special publication by the National Institute of Standards and Technology called the Guide to Protecting the Confidentiality of Personally Identifiable Information (McCallister et al. 2010).

The two solutions to the PII issue are as follows:

9.3.3.1 Option 1: Reviewing and Managing PII Risk

The strategy and recommendations white papers also defined steps for handling and mitigating risk with PII. How to review PII in the SHRP 2 Archive, and whether the following are the appropriate procedures for reviewing and managing PII risks, should be considered.

1. Define PII and minimize the retention of PII (U.S. Department of Justice 2010):
   a. Complete a Privacy Threshold Analysis (PTA) (U.S. Department of Homeland Security 2012);
   b. Define PII for the SHRP 2 Archive (McCallister et al. 2010);
   c. Determine if data are linked or can be linked (“linkable”) to a specific individual;
   d. Use an existing System of Records Notice (SORN) or draft a new SORN, if required (U.S. DOT 2014);
   e. Determine the role of PII in the inventory; and
   f. Determine PII elements that are permitted.
2. Inventory and manage:
   a. Inventory PII in existing files, called an Initial Privacy Assessment (IPA);
   b. Manage PII for existing data by protecting, removing, or making the data not linkable; and
   c. Manage the process of new data sources for PII.
3. Review:
   a. Run periodic reviews of artifacts to determine if PII policies are being enforced.

If it is determined in the PTA that a Privacy Impact Assessment (PIA) is necessary, the National Institute of Standards and Technology (NIST) recommends asking the following questions in the PIA review (McCallister et al. 2010):

• What information is to be collected?
• Why is the information being collected?
• What is the intended use of the information?
• With whom will the information be shared?
• How will the information be secured?
• What choices has the agency made regarding an IT system or collection of information as a result of performing the PIA?

The NIST guide recommends several other methods that could be used to check whether PII exists in the SHRP 2 Archive and whether files that are added by users contain PII, including “reviewing system documentation, conducting interviews, conducting data calls, using data loss prevention technologies (e.g., automated PII network monitoring tools), or checking with system and data owners” (McCallister et al. 2010). Furthermore, the scope of determining how to manage PII is contingent on the risk associated with the PII data. The NIST guide also reviews how to measure risk for PII, including impact-level definitions for low-, moderate-, and high-risk PII and factors for determining PII confidentiality impact levels. Procedures of the type recommended by NIST should be followed during the PIA. Following the procedures above would align the SHRP 2 Archive with other federal data sets that follow the latest guidance and requirements for data protection.

9.3.3.2 Option 2: Defining a Formal PII Process

It will be particularly important to define a process for users to follow and terms to agree to when, in the future, they are allowed to upload data sets to the SHRP 2 Archive. When users upload files and data, it is customary to post an agreement to legal terms. The user needs to accept the terms to proceed. The language should indicate that the data being uploaded are free of PII and that the data have been unlinked or made anonymous.
Additionally, before becoming accessible to users of the Archive, the data should be posted to an administrative area for a review of the data set to determine if the data contain any PII.

Once data are in the Archive, the administrator must serve as a data steward who performs an initial review of all the data. For example, the administrator could use a set of automated tools or manual processes to review the file(s) that will be uploaded. Automated processes to check and ensure anonymous data could be applied for known PII patterns, such as MAC addresses from Bluetooth readers. These processes do not provide a foolproof mechanism, but each step reduces the risks and assigns traceability to the appropriate parties. (Realistically, no one is going to have a reason to go to the time or trouble to download Bluetooth data from the SHRP 2 Archive and try to infer personal information from a trip record for which an equipment identification number has been expunged and then try to go to the next step to link a trip to an individual.)

While the Archive is actively managed, it is desirable to conduct periodic reviews or audits to check the data files for PII, and best practices/policies should be followed.
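
As one illustration of an automated check for a known PII pattern, the Python sketch below scans text lines for strings that look like MAC addresses. It is a sketch under stated assumptions, not a procedure taken from the report or from the NIST guide: the function name `find_mac_addresses`, the pattern, and the sample rows are all hypothetical, and the pattern covers only the common form of six colon- or hyphen-separated hexadecimal pairs. As the text notes, such a check is not foolproof; it would flag files for the administrator's review rather than certify them as free of PII.

```python
import re

# Six two-digit hexadecimal groups separated by ":" or "-",
# for example 00:1A:2B:3C:4D:5E. Other MAC notations are not covered.
MAC_PATTERN = re.compile(r"\b[0-9A-Fa-f]{2}(?:[:-][0-9A-Fa-f]{2}){5}\b")


def find_mac_addresses(lines):
    """Return (line_number, matched_text) pairs for MAC-like strings."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in MAC_PATTERN.finditer(line):
            hits.append((number, match.group(0)))
    return hits


# Hypothetical rows from an uploaded Bluetooth reader file.
sample = [
    "trip_id,device,timestamp",
    "17,00:1A:2B:3C:4D:5E,2013-01-14T08:30:00",
    "18,anon-4f2c,2013-01-14T08:31:10",
]

print(find_mac_addresses(sample))  # -> [(2, '00:1A:2B:3C:4D:5E')]
```

A file with any hits could be held in the administrative area until the identifiers are removed or replaced with anonymous values, which keeps the decision with the data steward while recording which lines triggered the hold.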


TRB’s second Strategic Highway Research Program (SHRP 2) Report S2-L13A-RW-1: Designing the Archive for SHRP 2 Reliability and Reliability-Related Data explores the development, testing, and deployment of the SHRP 2 Reliability Archive system. This archive is a repository that stores the data and information from SHRP 2 Reliability and Reliability-related projects.

This project also produced a document that outlines the high-level architecture of the SHRP 2 Archive system.
