Chapter 6

Best Practices

This section provides supporting information on how best to use the data structure and proposed tools. While not exhaustive, the best practices described here target common situations or issues identified in transit agency interviews, stakeholder workshops, and testing of the data structure. As with existing data structures and specifications, this data structure may evolve as use grows, and additional best practices may be added.

6.1 Versioning Data

ITS vendors typically provide transit agencies with data at regular intervals (e.g., daily, weekly). They may also provide new versions of historical data that have been corrected after the fact. When updated versions of ITS data are provided, agencies should process these data using the tools described in this report, including the Format Validation, Data Quality, and Data Transfer Tools (if used). Because these tools also integrate GTFS and Supporting Data Files, agencies should ensure that the GTFS and supporting files correspond to the same time period as the ITS data.

We recommend that agencies maintain old versions of data for future reference even after new versions are received, processed, and stored. It is best practice to mark old versions as deprecated. Including the date of data receipt in file names can also help track multiple versions. Notes on the changes made should accompany each new version. Deprecated versions can be deleted after some time interval; agencies should set a regular purging period based on storage capacity and the usefulness of keeping archival data.

6.2 Uniting Data from Multiple ITS Sources

Often agencies have multiple sources of the same type of ITS data (e.g., AVL, AFC, or APC).
Different vehicles, or vehicles from different modes, may be equipped with hardware from different vendors, creating data outputs with varied formats, including variations in field names or in the fields included. Where systems are redundant, agencies should identify which systems are best suited for generating KPIs by balancing data quality with coverage. For example, one set of devices may be installed on 100 percent of vehicles but have known data quality issues that make it inferior to a newer, more advanced set of devices installed on 50 percent of vehicles. In this example, it may be worthwhile for an agency to use the inferior device data only for vehicles that do not have the newer devices.

Where more than one system collects the same type of data, such as an AFC system collecting fare transactions and an APC system collecting boardings, the agency may want to store both sets of data, adding an attribute field to the dataset that designates the source system.

To manage data originating from systems with different formats, agencies have two options:

1. Use distinct tools to process each set of Vendor Outputs into either the Event or Summary Data Files; or
2. Bind Vendor Outputs prior to using a tool to process the combined Vendor Outputs into either the Event Data Files or Summary Data Files. When binding data, fields that are common to each dataset should be combined into a single table, with an additional field created to note the data source for tracking purposes.

The second option makes the most sense when most fields are similar, the data originate from a single mode, and each record is at the same level of granularity. The first option may be preferable when fields vary widely or the data originate from different modes. If datasets are not at the same level of granularity (e.g., in one dataset each record is one stop visit on a single trip instance, while in another each record is an average of multiple stop visits on a scheduled trip over the course of a month), they should not be bound before processing into the Event or Summary Data Files.

6.3 Integrating the Data Structure with GTFS

The data structure integrates with GTFS primarily through three fields: stop_id, stop_sequence, and trip_id. Matching these fields between this data structure and GTFS ensures consistency and can enable analysis and tool development using data across both structures.

GTFS data are typically released periodically, reflecting schedule changes over time. To integrate GTFS with the ITS data structure, identifiers should be kept as consistent as possible between schedule periods to enable easy analysis across periods.
For example, when trip start times shift only slightly from one version of a timetable to the next, and no trips are added, maintaining the same trip_id allows for easy aggregation and comparison across schedule periods. New GTFS trip_ids should be generated only when trips are added, runtimes are adjusted, headways are changed, or start times are shifted significantly. While also beneficial to GTFS-specific analysis, this consistency is especially beneficial to ridership and performance analysis sourced from ITS data.

Some regions have distinct GTFS feeds for different modes, operators, or service providers that serve the same region. Because the transit ITS data structure integrates data across modes and operators, it is critical that identifiers are not repeated across modes or service providers.

6.4 Cleaning Data

The Format Validation and Data Quality Tools, described in Chapter 1, are each designed to flag potential errors and issues with the data. This section provides some best practices for addressing these errors and issues.

6.4.1 Formatting Validation

Formatting issues identified by the Format Validation Tool typically need to be resolved through an investigation into processes, as they usually result from coding issues in data-producing and data-processing systems.

• ID fields: For fields that are used as IDs and fields that form the unique key for a file, check that the processes for generating these IDs are not generating duplicates. Duplicates may occur when records from multiple sources are combined. (Section 6.6 has suggestions for generating unique IDs.)
• Enum fields: For enum fields, errors will occur if values do not conform to the enum values specified for the field. The best practice is to develop and maintain documentation on how data from a particular system should be represented and mapped to the available enum values. This might be an exploratory and iterative process.

6.4.2 Data Quality Flags, Missing Data, and Poor-Quality Data

The Data Quality Tool flags data that are missing, of poor quality, or of questionable validity. Agencies have discretion in how they deal with data quality issues, with two main alternatives: populating missing data with imputed or otherwise valid values, or designating data as missing.

• Populating missing or questionable data: With this approach, analysts may opt to cap specific data items at a predetermined minimum or maximum value (such as setting the maximum bus speed at 45 mph or another limit), or they can impute a default value based on other valid entries (for example, using the median or the mean). This approach is most appropriate when questionable data occur randomly and are not attributable to a systematic error. It may also be applied to estimate KPIs where precision is not critical. However, practitioners should take care to determine whether this process biases estimates. If it does, it should not be used.
• Designating data as missing: With this approach, data are simply marked as missing using not applicable (NA) or similar. For example, if the Data Quality Tool finds unrealistic bus speeds of 60 mph on a trip, the fields used to derive the speed would be set to NA, as if the data had been missing. When designating missing data, it is important to set individual fields to NA and not delete records from the dataset.
This approach is appropriate when there is limited information on the source of the data quality issue or when precise data are required.

Whichever alternative is applied, it is recommended to maintain both the original and cleaned versions of the data. Versions should be dated and labeled as “original” and “cleaned.” “Cleaned” versions should be accompanied by notes on the cleaning processes applied.

6.4.3 Designating Start and End of a Trip

One of the most consistent data quality issues with AVL and APC systems is detecting each trip’s correct start and end. Resolving this issue is important for correctly assigning data records to trips within the data structure. Agencies should review any geolocation data stored for the initial and final stops of each trip to ensure that these stops are located where vehicles actually start, end, and lay over. If needed, it may be useful to create additional stop geolocation designations reflecting these locations so that the start and end of trips are captured correctly by ITS software.

Additionally, it might be necessary to adjust geofencing rules within the ITS software so that the start and end of routes and lines are captured accurately. For example, if a single stop is used as a layover location for multiple bus routes, vehicles may stack along the curb. Depending on the geofencing rules and the geolocation of the stop, vehicles arriving and laying over at the far end of the location may not cross the geofence until beginning the next trip, thereby falsely flagging the previous trip as late. Agencies can identify these types of data issues through periodic analysis of the data, looking for trips that arrive on time at timepoints along the route but late only at the final timepoint.
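
The periodic check described for trip ends could be sketched roughly as follows. This is a minimal illustration, not part of the report's tools: the record fields scheduled_arrival and actual_arrival (POSIX seconds) and the 5-minute lateness threshold are assumptions made for the example, while trip_id_performed and stop_sequence follow the field names used in this chapter.

```python
from collections import defaultdict


def late_only_at_final_timepoint(stop_visits, late_threshold_s=300):
    """Return IDs of trips that are on time at every intermediate
    timepoint but late at the final one.

    Each record is a dict with trip_id_performed, stop_sequence, and the
    hypothetical fields scheduled_arrival and actual_arrival (POSIX seconds).
    The 300-second threshold is an assumed agency lateness definition.
    """
    by_trip = defaultdict(list)
    for visit in stop_visits:
        by_trip[visit["trip_id_performed"]].append(visit)

    flagged = []
    for trip_id, visits in by_trip.items():
        if len(visits) < 2:
            continue  # need at least one intermediate and one final timepoint
        visits.sort(key=lambda v: v["stop_sequence"])
        delays = [v["actual_arrival"] - v["scheduled_arrival"] for v in visits]
        intermediate_on_time = all(d <= late_threshold_s for d in delays[:-1])
        final_late = delays[-1] > late_threshold_s
        if intermediate_on_time and final_late:
            flagged.append(trip_id)
    return sorted(flagged)
```

Trips flagged repeatedly at the same terminal may point to a stop location or geofence that needs review, rather than to genuinely late service.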
6.5 Maintaining Information on Different Types of Trips

ITS information should always be linked to a trip_id_performed, regardless of whether the trip was scheduled, unscheduled, or a deadhead (non-revenue) trip. Each trip should have a unique ID that represents the trip instance. If the trip was scheduled, it will have both a trip_id_performed and a trip_id_scheduled.

When scheduled trips are not operated, it is important that the data reflect this so that trips that did not operate are not mistaken for unsampled trips or for trips that operated but have missing data. In the Trips_performed file, the schedule_relationship field enables transit agencies to document several trip designations: scheduled, skipped, missing data, unscheduled, canceled, duplicated, and schedule modified. Transit agencies should maintain definitions for each of these categories within their agency. To populate this field, they should develop processes that may include record-keeping by operators or dispatchers, as well as inferences from automatically collected data. If agencies know that some scheduled trips will not be sampled because only part of their fleet has ITS devices, they should keep records of which trips are sampled and which are not.

The Trips_performed file also includes the in_service field to maintain information on trip types, including non-revenue trips. Again, transit agencies will need to develop methods to populate this field. They may include flags in automated data for different trip types, such as deadheads or layovers; they may develop processes to infer trip type based on the trip trajectory; or operators or dispatchers may maintain records of these types of trips.

6.6 Generating and Maintaining Unique Identifiers

Using serial numbers to uniquely identify records is common practice. Serial numbers can guarantee uniqueness but are not inherently meaningful and are thus hard to parse quickly, so a meaningful prefix can be a useful addition.
An alphanumeric prefix may indicate that a record pertains to a specific category or location. For example, the trip_id_performed might begin with a prefix representing the mode, route number, or depot, depending on what is useful to the transit agency for analysis or quality control. Typically, three to five prefix characters are enough to distinguish between entities without adding excessive length to the ID. Ideally, the number of characters (letters or numbers) that make up a unique identifier is held constant, which makes the entire ID easier to parse using regular expressions.

6.7 Using Specific Fields

6.7.1 Date and Time Fields

The specification uses GTFS definitions for dates and timestamps. Date refers to service_date, the date associated with the trip’s service. A service_date may include trips that run after midnight on the following calendar date, as these are often part of the same service block. As such, a single service_date may be associated with multiple calendar dates.

All the time fields use POSIX time, which is defined as the number of seconds since January 1, 1970, at 00:00:00 Coordinated Universal Time (UTC). Therefore, times after midnight pose no issue. Agencies will need to convert scheduled times from the GTFS HH:MM:SS format to POSIX time for integration in the data structure. If desired, agencies can maintain additional time fields in the HH:MM:SS format for readability. However, POSIX time, as used in GTFS-Realtime, is standard for timestamp data, and it provides a universal time across time zones.
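
The conversion from GTFS HH:MM:SS to POSIX time could look something like the sketch below. The function name and arguments are hypothetical; the sketch assumes the service_date is given as a YYYYMMDD string and relies on the GTFS convention that scheduled times are measured from "noon minus 12 hours" in the agency's local time zone on the service day, which is what allows values such as 25:30:00 for trips that run past midnight.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def gtfs_time_to_posix(service_date: str, gtfs_time: str, agency_tz: tzinfo) -> int:
    """Convert a GTFS scheduled time on a service date to POSIX seconds.

    service_date: "YYYYMMDD" string (assumed input format).
    gtfs_time:    "HH:MM:SS"; hours may exceed 23 for trips after midnight.
    agency_tz:    tzinfo for the agency's time zone, for example
                  zoneinfo.ZoneInfo("America/New_York") where the time zone
                  database is available.
    """
    hours, minutes, seconds = (int(part) for part in gtfs_time.split(":"))

    # Local noon on the service day is the GTFS reference point.
    local_noon = datetime.strptime(service_date, "%Y%m%d").replace(
        hour=12, tzinfo=agency_tz
    )

    # Add elapsed seconds to the POSIX value of noon (rather than doing
    # wall-clock arithmetic) so daylight saving changes are handled.
    elapsed_since_noon = (hours - 12) * 3600 + minutes * 60 + seconds
    return int(local_noon.timestamp()) + elapsed_since_noon


# Example call with a fixed UTC-5 offset standing in for an agency time zone.
example = gtfs_time_to_posix("20240115", "25:30:00", timezone(timedelta(hours=-5)))
```

Anchoring on local noon instead of local midnight is a deliberate choice: on days when clocks change, midnight plus the scheduled time would be off by an hour, whereas noon is unaffected.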
6.7.2 Agency-determined Categories in fare_transactions

Some fields in the fare_transactions table are intended to work similarly to enum fields, in that a limited number of values are expected, but with the set of possible values determined by the transit agency producing the table. For example, the rider_category field is a text field that, in practice, may be limited to a few values such as “Adult,” “Youth,” or “Student.” Some transit agencies may have different relevant rider classes, such as “University Faculty” or “Transit Agency Personnel.” As with enum fields, best practice is to develop and document a mapping procedure that makes coding conventions clear.

6.7.3 Field Stop_visits.stop_id vs Stop_visits.stop_sequence

When a vehicle visits a stop, the stop itself is identified by the Stop_visits.stop_id field, and the position of the visit within the trip is indicated by the Stop_visits.stop_sequence field. The stop sequence is necessary to uniquely identify a visit to a stop because some trips may visit the same stop more than once.

6.7.4 Field Devices.device_id

The device_id can be used to associate data with a specific device located on a vehicle, at a stop, or at a station. Examples of devices include a fare collection device, a door sensor that counts passenger entries or exits, or a vehicle location device. Like other unique IDs, the primary criterion for a device ID is that it is unique. It does not have to be different from a vehicle ID; in fact, if a vehicle has only one device or one primary device, the vehicle ID can simply be used as the device ID. Similarly, a stop ID for a station with only one device may be used as the device ID.

6.8 Using the Data Structure in Special Cases

Some special use cases are detailed in the following subsections.
6.8.1 Multiple Operators for a Single Trip

If multiple operators operate a single trip, the trip should be split into separate records in the Trips_performed table, one representing each leg of the trip operated by a single operator. Multiple records in the Trips_performed table may refer to the same scheduled trip in the GTFS Trips file.

6.8.2 Trains and Other Vehicles That Reverse Direction

Since trains frequently stop at a destination to complete a trip, then reverse direction to begin a new trip, the same train consist may be used with a different order of component cars. Records in the Train_cars file representing individual train cars are associated with vehicles (train consists) via the Vehicle_train_cars file. There are two ways to represent a train that reverses direction using the Vehicle_train_cars file:

• If the Vehicle_train_cars.order field is not used, the same record in the Vehicles file can represent a train as it moves in both directions. This method can be used if the order is not important for a transit agency’s analysis use cases.
• Alternatively, if the Vehicle_train_cars.order field is used, a separate Vehicle_id and record in the Vehicle_train_cars and Vehicles files must be created for each direction, reflecting the two different orders.
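
As an illustration of the second approach, the sketch below builds Vehicle_train_cars-style rows for one consist in both directions. It is a hypothetical example only: the helper name and the train_car_id key are assumptions, and the vehicle IDs are made up; only the idea of one vehicle ID per direction with its own car order comes from the text above.

```python
def consist_rows_for_both_directions(vehicle_id, reverse_vehicle_id, car_ids):
    """Return rows linking train cars to two vehicle IDs, one per direction.

    car_ids lists the cars from front to back in the forward direction.
    The reversed consist gets its own vehicle ID and the opposite order.
    Keys are illustrative; "order" mirrors Vehicle_train_cars.order.
    """
    forward = [
        {"vehicle_id": vehicle_id, "train_car_id": car, "order": position}
        for position, car in enumerate(car_ids, start=1)
    ]
    backward = [
        {"vehicle_id": reverse_vehicle_id, "train_car_id": car, "order": position}
        for position, car in enumerate(reversed(car_ids), start=1)
    ]
    return forward + backward


# Example: a three-car consist with made-up identifiers.
rows = consist_rows_for_both_directions("TR01-A", "TR01-B", ["C1", "C2", "C3"])
```

Under this sketch, each of the two vehicle IDs would also need its own record in the Vehicles file, as the text requires.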
6.8.3 Payments Made Off-Board and Mobile Payments

All fare transactions are associated with a device, but that device may not be located on a vehicle. If the device ID is associated with a fare gate in a station or with a mobile phone payment, it may not be possible to associate the payment with a specific trip performed. For recording purposes, the data structure does not require fare transactions to be associated with a trip.

• For fare payments made at a station, the device can be associated with a station rather than a vehicle in the Devices table, and the information can be stored in the Fare_transactions file (for individual transactions) and in the Station_activities file, where transactions are associated with a stop and time period.
• For mobile payments, transit agencies have different options depending on what data they receive. If they have data on individual mobile transactions, each mobile phone could be assigned a device ID that is associated with neither a stop nor a vehicle. The transactions can then be stored in the Fare_transactions file. Alternatively, if they receive aggregated data on mobile payments, they can generate a special stop ID for mobile payments that does not match any existing stop ID. The transactions can then be stored in the Station_activities file.

6.8.4 Payments Associated with a Trip or Vehicle

In some cases, transit agencies receive data on the trip or vehicle on which a transaction occurred but lack data on the specific stop where it occurred. In these cases, the stop_id and stop_sequence fields may be left null in the Fare_transactions and Stop_visits files, and the fare transaction information would apply to the provided identifiers (i.e., an entire trip, a vehicle, and/or a device) for that service date.
Alternatively, if a transit agency routinely collects trip-level revenue, it could add a field for aggregate revenue for each trip to the Trips_performed file, which would be populated directly from the Vendor Outputs. Similarly, if a transit agency routinely collects vehicle-level revenue, it could add a vehicle-based file, similar to the Station_activities file, to track revenue collected by each vehicle over exclusive time periods.

6.8.5 Boardings and Alightings Associated with a Specific Train Car

As described in Section 3.2.1, some transit agencies may wish to track boardings by train car, in particular to measure crowding along a rail platform. By default, this data structure enables transit agencies to store vehicle boarding and alighting data either by device (in the Passenger_events file) or by vehicle, stop, and trip (in the Stop_visits file). The Stop_visits file allows data to be recorded by front and back (or left and right) door, but for multi-car train consists, some transit agencies may wish to record data by vehicle, stop, trip, and train car.

This can be accomplished through a special file that summarizes boardings and alightings at the vehicle, stop, trip, and train car level. This file could be summarized from the Passenger_events file. Alternatively, if a transit agency is not using the Event Data Files, it may be generated directly from Vendor Outputs. The data in this new special file could then be summarized in the Stop_visits file.

6.9 Communicating and Using Results

Using the Data Analysis tools, transit agencies can summarize KPIs across service days, periods, and transit routes, and at transit stops. For each level of data aggregation, transit agencies should determine a minimum reliable sampling threshold. For example, an AVL system may produce a reliable data sample at the trip level across a single month of sampling, but to produce a reliable result at the timepoint level, the sample must extend across a 3-month period. Clearly defining the sample period and number of observations required for each data type and each level of aggregation is critical to the appropriate application of data in planning processes.

The analysis produced from this specification should also clearly relay the sample period, the number of records sampled, and confidence levels. This will signal to those interpreting the data how reliable the results are.
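
One way to report a KPI together with its sample period, number of observations, and a confidence interval is sketched below. This is an assumption-laden illustration, not the report's Data Analysis tool: the function name, the default threshold of 30 observations, and the use of a normal-approximation 95 percent interval are all choices made for the example, and each agency would set its own thresholds per data type and aggregation level.

```python
import math
from statistics import mean, stdev

Z_95 = 1.96  # normal-approximation multiplier for a 95 percent interval


def summarize_kpi(values, sample_period, min_observations=30):
    """Summarize KPI observations with the context needed to judge reliability.

    values:           numeric observations (e.g., running times in seconds).
    sample_period:    free-text label for the period sampled.
    min_observations: assumed agency-defined minimum reliable sample size.
    """
    n = len(values)
    summary = {
        "sample_period": sample_period,
        "n_observations": n,
        "meets_threshold": n >= min_observations,
    }
    if n >= 2:  # a standard deviation needs at least two observations
        center = mean(values)
        half_width = Z_95 * stdev(values) / math.sqrt(n)
        summary.update(
            mean=center,
            ci_low=center - half_width,
            ci_high=center + half_width,
            confidence_level=0.95,
        )
    return summary


# Example with made-up values: too few observations to meet the threshold.
report = summarize_kpi([10.0, 12.0, 14.0], sample_period="2024-01")
```

Reporting meets_threshold alongside the estimate, rather than suppressing results below the threshold, lets readers see both the value and how much weight it can bear.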
Improving Access and Management of Public Transit ITS Data. Transportation Research Board, 500 Fifth Street, NW, Washington, DC 20001. ISBN 978-0-309-68745-4.