Maintaining quality data

The Bus Open Data Service is committed to ensuring that the most robust quality of data is available on BODS. As such, we take special care to maintain high data quality in the ways outlined below.

Timetables data

BODS Compliant data

For timetables data, BODS currently mandates that all publishers produce data in the TransXChange 2.4 v1.1A PTI profile. This profile extends the general schema: it makes certain fields mandatory and clarifies the specific use of some other TransXChange fields. The full PTI profile can be found here:
https://pti.org.uk/system/files/files/TransXChange_UK_PTI_Profile_v1.1.A.pdf

Timetables data in the approved version above is known as 'BODS Compliant data' and is flagged as such in different parts of the Find Bus Open Data Service (including Browse, Download all, Data Catalogue and the API).

Data Quality reports and scoring

Beyond the BODS Compliant tagging, BODS also runs data quality checks that capture key data quality observations which, if left unaddressed, would adversely affect the passenger experience. The full list of checks can be found here.

BODS also generates a data quality score, which is shown in different parts of the Find Bus Open Data Service (including Browse, Changelog, Download all, Data Catalogue and the API). The mechanism the BODS algorithm uses to generate the score can be found here.

Fares data

BODS will deliver new additions to the service so that publishers can perform data quality checks on fares data. Updates on this should be available to consumers via the Find Bus Open Data site very soon. Please contact us with any further queries.

Bus location data

SIRI-VM data is taken into a central AVL system, where it is harmonised to produce a consistent SIRI-VM 2.0 output of bus location data for open data consumers.

We have introduced a SIRI-VM validator to BODS to ensure the highest data standards are provided to consumers. The validator has two parts: the first checks the feed against the schema, and the second checks for the mandatory fields specified within the DfT BODS profile. If a feed fails the schema check, it is put into an 'inactive' status. The validator checks 250 packets from each feed each day.
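
The two-part check can be pictured roughly as follows. This is only a sketch under stated assumptions, not the BODS validator: Python's standard library cannot validate against an XSD, so a well-formedness parse stands in for the schema check, and the `MANDATORY` list, the `check_packet` name and the 'active' label are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET

# Standard SIRI XML namespace.
SIRI_NS = "{http://www.siri.org.uk/siri}"

# Hypothetical subset of mandatory fields, for illustration only.
MANDATORY = ["RecordedAtTime", "LineRef", "OperatorRef", "VehicleRef"]


def check_packet(xml_text):
    """Return (status, missing), where missing has one list per VehicleActivity."""
    # Part 1: stand-in for the schema check (well-formedness only, an assumption).
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "inactive", []

    # Part 2: look for the mandatory fields inside each VehicleActivity.
    missing = []
    for activity in root.iter(SIRI_NS + "VehicleActivity"):
        present = {
            element.tag.replace(SIRI_NS, "")
            for element in activity.iter()
            if (element.text or "").strip()  # treat empty elements as missing
        }
        missing.append([field for field in MANDATORY if field not in present])
    return "active", missing
```

A packet that cannot be parsed is reported as 'inactive'; otherwise the function lists, per vehicle activity, which of the assumed mandatory fields are absent or empty.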

Given the current level of industry readiness to provide consistent SIRI-VM data, feeds will not be blocked as long as they are valid SIRI (that is, they pass the schema check). However, BODS compliance tags will be attached to show whether each feed is 'compliant', 'partially compliant' or 'non-compliant', using a 7-day rolling average. The validator looks at the last 7 days' worth of aggregate SIRI-VM data and assigns a compliance status accordingly.

A SIRI-VM feed will be deemed 'compliant' if every field listed below is present more than 70% of the time over the last 7 days.

  • Bearing
  • LineRef
  • OperatorRef
  • RecordedAtTime
  • ResponseTimestamp
  • VehicleJourneyRef
  • VehicleLocation (Lat, Long)
  • ProducerRef
  • DirectionRef
  • BlockRef
  • PublishedLineName
  • ValidUntilTime
  • DestinationRef
  • OriginName
  • OriginRef
  • VehicleRef
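
As a rough illustration of this rule, the sketch below computes how often each field is populated across a set of records and applies the 70% threshold. The record shape (one flat dict per vehicle activity), the function names and the treatment of empty strings as missing are assumptions made for the example, not a description of how BODS implements the check.

```python
# Fields listed above for a 'compliant' status.
COMPLIANT_FIELDS = [
    "Bearing", "LineRef", "OperatorRef", "RecordedAtTime",
    "ResponseTimestamp", "VehicleJourneyRef", "VehicleLocation",
    "ProducerRef", "DirectionRef", "BlockRef", "PublishedLineName",
    "ValidUntilTime", "DestinationRef", "OriginName", "OriginRef",
    "VehicleRef",
]


def population_rates(records, fields):
    """Fraction of records in which each field has a non-empty value."""
    records = list(records)
    if not records:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / len(records)
        for field in fields
    }


def is_compliant(records, threshold=0.70):
    """True only if every listed field is present MORE than 70% of the time."""
    rates = population_rates(records, COMPLIANT_FIELDS)
    return all(rate > threshold for rate in rates.values())
```

Note the strict comparison: a field populated in exactly 70% of records does not pass, following the "more than 70%" wording.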

A SIRI-VM feed will be deemed 'partially compliant' if all other mandatory fields are present and the only fields missing 70% of the time in the last 7 days are from the list below.

  • BlockRef
  • PublishedLineName
  • DestinationRef
  • OriginName
  • OriginRef

A SIRI-VM feed will be deemed 'non-compliant' if any of the fields below is not present more than 70% of the time over the last 7 days. It can also be assigned a non-compliant status directly if any one of the fields below falls under 45% population at the time of the daily validation check. This counts as a gross error in the data and is highlighted to the publisher right away.

  • Bearing
  • LineRef
  • OperatorRef
  • RecordedAtTime
  • ResponseTimestamp
  • VehicleJourneyRef
  • VehicleLocation (Lat, Long)
  • ProducerRef
  • DirectionRef
  • VehicleRef
  • ValidUntilTime
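
Taken together, the three statuses could be assigned along the lines of the sketch below. It assumes the population rates have already been computed as fractions between 0 and 1, reads 'partially compliant' as "one or more of the five secondary fields fall short of the 70% threshold", and uses hypothetical names (`classify`, `CORE_FIELDS`, `SECONDARY_FIELDS`); the actual BODS logic may differ.

```python
# Fields whose absence leads to 'non-compliant' (list above).
CORE_FIELDS = [
    "Bearing", "LineRef", "OperatorRef", "RecordedAtTime",
    "ResponseTimestamp", "VehicleJourneyRef", "VehicleLocation",
    "ProducerRef", "DirectionRef", "VehicleRef", "ValidUntilTime",
]

# Fields whose absence leads only to 'partially compliant'.
SECONDARY_FIELDS = [
    "BlockRef", "PublishedLineName", "DestinationRef",
    "OriginName", "OriginRef",
]


def classify(rolling_rates, daily_rates, threshold=0.70, gross_error=0.45):
    """Assign a status from 7-day rolling rates and today's daily-check rates."""
    # Gross error: any core field under 45% populated in today's check.
    if any(daily_rates.get(f, 0.0) < gross_error for f in CORE_FIELDS):
        return "non-compliant"
    # A core field not present more than 70% of the time over 7 days.
    if any(rolling_rates.get(f, 0.0) <= threshold for f in CORE_FIELDS):
        return "non-compliant"
    # Core fields are fine, but a secondary field falls short (assumed reading).
    if any(rolling_rates.get(f, 0.0) <= threshold for f in SECONDARY_FIELDS):
        return "partially compliant"
    return "compliant"
```

Checking the gross-error condition first means a single bad day can override an otherwise healthy 7-day average, which matches the "right away" wording above.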

Other compliance statuses:

  • Undergoing validation: This status will be used for all newly added feeds in the first 24 hours, until initial checks are completed. It will also be used for all compliant feeds for the first 7 days, until the 'automated flow' rolling validation logic becomes active.
  • Awaiting publisher review: This status will be used for all feeds in the first 7 days after publishing if one or more critical or non-critical fields have not been provided by more than 70% of vehicles in a daily check.
  • Unavailable due to dormant feed: This status will be used for all feeds that have had no vehicles running for 7 consecutive days and have therefore repeatedly missed validation.

New feed validation process:

When a new feed is added to BODS it will be validated in the following way:

  1. 24 hours after a new SIRI feed is added, the validator will check it against the mandatory fields and, if necessary, an error report will be sent to operators.
  2. Over the subsequent 6 days, while data is flowing through, it will continue to run randomised daily checks.
  3. After day 7, a fresh automated validation check will run each day and a compliance status will be assigned based on a 7-day rolling average.
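
The timeline above can be expressed as a small function of feed age. This is a minimal sketch: the phase labels and the `validation_phase` name are hypothetical, and the exact boundaries (24 hours and 7 days from the moment the feed is added) are an assumed reading of the steps.

```python
from datetime import datetime, timedelta


def validation_phase(added_at, now):
    """Return which stage of the new-feed process applies at time `now`."""
    age = now - added_at
    if age < timedelta(hours=24):
        # Step 1: the first mandatory-field check has not yet run.
        return "awaiting first check"
    if age < timedelta(days=7):
        # Step 2: randomised daily checks while data flows through.
        return "randomised daily checks"
    # Step 3: daily automated check with a 7-day rolling average.
    return "7-day rolling validation"
```

For example, a feed added at 09:00 on a Monday would be in the second phase from 09:00 on Tuesday and in the third from 09:00 the following Monday.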

Automated feed validation process:

  1. The validator will run 1 randomised check per day (excluding buses running from 12am-5am).
  2. The validator will check 250 packets from a feed each day.
  3. At least 70% of vehicles on the feed need to populate the mandatory fields for the feed to avoid moving into a non-compliant or partially compliant status. For example, 'Bearing' should be present in 70% of the last 7 days' worth of data; if not, the feed will move to a non-compliant status.
  4. If, in the daily check, any of the fields in the non-compliant list is less than 45% populated, the feed's compliance status will automatically move to 'non-compliant', as this is a gross error.
  5. If the fields in the non-compliant list are all more than 45% populated in the daily check, the rolling average check will apply and assign a compliance status based on the last 7 days.
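
The sampling and rolling-window steps might look something like the sketch below. It assumes each packet is a dict with a `recorded_at` datetime, treats "12am-5am" as hours 00:00 to 04:59, and pools counts across days rather than averaging daily percentages; `daily_sample` and `RollingRate` are hypothetical names, not part of BODS.

```python
import random
from collections import deque
from datetime import datetime


def daily_sample(packets, size=250, rng=None):
    """Randomly pick up to `size` packets, skipping those recorded 00:00-04:59."""
    rng = rng or random.Random()
    pool = [p for p in packets if p["recorded_at"].hour >= 5]
    return rng.sample(pool, min(size, len(pool)))


class RollingRate:
    """Population rate of one field pooled over the most recent `days` checks."""

    def __init__(self, days=7):
        # Results older than the window are dropped automatically.
        self._days = deque(maxlen=days)

    def add_day(self, populated, checked):
        self._days.append((populated, checked))

    def rate(self):
        checked = sum(c for _, c in self._days)
        if checked == 0:
            return 0.0
        return sum(p for p, _ in self._days) / checked
```

Passing a seeded `random.Random` as `rng` makes a sample reproducible, which can help when investigating why a particular daily check produced a given result.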

Other development resources