The feed Module

Some tools for computing stats from a GTFS feed, assuming the feed is valid.

All time estimates below were produced on a 2013 MacBook Pro with a 2.8 GHz Intel Core i7 processor and 16 GB of RAM running OS X 10.9.

TODO:

  • Possibly store dates as ‘%Y%m%d’ strings instead
  • Possibly scoop out the main logic from Feed.get_stops_stats() and Feed.get_stops_time_series() and put it into top-level functions for the sake of greater flexibility, similar to what I did for Feed.get_routes_stats() and Feed.get_routes_time_series().
  • Speed up time series calculations

class gtfs_toolkit.feed.Feed(path, original_units='km')

Bases: builtins.object

A class to gather all the GTFS files for a feed and store them in memory as Pandas data frames. Make sure you have enough memory! The stop times object can be big.

add_dist_to_shapes()

Add/overwrite the optional shape_dist_traveled GTFS field for self.shapes.

NOTE:

Takes about 0.33 minutes on the Portland feed. All of the calculated shape_dist_traveled values for the Portland feed differ from the original values by at most 0.016 km in absolute value.

add_dist_to_stop_times(trips_stats)

Add/overwrite the optional shape_dist_traveled GTFS field in self.stop_times.

Compute shape_dist_traveled by using Shapely to measure the distance of each stop along its trip linestring. If, for a given trip, this process produces a list of cumulative distances that is not monotonically increasing, and hence incorrect, then fall back to estimating the distances as follows: get the average speed of the trip via trips_stats and use it to linearly interpolate distances from stop times. This fallback usually kicks in on trips with self-intersecting linestrings.

NOTE:

Takes about 0.75 minutes on the Portland feed. 98% of calculated ‘shape_dist_traveled’ values differ by at most 0.56 km in absolute value from the original values, and the maximum absolute difference is 6.3 km.
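
The fallback described above can be pictured with a small plain-Python sketch. It is an illustration only, not the library's implementation: the helper names timestr_to_seconds, is_increasing, and interpolate_distances are hypothetical, the sample trip is made up, and the average speed is assumed to be in km/h.

```python
def timestr_to_seconds(timestr):
    """Convert a GTFS time string 'HH:MM:SS' to seconds past midnight.
    Hours may exceed 23; they are not reduced modulo 24."""
    hours, minutes, seconds = (int(part) for part in timestr.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def is_increasing(distances):
    """Return True if the cumulative distances never decrease.
    (Assumption: equal consecutive distances are tolerated.)"""
    return all(a <= b for a, b in zip(distances, distances[1:]))


def interpolate_distances(departure_times, avg_speed_kmh):
    """Estimate cumulative distances (km) from elapsed time and a
    constant average speed (km/h)."""
    start = timestr_to_seconds(departure_times[0])
    return [
        avg_speed_kmh * (timestr_to_seconds(t) - start) / 3600
        for t in departure_times
    ]


# Hypothetical trip whose measured distances go backwards at the third stop
measured = [0.0, 4.2, 3.9, 9.8]
times = ["08:00:00", "08:15:00", "08:30:00", "09:00:00"]
if not is_increasing(measured):
    # Fall back to time-based interpolation at the trip's average speed
    measured = interpolate_distances(times, avg_speed_kmh=10.0)
```

With a constant speed the estimated distances are proportional to elapsed time, so they are increasing whenever the stop times are.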

dump_all_stats(directory, date=None, freq='1H', split_directions=False)

Into the given directory, dump to separate CSV files the outputs of

  • self.get_stops_stats(date)
  • self.get_stops_time_series(date)
  • trips_stats = self.get_trips_stats()
  • self.get_routes_stats(trips_stats, date)
  • self.get_routes_time_series(trips_stats, date)

where each time series is resampled to the given frequency. Also include a README.txt file that contains a few notes on units and include some useful charts.

If no date is given, then use the first Monday of the feed.

get_active_stops(date, timestr=None)

Return the section of self.stops that contains only stops active on the given date (datetime.date object). If a time is given in the form of a GTFS time string ‘%H:%M:%S’, then return only those stops that have a departure time at that date and time. Do not take times modulo 24.

get_active_trips(date, timestr=None)

Return the section of self.trips that contains only trips active on the given date (datetime.date object). If a time is given in the form of a GTFS time string %H:%M:%S, then return only those trips active at that date and time. Do not take times modulo 24.
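
GTFS lets a trip that runs past midnight use times beyond 24:00:00, which is why times are not taken modulo 24. The sketch below shows the effect of comparing raw GTFS times. The helper is_running_at is hypothetical, and treating a trip as active between its first and last times (inclusive) is an assumption, not necessarily how the library decides.

```python
def timestr_to_seconds(timestr):
    """Convert a GTFS time string 'HH:MM:SS' to seconds, without
    wrapping hours past 23."""
    hours, minutes, seconds = (int(part) for part in timestr.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def is_running_at(start_timestr, end_timestr, timestr):
    """Return True if timestr lies within [start, end], comparing raw
    GTFS times with no modulo-24 reduction."""
    return (
        timestr_to_seconds(start_timestr)
        <= timestr_to_seconds(timestr)
        <= timestr_to_seconds(end_timestr)
    )


# Hypothetical late-night trip: starts 23:30, ends 01:10 the next morning
late_trip = ("23:30:00", "25:10:00")
found_with_gtfs_time = is_running_at(*late_trip, "24:15:00")
found_with_clock_time = is_running_at(*late_trip, "00:15:00")
```

Here the trip is found with the time string '24:15:00' but not with '00:15:00', so a query for service after midnight has to use the GTFS-style time.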

get_busiest_date_of_first_week()

Consider the dates in self.get_first_week() and return the first date that has the maximum number of active trips.
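
The tie-breaking rule ("the first date that has the maximum") can be sketched as follows. The function busiest_date and the trip counts are hypothetical; the real method counts active trips from the feed itself.

```python
import datetime as dt


def busiest_date(num_trips_by_date):
    """Return the earliest date with the largest number of active trips."""
    dates = sorted(num_trips_by_date)
    # max() returns the first maximal item, so ties go to the earliest date
    return max(dates, key=lambda d: num_trips_by_date[d])


# Made-up counts of active trips for a few dates
counts = {
    dt.date(2014, 1, 6): 950,
    dt.date(2014, 1, 7): 1010,
    dt.date(2014, 1, 8): 1010,
    dt.date(2014, 1, 11): 600,
}
busiest = busiest_date(counts)
```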

get_dates()

Return a chronologically ordered list of dates (datetime.date objects) for which this feed is valid.

get_first_week()

Return a list of dates (datetime.date objects) of the first Monday–Sunday week for which this feed is valid. In the unlikely event that this feed does not cover a full Monday–Sunday week, then return whatever initial segment of the week it does cover.
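
A minimal sketch of this selection, assuming the input is the chronologically ordered output of get_dates(). The function first_week below is a hypothetical stand-in, and returning an empty list when the feed contains no Monday is an assumption.

```python
import datetime as dt


def first_week(dates):
    """Given chronologically ordered dates, return the dates of the first
    Monday-to-Sunday week, or the initial segment of it that is covered."""
    mondays = [d for d in dates if d.weekday() == 0]
    if not mondays:
        return []  # assumption: no Monday means no week to report
    monday = mondays[0]
    sunday = monday + dt.timedelta(days=6)
    return [d for d in dates if monday <= d <= sunday]


# Hypothetical feed valid for 20 days starting Wednesday 2014-01-01
feed_dates = [dt.date(2014, 1, 1) + dt.timedelta(days=i) for i in range(20)]
week = first_week(feed_dates)

# A feed that ends midweek yields only part of the week
short_week = first_week(feed_dates[:8])
```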

get_linestring_by_shape(use_utm=True)

Return a dictionary with structure shape_id -> Shapely linestring of shape. If self.shapes is None, then return None. If use_utm == True, then return each linestring in UTM coordinates. Otherwise, return each linestring in WGS84 longitude-latitude coordinates.

get_point_by_stop(use_utm=True)

Return a dictionary with structure stop_id -> Shapely point object. If use_utm == True, then return each point in UTM coordinates. Otherwise, return each point in WGS84 longitude-latitude coordinates.

get_routes_stats(trips_stats, date, split_directions=False, headway_start_timestr='07:00:00', headway_end_timestr='19:00:00')

Take trips_stats, which is the output of self.get_trips_stats(), cut it down to the subset S of trips that are active on the given date, and then call get_routes_stats() with S and the keyword arguments split_directions, headway_start_timestr, and headway_end_timestr.

See get_routes_stats() for a description of the output.

NOTES:

A more user-friendly version of get_routes_stats(). The latter function works without a feed, though. Takes about 0.2 minutes on the Portland feed.

get_routes_time_series(trips_stats, date, split_directions=False, freq='5Min')

Take trips_stats, which is the output of self.get_trips_stats(), cut it down to the subset S of trips that are active on the given date, and then call self.get_routes_time_series_0() with S and the given keyword arguments split_directions and freq and with date_label = utils.date_to_str(date).

See get_routes_time_series() for a description of the output.

NOTES:

A more user-friendly version of get_routes_time_series(). The latter function works without a feed, though. Takes about 0.6 minutes on the Portland feed.

get_stations_stats(date, split_directions=False, headway_start_timestr='07:00:00', headway_end_timestr='19:00:00')

If this feed has station data, that is, ‘location_type’ and ‘parent_station’ columns in self.stops, then compute the same stats that self.get_stops_stats() does, but for stations. Otherwise, return None.

NOTES:

Takes about 0.2 minutes on the Portland feed given the first five weekdays of the feed.

get_stops_activity(dates)

Return a Pandas data frame with the columns

  • stop_id
  • dates[0]: a series of ones and zeros indicating if a stop has stop times on this date (1) or not (0)
  • etc.
  • dates[-1]: ditto

If dates is None, then return None.

get_stops_in_stations()

If this feed has station data, that is, ‘location_type’ and ‘parent_station’ columns in self.stops, then return a Pandas data frame that has the same columns as self.stops but only includes stops with parent stations, that is, stops with location type 0 or blank and nonblank parent station. Otherwise, return None.

get_stops_stats(date, split_directions=False, headway_start_timestr='07:00:00', headway_end_timestr='19:00:00')

Return a Pandas data frame with the following columns:

  • stop_id
  • direction_id
  • num_vehicles: number of vehicles visiting stop
  • max_headway: maximum of the durations (in minutes) between vehicle departures at the stop between headway_start_timestr and headway_end_timestr on the given date
  • mean_headway: mean of the durations (in minutes) between vehicle departures at the stop between headway_start_timestr and headway_end_timestr on the given date
  • start_time: earliest departure time of a vehicle from this stop on the given date
  • end_time: latest departure time of a vehicle from this stop on the given date

If split_directions == False, then compute each stop’s stats using vehicles visiting it from both directions.

NOTES:

Takes about 0.7 minutes on the Portland feed.
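
The headway columns can be illustrated with a short plain-Python sketch. The function headway_stats is hypothetical, the departures are made up, and treating both ends of the headway window as inclusive is an assumption.

```python
def timestr_to_seconds(timestr):
    """Convert a GTFS time string 'HH:MM:SS' to seconds past midnight."""
    hours, minutes, seconds = (int(part) for part in timestr.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def headway_stats(departure_timestrs, start_timestr="07:00:00",
                  end_timestr="19:00:00"):
    """Return (max_headway, mean_headway) in minutes for the departures
    inside the window, or (None, None) with fewer than two departures."""
    start = timestr_to_seconds(start_timestr)
    end = timestr_to_seconds(end_timestr)
    times = sorted(
        t for t in map(timestr_to_seconds, departure_timestrs)
        if start <= t <= end
    )
    # Gaps between consecutive departures, converted to minutes
    headways = [(b - a) / 60 for a, b in zip(times, times[1:])]
    if not headways:
        return None, None
    return max(headways), sum(headways) / len(headways)


# Made-up departures at one stop; the first and last fall outside the window
departures = ["06:50:00", "07:00:00", "07:10:00", "07:40:00", "19:30:00"]
max_headway, mean_headway = headway_stats(departures)
```

Only the departures at 07:00, 07:10, and 07:40 fall inside the default window, giving headways of 10 and 30 minutes.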

get_stops_time_series(date, split_directions=False, freq='5Min')

Return a time series version of the following stops stats for the given date:

  • number of vehicles by stop ID

The time series is a Pandas data frame with a timestamp index for the 24-hour period on the given date sampled at the given frequency. The maximum allowable frequency is 1 minute.

Using a period index instead of a timestamp index would be more appropriate, but Pandas 0.14.1 doesn’t support period index frequencies at multiples of DateOffsets (e.g. ‘5Min’).

The columns of the data frame are hierarchical (multi-index) with

  • top level: name = ‘statistic’, values = [‘num_vehicles’]
  • middle level: name = ‘stop_id’, values = the active stop IDs
  • bottom level: name = ‘direction_id’, values = 0s and 1s

If split_directions == False, then don’t include the bottom level.

NOTES:

  • ‘num_vehicles’ should be resampled with how=np.sum
  • To remove the date and seconds from the time series f, do f.index = [t.time().strftime('%H:%M') for t in f.index.to_datetime()]
  • Takes about 6.15 minutes on the Portland feed given the first five weekdays of the feed.

get_trips_activity(dates)

Return a Pandas data frame with the columns

  • trip_id
  • route_id
  • direction_id
  • dates[0]: a series of ones and zeros indicating if a trip is active (1) on the given date or inactive (0)
  • etc.
  • dates[-1]: ditto

If dates is None, then return None.

get_trips_stats(get_dist_from_shapes=False)

Return a Pandas data frame with the following columns:

  • trip_id
  • direction_id
  • route_id
  • shape_id
  • start_time: first departure time of the trip
  • end_time: last departure time of the trip
  • duration: duration of the trip in hours
  • start_stop_id: stop ID of the first stop of the trip
  • end_stop_id: stop ID of the last stop of the trip
  • num_stops: number of stops on trip
  • distance: distance of the trip in kilometers; contains all np.nan entries if self.shapes is None

NOTES:

If self.stop_times has a shape_dist_traveled column and get_dist_from_shapes == False, then use that column to compute the distance column (in km). Elif self.shapes is not None, then compute the distance column using the shapes and Shapely. Otherwise, set the distances to np.nan.

Takes about 0.3 minutes on the Portland feed, which has the shape_dist_traveled column. Using get_dist_from_shapes=True on the Portland feed yields distances that differ by at most 0.75 km in absolute value from those obtained with get_dist_from_shapes=False.

get_vehicles_locations(linestring_by_shape, date, timestrs)

Return a Pandas data frame of the positions of all trips active on the given date and times. Include the columns:

  • trip_id
  • direction_id
  • route_id
  • time
  • rel_dist: number between 0 (start) and 1 (end) indicating the relative distance of the vehicle along its path
  • lon: longitude of vehicle at given time
  • lat: latitude of vehicle at given time

Requires linestring_by_shape to be the output of self.get_linestring_by_shape(use_utm=False). Assumes self.stop_times has a shape_dist_traveled column, possibly created by add_dist_to_stop_times().

NOTES:

On the Portland feed, can do 24*60 timestrings (minute frequency) in 0.4 min.
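
One way to picture rel_dist is to linearly interpolate a trip's cumulative distance between the two stop times that bracket the query time and divide by the trip's total distance. The sketch below does only that and is an assumption about the approach, not the library's code; rel_dist_at is a hypothetical helper. Longitude and latitude could then be read off the trip's linestring, for example with Shapely's interpolate(rel_dist, normalized=True).

```python
from bisect import bisect_right


def rel_dist_at(times, dists, t):
    """times: ascending stop times in seconds; dists: matching cumulative
    distances (e.g. shape_dist_traveled); t: query time in seconds.
    Return the relative distance in [0, 1], or None if the trip is not
    running at time t."""
    if not times[0] <= t <= times[-1]:
        return None
    total = dists[-1] - dists[0]
    if total == 0:
        return 0.0
    # Index of the last stop time that is <= t
    i = bisect_right(times, t) - 1
    if i == len(times) - 1:
        return 1.0
    # Linear interpolation between stop i and stop i + 1
    frac = (t - times[i]) / (times[i + 1] - times[i])
    dist = dists[i] + frac * (dists[i + 1] - dists[i])
    return (dist - dists[0]) / total


# Hypothetical trip with three stops: times in seconds, distances in km
stop_times = [0, 600, 1800]
stop_dists = [0.0, 2.0, 8.0]
halfway_through_second_leg = rel_dist_at(stop_times, stop_dists, 1200)
```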

is_active_trip(trip, date)

If the given trip (trip ID) is active on the given date (date object), then return True. Otherwise, return False. To avoid error checking in the interest of speed, assume trip is a valid trip ID in the feed and date is a valid date object.

gtfs_toolkit.feed.agg_routes_stats(routes_stats)

Given routes_stats, which is the output of get_routes_stats(), return a Pandas data frame with the following columns:

  • direction_id
  • num_trips: the sum of the corresponding column in the input across all routes
  • start_time: the minimum of the corresponding column of the input across all routes
  • end_time: the maximum of the corresponding column of the input across all routes
  • service_duration: the sum of the corresponding column in the input across all routes
  • service_distance: the sum of the corresponding column in the input across all routes
  • service_speed: service_distance/service_duration

If the input has no direction id, then the output won’t.
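
The aggregation rules listed above can be sketched on plain dictionaries instead of a data frame. The function aggregate_routes and the sample rows are hypothetical, and the min/max of the time columns assumes zero-padded 'HH:MM:SS' strings so that string comparison matches time order.

```python
def aggregate_routes(rows):
    """Aggregate per-route stats (a list of dicts) into one summary dict."""
    total_duration = sum(r["service_duration"] for r in rows)
    total_distance = sum(r["service_distance"] for r in rows)
    return {
        "num_trips": sum(r["num_trips"] for r in rows),
        # Zero-padded time strings compare correctly as plain strings
        "start_time": min(r["start_time"] for r in rows),
        "end_time": max(r["end_time"] for r in rows),
        "service_duration": total_duration,
        "service_distance": total_distance,
        # Ratio of the totals, not the mean of the per-route speeds
        "service_speed": total_distance / total_duration,
    }


# Made-up stats for two routes
rows = [
    {"route_id": "A", "num_trips": 40, "start_time": "05:30:00",
     "end_time": "23:10:00", "service_duration": 30.0,
     "service_distance": 600.0},
    {"route_id": "B", "num_trips": 20, "start_time": "06:00:00",
     "end_time": "25:05:00", "service_duration": 10.0,
     "service_distance": 300.0},
]
summary = aggregate_routes(rows)
```

In this example the per-route speeds are 20 and 30 km/h, whose plain mean is 25 km/h, while the ratio of totals is 22.5 km/h.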

gtfs_toolkit.feed.agg_routes_time_series(routes_time_series)

gtfs_toolkit.feed.combine_time_series(time_series_dict, kind, split_directions=False)

Given a dictionary of time series data frames, combine the time series into one time series data frame with multi-index (hierarchical) columns and return the result. The top level columns are the keys of the dictionary and the second and third level columns are ‘route_id’ and ‘direction_id’, if kind == 'route', or ‘stop_id’ and ‘direction_id’, if kind == 'stop'. If split_directions == False, then there is no third column level, no ‘direction_id’ column.

gtfs_toolkit.feed.downsample(time_series, freq)

Downsample the given route or stop time series, which is the output of Feed.get_routes_time_series() or Feed.get_stops_time_series(), to the given Pandas-style frequency. Can’t downsample to intervals shorter than one minute (‘1Min’), because the time series are generated with one-minute frequency.

gtfs_toolkit.feed.get_routes_stats(trips_stats_subset, split_directions=False, headway_start_timestr='07:00:00', headway_end_timestr='19:00:00')

Given a subset of the output of Feed.get_trips_stats(), calculate stats for the routes in that subset.

Return a Pandas data frame with the following columns:

  • route_id
  • direction_id
  • num_trips: mean daily number of trips
  • start_time: start time of the earliest active trip on the route
  • end_time: end time of latest active trip on the route
  • max_headway: maximum of the durations (in minutes) between trip starts on the route between headway_start_timestr and headway_end_timestr on the given dates
  • mean_headway: mean of the durations (in minutes) between trip starts on the route between headway_start_timestr and headway_end_timestr on the given dates
  • service_duration: total of the duration of each trip on the route in the given subset of trips; measured in hours
  • service_distance: total of the distance traveled by each trip on the route in the given subset of trips; measured in kilometers; contains all np.nan entries if self.shapes is None
  • service_speed: service_distance/service_duration; measured in kilometers per hour

If split_directions == False, then remove the direction_id column and compute each route’s stats, except for headways, using its trips running in both directions. In this case, (1) compute max headway by taking the max of the max headways in both directions; (2) compute mean headway by taking the weighted mean of the mean headways in both directions.

NOTES:

Takes about 0.2 minutes on the Portland feed given the first five weekdays of the feed.
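
The headway rule for split_directions == False can be sketched as below. The documentation does not say which weights the weighted mean uses, so weighting each direction by its number of trips is purely an assumption here, and combine_directions is a hypothetical helper.

```python
def combine_directions(stats_0, stats_1):
    """Combine the headway stats (in minutes) of a route's two directions.
    Assumption: each direction's mean headway is weighted by its number
    of trips."""
    w0, w1 = stats_0["num_trips"], stats_1["num_trips"]
    return {
        "max_headway": max(stats_0["max_headway"], stats_1["max_headway"]),
        "mean_headway": (
            w0 * stats_0["mean_headway"] + w1 * stats_1["mean_headway"]
        ) / (w0 + w1),
    }


# Made-up stats for the two directions of one route
direction_0 = {"num_trips": 30, "max_headway": 20.0, "mean_headway": 12.0}
direction_1 = {"num_trips": 10, "max_headway": 30.0, "mean_headway": 20.0}
combined = combine_directions(direction_0, direction_1)
```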

gtfs_toolkit.feed.get_routes_time_series(trips_stats_subset, split_directions=False, freq='5Min', date_label='2001-01-01')

Given a subset of the output of Feed.get_trips_stats(), calculate time series for the routes in that subset.

Return a time series version of the following route stats:

  • number of vehicles in service by route ID
  • number of trip starts by route ID
  • service duration in hours by route ID
  • service distance in kilometers by route ID
  • service speed in kilometers per hour

The time series is a Pandas data frame with a timestamp index for a 24-hour period sampled at the given frequency. The maximum allowable frequency is 1 minute. date_label is used as the date for the timestamp index.

Using a period index instead of a timestamp index would be more appropriate, but Pandas 0.14.1 doesn’t support period index frequencies at multiples of DateOffsets (e.g. ‘5Min’).

The columns of the data frame are hierarchical (multi-index) with

  • top level: name = ‘statistic’, values = [‘service_distance’, ‘service_duration’, ‘num_trip_starts’, ‘num_vehicles’, ‘service_speed’]
  • middle level: name = ‘route_id’, values = the active routes
  • bottom level: name = ‘direction_id’, values = 0s and 1s

If split_directions == False, then don’t include the bottom level.

NOTES:

  • To resample the resulting time series use the following methods:
    • for ‘num_vehicles’ series, use how=np.mean
    • for the other series, use how=np.sum
    • ‘service_speed’ can’t be resampled and must be recalculated from ‘service_distance’ and ‘service_duration’
  • To remove the date and seconds from the time series f, do f.index = [t.time().strftime('%H:%M') for t in f.index.to_datetime()]
  • Takes about 0.6 minutes on the Portland feed given the first five weekdays of the feed.
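
The resampling rules in the notes above can be illustrated without Pandas. The sketch below groups one-minute values into three-minute bins using plain lists. The helper downsample_values and the sample numbers are hypothetical; the real time series are data frames resampled through Pandas.

```python
def downsample_values(series, factor, how):
    """Group consecutive values into chunks of `factor` and aggregate
    each chunk with the function `how`."""
    return [how(series[i:i + factor]) for i in range(0, len(series), factor)]


def mean(values):
    return sum(values) / len(values)


# Made-up one-minute values for a single route over six minutes
num_vehicles = [2, 2, 3, 3, 3, 2]
service_duration = [n / 60 for n in num_vehicles]   # vehicle-hours per minute
service_distance = [0.6, 0.6, 0.9, 0.9, 0.9, 0.6]   # kilometers per minute

# num_vehicles is averaged; distance and duration are summed
vehicles_3min = downsample_values(num_vehicles, 3, mean)
distance_3min = downsample_values(service_distance, 3, sum)
duration_3min = downsample_values(service_duration, 3, sum)

# service_speed is recomputed from the resampled totals, not resampled itself
speed_3min = [d / h for d, h in zip(distance_3min, duration_3min)]
```

Summing num_vehicles would overcount vehicles that stay in service across several minutes, which is why that series is averaged instead.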

gtfs_toolkit.feed.plot_headways(stats, max_headway_limit=60)

Given a stops or routes stats data frame, return bar charts of the max and mean headways as a Matplotlib figure. Only include the stops/routes with max headways at most max_headway_limit minutes. If max_headway_limit is None, then include them all in a giant plot. If there are no stops/routes within the max headway limit, then return None.

NOTES:

Take the resulting figure f and do f.tight_layout() for a nice-looking plot.

gtfs_toolkit.feed.plot_routes_time_series(routes_time_series)

Given a routes time series data frame, sum each time series statistic over all routes, plot each series statistic using Matplotlib, and return the resulting figure of subplots.

NOTES:

Take the resulting figure f and do f.tight_layout() for a nice-looking plot.
