The feed Module

Some tools for computing stats from a GTFS feed, assuming the feed is valid.

All time estimates below were produced on a 2013 MacBook Pro with a 2.8 GHz Intel Core i7 processor and 16 GB of RAM running OS X 10.9.2.

class gtfs_toolkit.feed.Feed(path)

Bases: builtins.object

A class to gather all the GTFS files for a feed and store them in memory as Pandas data frames. Make sure you have enough memory! The stop times object can be big.

dump_all_stats(directory, dates=None, freq='1H')

Into the given directory, dump to separate CSV files the outputs of

  • self.get_stops_stats(dates)
  • self.get_stops_time_series(dates)
  • trips_stats = self.get_trips_stats()
  • self.get_routes_stats(trips_stats, dates)
  • self.get_routes_time_series(trips_stats, dates)

where each time series is resampled to the given frequency. Also write a README.txt file containing a few notes on units, and include some useful charts.

If no dates are given, then use self.get_first_week()[:5].

NOTES:

Takes about 15 minutes on the SEQ feed.

get_dates()

Return a chronologically ordered list of dates (datetime.date objects) for which this feed is valid.

get_first_week()

Return a list of dates (datetime.date objects) of the first Monday–Sunday week for which this feed is valid. In the unlikely event that this feed does not cover a full Monday–Sunday week, then return whatever initial segment of the week it does cover.
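The week selection described above can be sketched in plain Python. This is a minimal illustration, not the module's implementation: the function name first_week is hypothetical, and returning an empty list when the feed contains no Monday is an assumption.

```python
import datetime as dt

def first_week(dates):
    """Return the dates of the first Monday-Sunday week found in `dates`.

    If the feed ends before the following Sunday, only the covered
    initial segment of the week is returned.
    """
    dates = sorted(dates)
    for i, d in enumerate(dates):
        if d.weekday() == 0:  # Monday
            sunday = d + dt.timedelta(days=6)
            return [x for x in dates[i:] if x <= sunday]
    return []  # assumption: no Monday in the feed means nothing to report

# A feed valid for all of January 2014 (1 January 2014 was a Wednesday).
feed_dates = [dt.date(2014, 1, 1) + dt.timedelta(days=k) for k in range(31)]
week = first_week(feed_dates)
weekdays_only = week[:5]  # Monday to Friday, as used by dump_all_stats()
```

Slicing the result with [:5] gives the Monday to Friday dates that dump_all_stats() falls back on when no dates are supplied.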

get_linestring_by_shape()

Return a dictionary with structure shape_id -> Shapely linestring of shape in UTM coordinates. If self.shapes is None, then return None.

get_routes_stats(trips_stats, dates, split_directions=True)

Take trips_stats, which is the output of self.get_trips_stats(), and use it to calculate stats for all the routes in this feed averaged over the given dates (list of datetime.date objects).

Return a Pandas data frame with the following columns

  • route_id: route ID
  • mean_daily_num_trips
  • min_start_time: start time of the earliest active trip on the route
  • max_end_time: end time of latest active trip on the route
  • max_headway: maximum of the durations (in seconds) between trip starts on the route between 07:00 and 19:00 on the given dates
  • mean_headway: mean of the durations (in seconds) between trip starts on the route between 07:00 and 19:00 on the given dates
  • mean_daily_duration: in seconds
  • mean_daily_distance: in meters; contains all np.nan entries if self.shapes is None

If split_directions == True, then add an extra column

  • direction_id: 0 or 1,

and separate the stats above by the direction ID of the trips on each route.

NOTES:

Takes about 0.2 minutes on the SEQ feed for 5 dates.
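To make the headway columns concrete, here is a minimal sketch of how max_headway and mean_headway could be computed for one route, one direction, and one date. The function name headway_stats is hypothetical, and treating the 07:00 to 19:00 window as inclusive at both ends is an assumption; the module may handle the boundaries and multiple dates differently.

```python
def headway_stats(start_times, window=(7 * 3600, 19 * 3600)):
    """Return (max_headway, mean_headway) in seconds.

    `start_times` holds trip start times in seconds past midnight.
    Only starts inside `window` are considered (assumed inclusive).
    """
    lo, hi = window
    starts = sorted(t for t in start_times if lo <= t <= hi)
    # Durations between consecutive trip starts
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    if not gaps:
        return None, None  # fewer than two trip starts in the window
    return max(gaps), sum(gaps) / len(gaps)

# Trip starts at 06:30, 07:00, 07:20, 08:00, and 19:30
starts = [23400, 25200, 26400, 28800, 70200]
max_headway, mean_headway = headway_stats(starts)
```

Here only the 07:00, 07:20, and 08:00 starts fall inside the window, so the gaps are 1200 and 2400 seconds.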

get_routes_time_series(trips_stats, dates)

Given trips_stats, which is the output of self.get_trips_stats(), use it to calculate the following four time series of routes stats:

  • mean daily number of vehicles in service by route ID
  • mean daily number of trip starts by route ID
  • mean daily service duration (seconds) by route ID
  • mean daily service distance (meters) by route ID

Each time series is a Pandas data frame over a 24-hour period with minute (period index) frequency (00:00 to 23:59).

Return the time series as values of a dictionary with keys ‘mean_daily_num_vehicles’, ‘mean_daily_num_trip_starts’, ‘mean_daily_duration’, ‘mean_daily_distance’.

NOTES:

  • To resample the resulting time series use the following methods:
    • for ‘mean_daily_num_vehicles’ series, use how=np.mean
    • for the other series, use how=np.sum
  • To remove the placeholder date (2001-1-1) and seconds from any of the time series f, do f.index = [t.time().strftime('%H:%M') for t in f.index.to_datetime()]

  • Takes about 1.5 minutes on the SEQ feed.
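The resampling advice above follows from what each series measures: the number of vehicles in service is a level, so coarser bins should average it, whereas trip starts, duration, and distance are amounts, so coarser bins should add them up. The sketch below illustrates this with plain Python lists rather than Pandas; all names in it are hypothetical and it is not the module's implementation. Note also that recent Pandas releases no longer accept the how keyword, and the equivalent spelling there is f.resample(freq).mean() or f.resample(freq).sum().

```python
MINUTES_PER_DAY = 24 * 60

def minute_series(trips):
    """Build per-minute series from (start_minute, end_minute) pairs.

    Returns (vehicles, starts): the number of vehicles in service and
    the number of trip starts for each minute of the day.
    """
    vehicles = [0] * MINUTES_PER_DAY
    starts = [0] * MINUTES_PER_DAY
    for start, end in trips:
        starts[start] += 1
        # A vehicle is in service from its start up to, not including, its end
        for minute in range(start, min(end, MINUTES_PER_DAY)):
            vehicles[minute] += 1
    return vehicles, starts

def average(values):
    return sum(values) / len(values)

def resample_hourly(series, how):
    """Aggregate a per-minute series into 24 hourly bins using `how`."""
    return [how(series[60 * h:60 * (h + 1)]) for h in range(24)]

# Two trips: 07:00-07:30 and 07:15-08:15
trips = [(7 * 60, 7 * 60 + 30), (7 * 60 + 15, 8 * 60 + 15)]
vehicles, starts = minute_series(trips)
hourly_vehicles = resample_hourly(vehicles, average)  # a level: take the mean
hourly_starts = resample_hourly(starts, sum)          # a count: take the sum
```

Summing the vehicles series instead would report 75 vehicle-minutes for the 07:00 hour, which is not a vehicle count.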

get_stations_stats(dates, split_directions=False)

Assuming this feed has station data, that is, ‘location_type’ and ‘parent_station’ columns in self.stops, then compute the same stats that self.get_stops_stats() does, but for stations.

get_stops_activity(dates)

Return a Pandas data frame with the columns

  • stop_id
  • dates[0]: a series of ones and zeros indicating if a stop has stop times on this date (1) or not (0)
  • ...
  • dates[-1]: ditto

If dates is None, then return None.

get_stops_in_stations()

Assuming this feed has station data, that is, ‘location_type’ and ‘parent_station’ columns in self.stops, then return a Pandas data frame that has the same columns as self.stops but only includes stops with parent stations, that is, stops with location type 0 or blank and nonblank parent station.
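The filter described above can be sketched as follows, using a list of dictionaries in place of the self.stops data frame. The sample rows and the function name stops_in_stations are made up for illustration, and blank values are assumed to be empty strings.

```python
stops = [
    {"stop_id": "S1", "location_type": "1", "parent_station": ""},    # a station
    {"stop_id": "P1", "location_type": "0", "parent_station": "S1"},  # stop in S1
    {"stop_id": "P2", "location_type": "", "parent_station": "S1"},   # blank type
    {"stop_id": "X1", "location_type": "0", "parent_station": ""},    # standalone
]

def stops_in_stations(stops):
    """Keep stops with location type 0 or blank and a nonblank parent station."""
    return [
        s for s in stops
        if s["location_type"] in ("", "0") and s["parent_station"] != ""
    ]

selected_ids = [s["stop_id"] for s in stops_in_stations(stops)]
```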

get_stops_stats(dates, split_directions=True)

Return a Pandas data frame with the following columns:

  • stop_id
  • mean_daily_num_vehicles: mean daily number of vehicles visiting stop
  • max_headway: maximum of the durations (in seconds) between vehicle departures at the stop between 07:00 and 19:00 on the given dates
  • mean_headway: mean of the durations (in seconds) between vehicle departures at the stop between 07:00 and 19:00 on the given dates
  • min_start_time: earliest departure time of a vehicle from this stop over the given date range
  • max_end_time: latest departure time of a vehicle from this stop over the given date range

If split_directions == True, then add an extra column

  • direction_id: 0 or 1,

and separate the stats above by the direction ID of the trips visiting each stop. So each stop_id will have two rows.

NOTES:

Takes about 0.9 minutes for the SEQ feed.

get_stops_time_series(dates)

Return the following time series of stops stats:

  • mean daily number of vehicles by stop ID

The time series is a Pandas data frame over a 24-hour period with minute (period index) frequency (00:00 to 23:59).

Return the time series as a value in a dictionary with key ‘mean_daily_num_vehicles’. (Outputting a dictionary of a time series instead of simply a time series matches the structure of get_routes_time_series() and allows for the possibility of adding other stops time series at a later stage of development.)

NOTES:

  • To resample the resulting time series use the following methods: for ‘mean_daily_num_vehicles’ series, use how=np.sum
  • To remove the placeholder date (2001-1-1) and seconds from any of the time series f, do f.index = [t.time().strftime('%H:%M') for t in f.index.to_datetime()]
  • Takes about 2 minutes on the SEQ feed.

get_trips_activity(dates)

Return a Pandas data frame with the columns

  • trip_id
  • route_id
  • direction_id
  • dates[0]: a series of ones and zeros indicating if a trip is active (1) on the given date or inactive (0)
  • ...
  • dates[-1]: ditto

If dates is None, then return None.

NOTES:

Takes about 0.15 minutes on the SEQ feed for 7 dates.

get_trips_stats()

Return a Pandas data frame with the following columns:

  • trip_id
  • direction_id
  • route_id
  • start_time: first departure time of the trip
  • end_time: last departure time of the trip
  • start_stop_id: stop ID of the first stop of the trip
  • end_stop_id: stop ID of the last stop of the trip
  • duration: duration of the trip (seconds)
  • distance: distance of the trip (meters); contains all np.nan entries if self.shapes is None

NOTES:

Takes about 2.4 minutes on the SEQ feed.
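As an illustration of the start_time, end_time, and duration columns, the sketch below summarizes a single trip from its stop times. It relies on the GTFS convention that times are HH:MM:SS strings that may exceed 24:00:00 for trips running past midnight. The function names and the tuple layout are hypothetical, and this is not the module's implementation.

```python
def gtfs_time_to_seconds(timestr):
    """Convert a GTFS 'HH:MM:SS' string to seconds past midnight.

    Hours of 24 or more are valid in GTFS, so datetime.time cannot be used.
    """
    hours, minutes, seconds = (int(x) for x in timestr.split(":"))
    return 3600 * hours + 60 * minutes + seconds

def trip_summary(stop_times):
    """Summarize one trip.

    `stop_times` is a list of (stop_sequence, stop_id, departure_time)
    tuples for a single trip, in any order.
    """
    ordered = sorted(stop_times)  # sort by stop_sequence
    _, first_stop, first_departure = ordered[0]
    _, last_stop, last_departure = ordered[-1]
    return {
        "start_time": first_departure,
        "end_time": last_departure,
        "start_stop_id": first_stop,
        "end_stop_id": last_stop,
        "duration": gtfs_time_to_seconds(last_departure)
        - gtfs_time_to_seconds(first_departure),
    }

# A trip that crosses midnight
summary = trip_summary([
    (2, "B", "24:05:00"),
    (1, "A", "23:50:00"),
    (3, "C", "24:20:30"),
])
```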

get_xy_by_stop()

Return a dictionary with structure stop_id -> stop location as a UTM coordinate pair.

is_active_trip(trip, date)

If the given trip (trip ID) is active on the given date (date object), then return True. Otherwise, return False. To avoid error checking in the interest of speed, assume trip is a valid trip ID in the feed and date is a valid date object.
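For readers unfamiliar with how GTFS encodes service days, here is a minimal sketch of an activity check. It follows the GTFS convention that calendar.txt gives weekday flags and a date range per service ID, while calendar_dates.txt lists exceptions (exception_type 1 adds service, 2 removes it). The dictionary layouts and the sample data are assumptions for illustration, and the module's own implementation may differ.

```python
import datetime as dt

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def is_active_trip(trip, date, trip_service, calendar, calendar_dates):
    """Return True if `trip` runs on `date`, and False otherwise."""
    service_id = trip_service[trip]
    # Exceptions in calendar_dates override the regular calendar
    exception = calendar_dates.get((service_id, date))
    if exception == 1:
        return True
    if exception == 2:
        return False
    row = calendar.get(service_id)
    if row is None:
        return False
    in_range = row["start_date"] <= date <= row["end_date"]
    return in_range and row[WEEKDAYS[date.weekday()]] == 1

# Hypothetical weekday service for 2014 with two exceptions
trip_service = {"T1": "wk"}
calendar = {"wk": {
    "monday": 1, "tuesday": 1, "wednesday": 1, "thursday": 1, "friday": 1,
    "saturday": 0, "sunday": 0,
    "start_date": dt.date(2014, 1, 1), "end_date": dt.date(2014, 12, 31),
}}
calendar_dates = {
    ("wk", dt.date(2014, 1, 27)): 2,  # a Monday with service removed
    ("wk", dt.date(2014, 1, 25)): 1,  # a Saturday with service added
}

def check(date):
    return is_active_trip("T1", date, trip_service, calendar, calendar_dates)
```

Like the method documented above, this sketch does no validation: an unknown trip ID raises a KeyError.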
