Ride Aggregation Efficiency
Alan Parker Lue
26 February 2018
Abstract. I propose a method to quantify the notion of efficiency when aggregating rides in ridesharing settings, where passengers on two or more distinct trips may occupy the same vehicle at the same time. Given a set of trips that overlap in time but with potentially different origins and destinations, a measure of ride aggregation efficiency can help ridesharing system operators decide how best to allocate riders to vehicles.
In the course of fulfilling ride requests in a ridesharing system, riders can often be efficiently allocated to vehicles by identifying the sets of rides with the greatest space–time overlap. Rides with high overlap have high ride aggregation efficiency, meaning that such rides tend to travel in the same direction and occur at similar times. In general there are two approaches:
- Ex ante heuristics
- Ex post measurement
Ex post measurement (i.e., of actual or simulated aggregated ride outcomes) simultaneously measures both pure ride aggregation efficiency and the effectiveness of the ride routing algorithm itself—e.g., the rides in a zone may be intrinsically highly aggregable, but an ineffective routing algorithm could lead to very low aggregation. If we assume that the algorithm is efficient, then we could consider the ratio of passenger-minutes to driver-minutes:
\begin{equation} f(t_0, T) = \frac{\sum_{i=1}^n \int_{t_0}^T p_i(t)\,dt}{\sum_{i=1}^n \int_{t_0}^T \mathbb{1}_i(t)\,dt}, \end{equation}where \( p_i(t) \) is the number of passengers in vehicle \( i \) at time \( t \), \( \mathbb{1}_i(t) \) is an indicator function for whether vehicle \( i \) is driving at least one passenger at time \( t \), and \( n = n(t_0, T) \) is the number of vehicles operating during the interval \( (t_0, T) \). This ratio equals \( 1 \) for completely independent trips and grows large when many rides aggregate into few vehicles, so higher values indicate greater efficiency. We avoid assigning high efficiency to perverse outcomes (e.g., a rider suffering a long detour) by assuming that the ride‐routing algorithm itself is efficient.
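As a minimal sketch of this ratio, suppose we log per-minute passenger counts for each vehicle over a window (the array name and values here are hypothetical):

```python
import numpy as np

# Hypothetical per-minute occupancy counts for three vehicles over a
# 10-minute window; 0 means the vehicle carries no passenger that minute.
occupancy = np.array([
    [1, 1, 2, 2, 2, 1, 1, 0, 0, 0],  # vehicle 1
    [0, 0, 1, 1, 1, 1, 0, 0, 0, 0],  # vehicle 2
    [2, 2, 2, 3, 3, 3, 2, 2, 1, 1],  # vehicle 3
])

passenger_minutes = occupancy.sum()     # discretized sum_i int p_i(t) dt
driver_minutes = (occupancy > 0).sum()  # discretized sum_i int 1_i(t) dt
ratio = passenger_minutes / driver_minutes
print(ratio)  # exceeds 1 whenever rides share vehicles
```

Here vehicle 3 frequently carries multiple parties at once, so the ratio exceeds 1.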
In the absence of a particular ride‐routing algorithm, I propose measuring ride aggregation efficiency with an ex ante heuristic that uses publicly available data on unaggregated trips.
1 Ex ante heuristic for ride aggregation efficiency
I propose a ride aggregation efficiency metric that yields a nonnegative number whose magnitude summarizes how many times a trip can be grouped with other trips, accounting for trip duration. For example,
- a trip whose aggregation efficiency is 0 is unaggregable, whereas
- a trip whose aggregation efficiency is 4 can group its trajectory with nearby trips four times over.
For zones, aggregation efficiency in a given time window is the average of the aggregation efficiencies of the trips within the zone and time window.
I measure ride aggregation efficiency by considering fields of aggregation compatibility around the trajectories of past taxi trips and by quantifying the amount of overlap in those fields. The core idea is that if there is a large amount of overlap between the aggregation compatibility fields of two taxi trips, then the trips can be easily aggregated.
I make the following assumptions:
- Vehicles travel at constant velocity in straight-line trajectories from their origins to their destinations.
- Aggregation compatibility is a decreasing function of straight-line distance between two vehicles.
Based on these assumptions, I posit the following model and efficiency metrics:
- A vehicle's aggregation compatibility field at each point in time can be represented by a bivariate Gaussian distribution.
- Over the course of their trajectories, the amount of overlap between the fields of two vehicles measures their mutual aggregation compatibility.
- I calculate the aggregation compatibility of two vehicle trajectories (i.e., trips) by integrating the overlap of their aggregation compatibility fields across space and time.
- I calculate the trip aggregation efficiency by (1) summing the pairwise aggregation compatibilities of that trip with all other trips in a given time window and zone and (2) dividing by the duration of the trip.
- I calculate the zone aggregation efficiency by taking the mean of the zone's trip aggregation efficiencies in a given time window.
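Steps (1) and (2) of the trip and zone calculations above can be sketched with a hypothetical symmetric matrix of pairwise compatibilities (all values illustrative; computing the compatibilities themselves requires the overlap integral described next):

```python
import numpy as np

# Hypothetical pairwise aggregation compatibilities for three trips;
# compat[i, j] is the compatibility of trips i and j (zero on the diagonal).
compat = np.array([
    [0.0, 3.2, 0.4],
    [3.2, 0.0, 1.1],
    [0.4, 1.1, 0.0],
])
durations = np.array([4.0, 5.0, 3.0])  # trip durations in minutes

# Step (1): sum each trip's compatibilities; step (2): divide by duration.
trip_aggeff = compat.sum(axis=1) / durations
# Zone efficiency: mean of the trip efficiencies in the window.
zone_aggeff = trip_aggeff.mean()
print(trip_aggeff, zone_aggeff)
```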
The core calculation is the integration of aggregation compatibility overlap across space and time:
\begin{equation} \int_{t_0}^T \int_{x_0}^X \int_{y_0}^Y \mathbb{1}(x, y; \mu_1(t), n\Sigma)\, \mathbb{1}(x, y; \mu_2(t), n\Sigma)\, \frac{c}{\sqrt{(2\pi)^2 |\Sigma|}}\, f(x, y, t) \,dy\,dx\,dt, \end{equation}where \begin{equation} f(x, y, t) = \min{\left(\exp{\left\{-\frac{1}{2}[(x, y) - \mu_1(t)]^T \Sigma^{-1}[(x, y) - \mu_1(t)]\right\}}, \exp{\left\{-\frac{1}{2}[(x, y) - \mu_2(t)]^T \Sigma^{-1}[(x, y) - \mu_2(t)]\right\}}\right)}, \end{equation}\( t \) is time, \( x \) is longitude, \( y \) is latitude, \( \mathbb{1}(x, y; \mu_i(t), n\Sigma) \) indicates whether \( (x, y) \) lies within the truncation region around \( \mu_i(t) \), \( \mu_i(t) \) is the position of vehicle \( i \) at time \( t \), \( \Sigma \) is the covariance matrix for aggregation compatibility, \( n \) is the number of standard deviations from the mean at which the distribution is truncated, and \( c \) is the normalizing constant for the truncated bivariate normal distribution.
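A Riemann-sum approximation of the spatial part of this integral for a single time slice, assuming an isotropic \( \Sigma \) (all positions and parameter values here are hypothetical), illustrates the computation; by construction the result is 1 when the two vehicles coincide and decays as they separate:

```python
import numpy as np

sigma = 0.0027                        # compatibility std dev, degrees
Sigma = np.diag([sigma**2, sigma**2])  # isotropic covariance (assumption)
n = 3                                  # truncate at n standard deviations
# Normalizer c: a bivariate normal puts mass 1 - exp(-n^2/2) within
# n std devs, so c rescales the truncated density to integrate to 1.
c = 1.0 / (1.0 - np.exp(-n**2 / 2.0))

mu1 = np.array([-73.9276, 40.7759])  # (lon, lat) of vehicle 1 at time t
mu2 = np.array([-73.9269, 40.7757])  # (lon, lat) of vehicle 2 at time t

xs = np.linspace(min(mu1[0], mu2[0]) - n * sigma,
                 max(mu1[0], mu2[0]) + n * sigma, 400)
ys = np.linspace(min(mu1[1], mu2[1]) - n * sigma,
                 max(mu1[1], mu2[1]) + n * sigma, 400)
X, Y = np.meshgrid(xs, ys)

def mahalanobis_sq(mu):
    # [(x, y) - mu]^T Sigma^{-1} [(x, y) - mu] in the isotropic case
    return ((X - mu[0])**2 + (Y - mu[1])**2) / sigma**2

q1, q2 = mahalanobis_sq(mu1), mahalanobis_sq(mu2)
inside = (q1 <= n**2) & (q2 <= n**2)          # both truncation indicators
height = np.minimum(np.exp(-q1 / 2), np.exp(-q2 / 2))
density = c / np.sqrt((2 * np.pi)**2 * np.linalg.det(Sigma))
overlap = (inside * height).sum() * density * (xs[1] - xs[0]) * (ys[1] - ys[0])
print(overlap)  # near 1 for nearly coincident positions, near 0 when far apart
```

The full metric repeats this spatial sum at each time step as the two vehicles move along their trajectories, accumulating the result over \( (t_0, T) \).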
In order to operationalize this model, I implement the following features:
- Truncated bivariate normal distributions to represent aggregation compatibility (see classes TruncBiNorm and TruncBiNormOverlap in taxi.py)
- Use of a trip's pick-up time, origin, and destination to instantiate a trip object with the corresponding trajectory and aggregation compatibility field (see class Trip in taxi.py)
This model has three parameters that impact the efficiency calculation for a given set of taxi trip data:
- Vehicle velocity. Assume 15 km/h (about 9.3 mph), or 250 m/min.
- 0.00225 lat/min
- 0.00296 lon/min
- Standard deviation of aggregation compatibility. Assume 300 m (just over 1 Manhattan avenue).
- 0.00270 lat
- 0.00355 lon
- Limits of bivariate Gaussian truncation. Assume 3 standard deviations.
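The degree conversions above can be sanity-checked assuming roughly 111 km per degree of latitude and scaling longitude by the cosine of Astoria's latitude; small differences from the figures above likely reflect a slightly different reference latitude or meters-per-degree constant:

```python
import math

M_PER_DEG_LAT = 111_000.0  # approximate meters per degree of latitude
lat = 40.77                # Astoria's latitude (assumption)
m_per_deg_lon = M_PER_DEG_LAT * math.cos(math.radians(lat))

speed_m_per_min = 250.0    # 15 km/h
sigma_m = 300.0            # aggregation-compatibility std dev

print(round(speed_m_per_min / M_PER_DEG_LAT, 5))  # ~0.00225 lat/min
print(round(speed_m_per_min / m_per_deg_lon, 5))  # ~0.00297 lon/min
print(round(sigma_m / M_PER_DEG_LAT, 5))          # ~0.00270 lat
print(round(sigma_m / m_per_deg_lon, 5))          # ~0.00357 lon
```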
1.1 Example: (Roughly) overlapping rides
import numpy as np
import pandas as pd
from bokeh.io import output_notebook  # inline plotting in the notebook

import taxi

COLNAMES = [
    'pick_time',
    'drop_time',
    'pick_lat',
    'pick_lon',
    'drop_lat',
    'drop_lon',
    'n_passengers',
]
Consider two rides in Astoria, NY.
trips = pd.DataFrame(
    [
        (pd.Timestamp('2016-06-15 12:00:00'), np.nan,
         40.7759, -73.9276, 40.7665, -73.9217, 1),
        (pd.Timestamp('2016-06-15 12:00:03'), np.nan,
         40.7757, -73.9269, 40.7673, -73.9209, 1),
    ],
    columns=COLNAMES,
)
output_notebook()
taxi.plot_zones_and_trips(['Astoria'], trips)
These two rides have high mutual aggregation efficiencies, as they are close in both space and time.
zae = taxi.zone_aggeffs(trips)
print('Trip aggregation efficiencies:', zae)
print('Mean efficiency for zone: {}'.format(np.mean(zae)))

Trip aggregation efficiencies: [0.81878319 0.8921752 ]
Mean efficiency for zone: 0.8554791973249489
However, if the second ride starts three minutes later, their aggregation efficiencies fall.
trips = pd.DataFrame(
    [
        (pd.Timestamp('2016-06-15 12:00:00'), np.nan,
         40.7759, -73.9276, 40.7665, -73.9217, 1),
        (pd.Timestamp('2016-06-15 12:03:03'), np.nan,
         40.7757, -73.9269, 40.7673, -73.9209, 1),
    ],
    columns=COLNAMES,
)
zae = taxi.zone_aggeffs(trips)
print('Trip aggregation efficiencies:', zae)
print('Mean efficiency for zone: {}'.format(np.mean(zae)))

Trip aggregation efficiencies: [0.07696682 0.08386577]
Mean efficiency for zone: 0.08041629551941824
1.2 Example: Multiple partially overlapping rides
For another example, the aggregation efficiency of the second of the following three trips reflects its space–time overlap with the first and third rides.
trips = pd.DataFrame(
    [
        (pd.Timestamp('2016-06-15 12:00:00'), np.nan,
         40.7759, -73.9276, 40.7665, -73.9217, 1),
        (pd.Timestamp('2016-06-15 12:01:00'), np.nan,
         40.7710, -73.9240, 40.7630, -73.9190, 1),
        (pd.Timestamp('2016-06-15 12:02:00'), np.nan,
         40.7651, -73.9200, 40.7608, -73.9170, 1),
    ],
    columns=COLNAMES,
)
output_notebook()
taxi.plot_zones_and_trips(['Astoria'], trips)

zae = taxi.zone_aggeffs(trips)
print('Trip aggregation efficiencies:', zae)
print('Mean efficiency for zone: {}'.format(np.mean(zae)))

Trip aggregation efficiencies: [0.48252141 0.71270788 0.55640971]
Mean efficiency for zone: 0.583879668259483
1.3 Example: Unaggregable rides
Finally, unaggregable rides do not impact the aggregation efficiencies of other trips, but they do reduce the zone-level aggregation efficiency.
trips = pd.DataFrame(
    [
        (pd.Timestamp('2016-06-15 12:00:00'), np.nan,
         40.7759, -73.9276, 40.7665, -73.9217, 1),
        (pd.Timestamp('2016-06-15 12:01:00'), np.nan,
         40.7710, -73.9240, 40.7630, -73.9190, 1),
        (pd.Timestamp('2016-06-15 12:02:00'), np.nan,
         40.7651, -73.9200, 40.7608, -73.9170, 1),
        (pd.Timestamp('2016-06-15 12:02:00'), np.nan,
         40.7670, -73.9100, 40.7600, -73.9100, 1),
    ],
    columns=COLNAMES,
)
output_notebook()
taxi.plot_zones_and_trips(['Astoria'], trips)

zae = taxi.zone_aggeffs(trips)
print('Trip aggregation efficiencies:', zae)
print('Mean efficiency for zone: {}'.format(np.mean(zae)))

Trip aggregation efficiencies: [0.48252141 0.71270788 0.55640971 0.        ]
Mean efficiency for zone: 0.4379097511946123
2 Simulation and analysis
I calculate aggregation efficiencies for individual trips across various zones and time windows. I take Midtown to represent Manhattan and examine the following zone combinations:
- Astoria
- Astoria–Midtown
- Astoria–Midtown–LGA
- UES
- UES–Midtown
I perform the calculation over six different times of the week:
- Weekday morning (Wednesday starting at 08:00)
- Weekday afternoon (Wednesday starting at 12:00)
- Weekday evening (Wednesday starting at 18:00)
- Weekend morning (Saturday starting at 08:00)
- Weekend afternoon (Saturday starting at 12:00)
- Weekend evening (Saturday starting at 18:00)
Because the aggregation efficiency calculation is computationally intensive, I use time windows ranging from 1 to 120 minutes to limit the number of trips to a manageable level. For each time window with more than 30 trips, I randomly sample 30 trips for the aggregation efficiency calculation. The table below shows the zone‐level aggregation efficiencies.
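The windowing-and-capping step might be sketched as follows (the function name is an assumption; the column layout mirrors COLNAMES from the examples):

```python
import pandas as pd

def sample_window(trips, start, minutes, cap=30, seed=0):
    """Trips picked up in [start, start + minutes), capped at `cap` rows."""
    end = start + pd.Timedelta(minutes=minutes)
    window = trips[(trips['pick_time'] >= start) & (trips['pick_time'] < end)]
    if len(window) > cap:
        # Randomly downsample busy windows to keep the pairwise
        # compatibility calculation tractable.
        window = window.sample(n=cap, random_state=seed)
    return window
```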
import pickle

# Load aggregation efficiency calculation layout
ae_layout = pd.read_csv('aggeff_layout.csv')

# Load trip data and aggregation efficiency results
trips, zae = [], []
for i in ae_layout.index:
    with open('results/trips_{}.pkd'.format(i), 'rb') as f:
        trips.append(pickle.load(f))
    with open('results/zae_{}.pkd'.format(i), 'rb') as f:
        zae.append(pickle.load(f))
pd.concat(
    [
        ae_layout[['zone_label', 'time_label', 'start_time', 'duration']],
        pd.DataFrame(
            [(len(td), np.mean(ae)) for td, ae in zip(trips, zae)],
            columns=['n_trips', 'efficiency'],
        ),
    ],
    axis=1,
).sort_values(['zone_label', 'start_time'])
№ | zone_label | time_label | start_time | duration (min) | n_trips | efficiency |
---|---|---|---|---|---|---|
0 | Astoria | Weekday morning | 2016-06-15 08:00 | 60 | 34 | 0.081083 |
5 | Astoria | Weekday afternoon | 2016-06-15 12:00 | 60 | 34 | 0.127128 |
10 | Astoria | Weekday evening | 2016-06-15 18:00 | 60 | 58 | 0.180803 |
15 | Astoria | Weekend morning | 2016-06-18 08:00 | 60 | 19 | 0.063788 |
20 | Astoria | Weekend afternoon | 2016-06-18 12:00 | 60 | 40 | 0.116931 |
25 | Astoria | Weekend evening | 2016-06-18 18:00 | 60 | 59 | 0.252794 |
1 | Astoria-Midtown | Weekday morning | 2016-06-15 08:00 | 90 | 11 | 0.368971 |
6 | Astoria-Midtown | Weekday afternoon | 2016-06-15 12:00 | 120 | 8 | 0.106940 |
11 | Astoria-Midtown | Weekday evening | 2016-06-15 18:00 | 60 | 11 | 0.473867 |
16 | Astoria-Midtown | Weekend morning | 2016-06-18 08:00 | 60 | 15 | 0.275607 |
21 | Astoria-Midtown | Weekend afternoon | 2016-06-18 12:00 | 60 | 13 | 0.124903 |
26 | Astoria-Midtown | Weekend evening | 2016-06-18 18:00 | 60 | 21 | 0.212501 |
2 | Astoria-Midtown-LGA | Weekday morning | 2016-06-15 08:00 | 3 | 17 | 3.110343 |
7 | Astoria-Midtown-LGA | Weekday afternoon | 2016-06-15 12:00 | 5 | 21 | 2.762013 |
12 | Astoria-Midtown-LGA | Weekday evening | 2016-06-15 18:00 | 5 | 14 | 1.312172 |
17 | Astoria-Midtown-LGA | Weekend morning | 2016-06-18 08:00 | 10 | 23 | 0.999410 |
22 | Astoria-Midtown-LGA | Weekend afternoon | 2016-06-18 12:00 | 10 | 27 | 1.846878 |
27 | Astoria-Midtown-LGA | Weekend evening | 2016-06-18 18:00 | 15 | 14 | 0.339333 |
3 | UES | Weekday morning | 2016-06-15 08:00 | 5 | 115 | 2.966929 |
8 | UES | Weekday afternoon | 2016-06-15 12:00 | 5 | 122 | 2.983721 |
13 | UES | Weekday evening | 2016-06-15 18:00 | 5 | 125 | 3.674477 |
18 | UES | Weekend morning | 2016-06-18 08:00 | 5 | 34 | 1.031728 |
23 | UES | Weekend afternoon | 2016-06-18 12:00 | 5 | 68 | 1.804801 |
28 | UES | Weekend evening | 2016-06-18 18:00 | 5 | 74 | 2.187959 |
4 | UES-Midtown | Weekday morning | 2016-06-15 08:00 | 1 | 35 | 3.023929 |
9 | UES-Midtown | Weekday afternoon | 2016-06-15 12:00 | 2 | 42 | 2.836140 |
14 | UES-Midtown | Weekday evening | 2016-06-15 18:00 | 1 | 25 | 1.526476 |
19 | UES-Midtown | Weekend morning | 2016-06-18 08:00 | 2 | 12 | 0.928378 |
24 | UES-Midtown | Weekend afternoon | 2016-06-18 12:00 | 2 | 43 | 2.458989 |
29 | UES-Midtown | Weekend evening | 2016-06-18 18:00 | 2 | 30 | 1.305169 |
2.1 Assess efficiency of aggregating rides within Astoria
After calculating the aggregation efficiencies for the various zone–time combinations, I pool the trip‐by‐trip aggregation efficiencies by zone to compare the overall efficiency quantiles and sample statistics. The numbers below reflect individual trip aggregation efficiencies across weekday‐weekend and morning‐afternoon‐evening combinations.
def aggeff_stats_by_zone(zones):
    d = {
        z: np.concatenate(
            [zae[i] for i in ae_layout.index[ae_layout['zone_label'] == z]]
        )
        for z in zones
    }
    return pd.DataFrame([
        pd.concat([pd.Series([z], index=['zone']), pd.Series(d[z]).describe()])
        for z in zones
    ])

aggeff_stats_by_zone(
    ['Astoria', 'Astoria-Midtown', 'Astoria-Midtown-LGA', 'UES', 'UES-Midtown']
)
№ | zone | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|
0 | Astoria | 169.0 | 0.141859 | 0.248959 | 0.0 | 0.000000 | 0.003992 | 0.211858 | 1.200064 |
1 | Astoria-Midtown | 79.0 | 0.257558 | 0.276845 | 0.0 | 0.032165 | 0.166450 | 0.420903 | 1.255381 |
2 | Astoria-Midtown-LGA | 116.0 | 1.783201 | 1.346923 | 0.0 | 0.696486 | 1.429303 | 2.829936 | 5.210652 |
3 | UES | 180.0 | 2.441603 | 1.805567 | 0.0 | 1.003623 | 2.101666 | 3.500081 | 10.291479 |
4 | UES-Midtown | 157.0 | 2.153052 | 1.526896 | 0.0 | 0.998387 | 2.040324 | 3.093651 | 6.243472 |
taxi.plot_zones_and_trips(
    ['Astoria'],
    pd.concat(
        [trips[i] for i in ae_layout.index[ae_layout['zone_label'] == 'Astoria']]
    ),
)
2.2 Comparing zones
The preceding table shows that ride aggregation is a much more efficient process in the Upper East Side (UES) than in Astoria: comparing means shifted by 1 (so that a zone of wholly unaggregable trips scores 1 rather than 0), UES is about
\begin{equation*} \frac{1 + 2.44}{1 + 0.14} = 3.02 \end{equation*}times as efficient.
A histogram shows that ride aggregation efficiency has a much greater range in UES.
import matplotlib.pyplot as plt

ae_astoria = np.concatenate(
    [zae[i] for i in ae_layout.index[ae_layout['zone_label'] == 'Astoria']]
)
ae_ues = np.concatenate(
    [zae[i] for i in ae_layout.index[ae_layout['zone_label'] == 'UES']]
)
plt.hist(ae_astoria, alpha=0.5, label='Astoria')
plt.hist(ae_ues, alpha=0.5, label='UES')
plt.legend(loc='upper left')
plt.show()
taxi.plot_zones_and_trips(
    ['Upper East Side', taxi.get_manhattan_neighborhoods()],
    pd.concat(
        [trips[i] for i in ae_layout.index[ae_layout['zone_label'] == 'UES']]
    ),
)
2.3 Scope of service
Ride aggregation efficiency can function as a decision variable for determining scope of service. For example, ride aggregation efficiency is generally higher between Astoria and Manhattan than within Astoria, suggesting that interzone service could be feasible in this case.
taxi.plot_zones_and_trips(
    ['Astoria', 'Midtown', taxi.get_manhattan_neighborhoods()],
    pd.concat(
        [trips[i] for i in ae_layout.index[ae_layout['zone_label'] == 'Astoria-Midtown']]
    ),
)
2.4 Service scheduling
The following table of statistics suggests that ride aggregation efficiencies in the area of study are comparable between weekdays and weekends and, moreover, that efficiencies tend to increase over the course of the day. Since optimal service scheduling depends more heavily on ride demand than on ride aggregability, a scheduling solution would most likely involve modulating driver supply according to diurnal demand variation.
def filter_stats_by_time(times, zone):
    d = {
        t: np.concatenate([
            zae[i]
            for i in ae_layout.index[
                (ae_layout['time_label'] == t)
                & (ae_layout['zone_label'] == zone)
            ]
        ])
        for t in times
    }
    return pd.DataFrame([
        pd.concat([pd.Series([t], index=['time']), pd.Series(d[t]).describe()])
        for t in times
    ])

filter_stats_by_time(
    [
        'Weekday morning',
        'Weekday afternoon',
        'Weekday evening',
        'Weekend morning',
        'Weekend afternoon',
        'Weekend evening',
    ],
    'Astoria',
)
№ | time | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|
0 | Weekday morning | 30.0 | 0.081083 | 0.175967 | 0.0 | 0.000000 | 0.000000 | 0.022029 | 0.692394 |
1 | Weekday afternoon | 30.0 | 0.127128 | 0.261474 | 0.0 | 0.000000 | 0.000019 | 0.153857 | 0.989229 |
2 | Weekday evening | 30.0 | 0.180803 | 0.205144 | 0.0 | 0.000000 | 0.139620 | 0.292863 | 0.781462 |
3 | Weekend morning | 19.0 | 0.063788 | 0.159064 | 0.0 | 0.000000 | 0.000000 | 0.023260 | 0.551952 |
4 | Weekend afternoon | 30.0 | 0.116931 | 0.231197 | 0.0 | 0.000000 | 0.005462 | 0.064922 | 0.882757 |
5 | Weekend evening | 30.0 | 0.252794 | 0.354266 | 0.0 | 0.005164 | 0.053167 | 0.440431 | 1.200064 |