Ride Aggregation Efficiency

Alan Parker Lue

26 February 2018

Abstract. I propose a method to quantify the notion of efficiency when aggregating rides in ridesharing settings, where passengers on two or more distinct trips may occupy the same vehicle at the same time. Given a set of trips that overlap in time but may differ in origin and destination, a measure of ride aggregation efficiency can help ridesharing system operators decide how best to allocate riders to vehicles.

In the course of fulfilling ride requests in a ridesharing system, riders can often be efficiently allocated to vehicles by identifying the sets of rides with the greatest space–time overlap. Rides with high overlap have high ride aggregation efficiency, meaning that such rides tend to travel in the same direction and occur at similar times. In general there are two approaches:

  1. Ex ante heuristics
  2. Ex post measurement

Ex post measurement (i.e., of actual or simulated aggregated ride outcomes) simultaneously measures both pure ride aggregation efficiency and the effectiveness of the ride routing algorithm itself—e.g., the rides in a zone may be intrinsically highly aggregable, but an ineffective routing algorithm could lead to very low aggregation. If we assume that the algorithm is efficient, then we could consider the ratio of passenger-minutes to driver-minutes:

\begin{equation} f(t_0, T) = \frac{\sum_{i=1}^n \int_{t_0}^T p_i(t)\,dt}{\sum_{i=1}^n \int_{t_0}^T \mathbb{1}_i(t)\,dt}, \end{equation}

where \( p_i(t) \) is the number of passengers in vehicle \( i \) at time \( t \), \( \mathbb{1}_i(t) \) is an indicator variable for whether vehicle \( i \) is driving a passenger at time \( t \), and \( n = n(t_0, T) \) is the number of vehicles operating during the interval \( (t_0, T) \). This ratio equals \( 1 \) for completely independent trips and grows large when many rides aggregate into few vehicles, so higher is better. We take care to avoid assigning high efficiency to perverse outcomes (e.g., where a rider suffers a long detour) by assuming that ride-routing algorithms are efficient.
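For instance, given minute-sampled occupancy logs, the ex post ratio can be computed directly. The sketch below uses made-up occupancy data purely for illustration; it is not part of the heuristic developed below.

```python
import numpy as np

def ex_post_efficiency(occupancy):
    """Ratio of passenger-minutes to driver-minutes.

    occupancy: list of per-vehicle arrays, where occupancy[i][t] is the
    number of passengers in vehicle i during minute t (minute-level
    samples standing in for the integrals in the equation above).
    """
    passenger_minutes = sum(p.sum() for p in occupancy)
    driver_minutes = sum((p > 0).sum() for p in occupancy)
    return passenger_minutes / driver_minutes

# Two vehicles over five minutes; the first carries two riders for a while.
occ = [np.array([1, 2, 2, 1, 0]), np.array([0, 1, 1, 1, 1])]
print(ex_post_efficiency(occ))  # 10 passenger-minutes / 8 driver-minutes = 1.25
```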

In the absence of a particular ride-routing algorithm, I propose measuring ride aggregation efficiency with an ex ante heuristic that uses publicly available data on unaggregated trips.

1 Ex ante heuristic for ride aggregation efficiency

I propose a ride aggregation efficiency metric that yields a nonnegative number whose magnitude summarizes how many times a trip can be grouped with other trips, accounting for trip duration. For example,

  • a trip whose aggregation efficiency is 0 is unaggregable, whereas
  • a trip whose aggregation efficiency is 4 can group its trajectory with nearby trips four times over.

For zones, aggregation efficiency in a given time window is the average of the aggregation efficiencies of the trips within the zone and time window.

I measure ride aggregation efficiency by considering fields of aggregation compatibility around the trajectories of past taxi trips and by quantifying the amount of overlap in those fields. The core idea is that if there is a large amount of overlap between the aggregation compatibility fields of two taxi trips, then the trips can be easily aggregated.

I make the following assumptions:

  • Vehicles travel at constant velocity in straight-line trajectories from their origins to their destinations.
  • Aggregation compatibility is a decreasing function of straight-line distance between two vehicles.
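Under the first assumption, a vehicle's position at any time is a linear interpolation between its origin and destination. A minimal sketch (the actual implementation is the Trip class in taxi.py):

```python
import numpy as np

def position(t, t_pick, t_drop, origin, dest):
    """Constant-velocity, straight-line position (lat, lon) at time t.

    Clipping pins the vehicle to its origin before pick-up and to its
    destination after drop-off.
    """
    origin, dest = np.asarray(origin, float), np.asarray(dest, float)
    frac = np.clip((t - t_pick) / (t_drop - t_pick), 0.0, 1.0)
    return origin + frac * (dest - origin)

# Halfway through a 10-minute trip, the vehicle sits at the midpoint.
print(position(5.0, 0.0, 10.0, (40.7759, -73.9276), (40.7665, -73.9217)))
```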

Based on these assumptions, I posit the following model and efficiency metrics:

  1. A vehicle's aggregation compatibility field at each point in time can be represented by a bivariate Gaussian distribution.
  2. Over the course of their trajectories, the amount of overlap between the fields of two vehicles measures their mutual aggregation compatibility.
  3. I calculate the aggregation compatibility of two vehicle trajectories (i.e., trips) by integrating the overlap of their aggregation compatibility fields across space and time.
  4. I calculate the trip aggregation efficiency by (1) summing the pairwise aggregation compatibilities of that trip with all other trips in a given time window and zone and (2) dividing by the duration of the trip.
  5. I calculate the zone aggregation efficiency by taking the mean of the zone's trip aggregation efficiencies in a given time window.
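Steps 4 and 5 reduce to a sum-and-normalize over pairwise compatibilities. A minimal sketch, assuming a pairwise function compat(a, b) that stands in for the overlap integral (all names here are hypothetical, not the taxi.py API):

```python
import numpy as np

def trip_aggregation_efficiency(i, trips, compat):
    """Step 4: sum the pairwise compatibilities of trip i with all
    other trips in the window, then divide by trip i's duration."""
    total = sum(compat(trips[i], trips[j])
                for j in range(len(trips)) if j != i)
    return total / trips[i]['duration']

def zone_aggregation_efficiency(trips, compat):
    """Step 5: mean of the trip aggregation efficiencies in the zone."""
    return np.mean([trip_aggregation_efficiency(i, trips, compat)
                    for i in range(len(trips))])

# Toy example: two trips with a constant pairwise compatibility of 4.0.
trips = [{'duration': 2.0}, {'duration': 4.0}]
compat = lambda a, b: 4.0
print(trip_aggregation_efficiency(0, trips, compat))  # 4.0 / 2.0 = 2.0
print(zone_aggregation_efficiency(trips, compat))     # mean(2.0, 1.0) = 1.5
```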

The core calculation is the integration of aggregation compatibility overlap across space and time:

\begin{equation} \int_{t_0}^T \int_{x_0}^X \int_{y_0}^Y \mathbb{1}(x, y; \mu_1(t), n\Sigma)\, \mathbb{1}(x, y; \mu_2(t), n\Sigma)\, \frac{c}{\sqrt{(2\pi)^2 |\Sigma|}}\, f(x, y, t) \,dy\,dx\,dt, \textrm{ and} \end{equation} \begin{equation} f(x, y, t) = \min\left(\exp\left\{-\frac{1}{2}[(x, y) - \mu_1(t)]^T \Sigma^{-1} [(x, y) - \mu_1(t)]\right\}, \exp\left\{-\frac{1}{2}[(x, y) - \mu_2(t)]^T \Sigma^{-1} [(x, y) - \mu_2(t)]\right\}\right), \end{equation}

where \( t \) is time, \( x \) is longitude, \( y \) is latitude, \( \mu_i(t) \) gives the coordinates of vehicle \( i \) at time \( t \), \( \Sigma \) is the covariance matrix for aggregation compatibility, \( n \) is the number of standard deviations from the mean at which the distribution is truncated, and \( c \) is the normalizing constant for the truncated bivariate normal distribution.
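As a sketch of how the spatial part of this integral can be evaluated at a single time \( t \), the code below integrates the minimum of the two Gaussian kernels on a coarse grid, applying the indicator (truncation) terms and, purely for illustration, setting \( c = 1 \). This is an assumption-laden stand-in for classes TruncBiNorm and TruncBiNormOverlap in taxi.py, not the actual implementation.

```python
import numpy as np

def overlap_at_t(mu1, mu2, Sigma, n=3.0, grid=200):
    """Spatial overlap of two truncated Gaussian compatibility fields
    centered at mu1 and mu2 (a coarse Riemann-sum approximation).

    The indicator terms zero the integrand wherever the Mahalanobis
    distance from either mean exceeds n; c = 1 is a placeholder for
    the truncation normalizing constant.
    """
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    Sinv = np.linalg.inv(Sigma)
    sd = np.sqrt(np.diag(Sigma))
    lo = np.minimum(mu1, mu2) - n * sd
    hi = np.maximum(mu1, mu2) + n * sd
    xs = np.linspace(lo[0], hi[0], grid)
    ys = np.linspace(lo[1], hi[1], grid)
    X, Y = np.meshgrid(xs, ys, indexing='ij')
    pts = np.stack([X, Y], axis=-1)

    def kernel(mu):
        d = pts - mu
        q = np.einsum('...i,ij,...j->...', d, Sinv, d)  # squared Mahalanobis
        return np.where(q <= n ** 2, np.exp(-0.5 * q), 0.0)

    c = 1.0  # placeholder normalizing constant
    norm = c / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(Sigma))
    integrand = np.minimum(kernel(mu1), kernel(mu2)) * norm
    return integrand.sum() * (xs[1] - xs[0]) * (ys[1] - ys[0])

# Identical fields overlap almost completely; distant fields not at all.
print(overlap_at_t((0, 0), (0, 0), np.eye(2)))  # close to 1
print(overlap_at_t((0, 0), (8, 0), np.eye(2)))  # 0.0
```

With \( c = 1 \) and identical means, the value approaches the truncated mass \( 1 - e^{-n^2/2} \); integrating this quantity over the trips' common time span yields the full space–time compatibility.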

In order to operationalize this model, I implement the following features:

  • Truncated bivariate normal distribution to represent aggregation compatibility (see class TruncBiNorm and TruncBiNormOverlap in taxi.py)
  • Use a trip's pick-up time, origin, and destination to instantiate a trip object with a corresponding trajectory and aggregation compatibility field (see class Trip in taxi.py)

This model has three parameters that impact the efficiency calculation for a given set of taxi trip data:

  • Vehicle velocity. Assume 15 km/h (about 9.3 mph), or 250 m/min.
    • 0.00225 lat/min
    • 0.00296 lon/min
  • Standard deviation of aggregation compatibility. Assume 300 m (just over 1 Manhattan avenue).
    • 0.00270 lat
    • 0.00355 lon
  • Limits of bivariate Gaussian truncation. Assume 3 standard deviations.
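The degree figures above are consistent with standard meters-per-degree conversions at Astoria's latitude (roughly 40.77° N, an assumed reference value). A quick sanity check:

```python
import numpy as np

M_PER_DEG_LAT = 111_000  # approximate meters per degree of latitude
m_per_deg_lon = M_PER_DEG_LAT * np.cos(np.radians(40.77))  # shrinks with latitude

v = 250.0   # vehicle speed in m/min (15 km/h)
sd = 300.0  # aggregation compatibility standard deviation in m

print(v / M_PER_DEG_LAT, v / m_per_deg_lon)    # ~0.00225 lat/min, ~0.00297 lon/min
print(sd / M_PER_DEG_LAT, sd / m_per_deg_lon)  # ~0.00270 lat,     ~0.00357 lon
```

Small differences from the figures above reflect the choice of reference latitude and meters-per-degree constant.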

1.1 Example: (Roughly) overlapping rides

COLNAMES = [
    'pick_time', 'drop_time',
    'pick_lat', 'pick_lon', 'drop_lat', 'drop_lon',
    'n_passengers',
]

Consider two rides in Astoria, NY.

trips = pd.DataFrame(
    [
	(pd.Timestamp('2016-06-15 12:00:00'), np.nan,
	 40.7759, -73.9276, 40.7665, -73.9217, 1),
	(pd.Timestamp('2016-06-15 12:00:03'), np.nan,
	 40.7757, -73.9269, 40.7673, -73.9209, 1),
    ],
    columns=COLNAMES,
)
output_notebook()
taxi.plot_zones_and_trips(['Astoria'], trips)

These two rides have high mutual aggregation efficiencies, as they are close in both space and time.

zae = taxi.zone_aggeffs(trips)

print('Trip aggregation efficiencies:', zae)
print('Mean efficiency for zone: {}'.format(np.mean(zae)))
Trip aggregation efficiencies: [0.81878319 0.8921752 ]
Mean efficiency for zone: 0.8554791973249489

However, if the second ride instead starts three minutes later, their aggregation efficiencies fall.

trips = pd.DataFrame(
    [
	(pd.Timestamp('2016-06-15 12:00:00'), np.nan,
	 40.7759, -73.9276, 40.7665, -73.9217, 1),
	(pd.Timestamp('2016-06-15 12:03:03'), np.nan,
	 40.7757, -73.9269, 40.7673, -73.9209, 1),
    ],
    columns=COLNAMES,
)
zae = taxi.zone_aggeffs(trips)

print('Trip aggregation efficiencies:', zae)
print('Mean efficiency for zone: {}'.format(np.mean(zae)))
Trip aggregation efficiencies: [0.07696682 0.08386577]
Mean efficiency for zone: 0.08041629551941824

1.2 Example: Multiple partially overlapping rides

For another example, the aggregation efficiency of the second of the following three trips reflects its space–time overlap with the first and third rides.

trips = pd.DataFrame(
    [
	(pd.Timestamp('2016-06-15 12:00:00'), np.nan,
	 40.7759, -73.9276, 40.7665, -73.9217, 1),
	(pd.Timestamp('2016-06-15 12:01:00'), np.nan,
	 40.7710, -73.9240, 40.7630, -73.9190, 1),
	(pd.Timestamp('2016-06-15 12:02:00'), np.nan,
	 40.7651, -73.9200, 40.7608, -73.9170, 1),
    ],
    columns=COLNAMES,
)
output_notebook()
taxi.plot_zones_and_trips(['Astoria'], trips)
zae = taxi.zone_aggeffs(trips)

print('Trip aggregation efficiencies:', zae)
print('Mean efficiency for zone: {}'.format(np.mean(zae)))
Trip aggregation efficiencies: [0.48252141 0.71270788 0.55640971]
Mean efficiency for zone: 0.583879668259483

1.3 Example: Unaggregable rides

Finally, unaggregable rides do not impact the aggregation efficiencies of other trips, but they do reduce the zone-level aggregation efficiency.

trips = pd.DataFrame(
    [
	(pd.Timestamp('2016-06-15 12:00:00'), np.nan,
	 40.7759, -73.9276, 40.7665, -73.9217, 1),
	(pd.Timestamp('2016-06-15 12:01:00'), np.nan,
	 40.7710, -73.9240, 40.7630, -73.9190, 1),
	(pd.Timestamp('2016-06-15 12:02:00'), np.nan,
	 40.7651, -73.9200, 40.7608, -73.9170, 1),
	(pd.Timestamp('2016-06-15 12:02:00'), np.nan,
	 40.7670, -73.9100, 40.7600, -73.9100, 1),
    ],
    columns=COLNAMES,
)
output_notebook()
taxi.plot_zones_and_trips(['Astoria'], trips)
zae = taxi.zone_aggeffs(trips)

print('Trip aggregation efficiencies:', zae)
print('Mean efficiency for zone: {}'.format(np.mean(zae)))
Trip aggregation efficiencies: [0.48252141 0.71270788 0.55640971 0.        ]
Mean efficiency for zone: 0.4379097511946123

2 Simulation and analysis

I calculate aggregation efficiencies for individual trips across various zones and time windows. I take Midtown to represent Manhattan and examine the following zone combinations:

  • Astoria
  • Astoria–Midtown
  • Astoria–Midtown–LGA
  • UES
  • UES–Midtown

I perform the calculation over six different times of the week:

  • Weekday morning (Wednesday starting at 08:00)
  • Weekday afternoon (Wednesday starting at 12:00)
  • Weekday evening (Wednesday starting at 18:00)
  • Weekend morning (Saturday starting at 08:00)
  • Weekend afternoon (Saturday starting at 12:00)
  • Weekend evening (Saturday starting at 18:00)

Because the aggregation efficiency calculation is computationally intensive, I use time windows ranging from 1 to 120 minutes to limit the number of trips to a reasonable level. For each time window with more than 30 trips, I randomly sample 30 trips to perform the aggregation efficiency calculation. The table below shows the zone-level aggregation efficiencies.
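The windowing-and-sampling step can be sketched as follows (sample_window is a hypothetical helper, not a function in taxi.py; the column name follows COLNAMES above):

```python
import pandas as pd

def sample_window(trips, start, minutes, max_trips=30, seed=0):
    """Return trips picked up in [start, start + minutes), capped at
    max_trips by random sampling to keep the O(n^2) pairwise
    compatibility calculation tractable."""
    end = start + pd.Timedelta(minutes=minutes)
    window = trips[(trips['pick_time'] >= start) & (trips['pick_time'] < end)]
    if len(window) > max_trips:
        window = window.sample(n=max_trips, random_state=seed)
    return window
```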

# Load aggregation efficiency calculation layout
ae_layout = pd.read_csv('aggeff_layout.csv')

# Load trip data and aggregation efficiency results
trips, zae = [], []
for i in ae_layout.index:
    with open('results/trips_{}.pkd'.format(i), 'rb') as f:
	trips.append(pickle.load(f))
    with open('results/zae_{}.pkd'.format(i), 'rb') as f:
	zae.append(pickle.load(f))
pd.concat(
    [
	ae_layout[['zone_label', 'time_label', 'start_time', 'duration']],
	pd.DataFrame(
	    [(len(td), np.mean(ae)) for td, ae in zip(trips, zae)],
	    columns=['n_trips', 'efficiency'],
	),
    ],
    axis=1,
).sort_values(['zone_label', 'start_time'])
zone_label time_label start_time duration n_trips efficiency
0 Astoria Weekday morning 2016-06-15 08:00 60 34 0.081083
5 Astoria Weekday afternoon 2016-06-15 12:00 60 34 0.127128
10 Astoria Weekday evening 2016-06-15 18:00 60 58 0.180803
15 Astoria Weekend morning 2016-06-18 08:00 60 19 0.063788
20 Astoria Weekend afternoon 2016-06-18 12:00 60 40 0.116931
25 Astoria Weekend evening 2016-06-18 18:00 60 59 0.252794
1 Astoria-Midtown Weekday morning 2016-06-15 08:00 90 11 0.368971
6 Astoria-Midtown Weekday afternoon 2016-06-15 12:00 120 8 0.106940
11 Astoria-Midtown Weekday evening 2016-06-15 18:00 60 11 0.473867
16 Astoria-Midtown Weekend morning 2016-06-18 08:00 60 15 0.275607
21 Astoria-Midtown Weekend afternoon 2016-06-18 12:00 60 13 0.124903
26 Astoria-Midtown Weekend evening 2016-06-18 18:00 60 21 0.212501
2 Astoria-Midtown-LGA Weekday morning 2016-06-15 08:00 3 17 3.110343
7 Astoria-Midtown-LGA Weekday afternoon 2016-06-15 12:00 5 21 2.762013
12 Astoria-Midtown-LGA Weekday evening 2016-06-15 18:00 5 14 1.312172
17 Astoria-Midtown-LGA Weekend morning 2016-06-18 08:00 10 23 0.999410
22 Astoria-Midtown-LGA Weekend afternoon 2016-06-18 12:00 10 27 1.846878
27 Astoria-Midtown-LGA Weekend evening 2016-06-18 18:00 15 14 0.339333
3 UES Weekday morning 2016-06-15 08:00 5 115 2.966929
8 UES Weekday afternoon 2016-06-15 12:00 5 122 2.983721
13 UES Weekday evening 2016-06-15 18:00 5 125 3.674477
18 UES Weekend morning 2016-06-18 08:00 5 34 1.031728
23 UES Weekend afternoon 2016-06-18 12:00 5 68 1.804801
28 UES Weekend evening 2016-06-18 18:00 5 74 2.187959
4 UES-Midtown Weekday morning 2016-06-15 08:00 1 35 3.023929
9 UES-Midtown Weekday afternoon 2016-06-15 12:00 2 42 2.836140
14 UES-Midtown Weekday evening 2016-06-15 18:00 1 25 1.526476
19 UES-Midtown Weekend morning 2016-06-18 08:00 2 12 0.928378
24 UES-Midtown Weekend afternoon 2016-06-18 12:00 2 43 2.458989
29 UES-Midtown Weekend evening 2016-06-18 18:00 2 30 1.305169

2.1 Assess efficiency of aggregating rides within Astoria

After calculating the aggregation efficiencies for the various zone–time combinations, I pool the trip‐by‐trip aggregation efficiencies by zone to compare the overall efficiency quantiles and sample statistics. The numbers below reflect individual trip aggregation efficiencies across weekday‐weekend and morning‐afternoon‐evening combinations.

def aggeff_stats_by_zone(zones):
    d = {
	z: np.concatenate([zae[i] for i in ae_layout.index[ae_layout['zone_label'] == z]])
	for z in zones
    }
    return pd.DataFrame([
	pd.concat([pd.Series([z], index=['zone']), pd.Series(d[z]).describe()])
	for z in zones
    ])


aggeff_stats_by_zone(
    ['Astoria', 'Astoria-Midtown', 'Astoria-Midtown-LGA', 'UES', 'UES-Midtown']
)
zone count mean std min 25% 50% 75% max
0 Astoria 169.0 0.141859 0.248959 0.0 0.000000 0.003992 0.211858 1.200064
1 Astoria-Midtown 79.0 0.257558 0.276845 0.0 0.032165 0.166450 0.420903 1.255381
2 Astoria-Midtown-LGA 116.0 1.783201 1.346923 0.0 0.696486 1.429303 2.829936 5.210652
3 UES 180.0 2.441603 1.805567 0.0 1.003623 2.101666 3.500081 10.291479
4 UES-Midtown 157.0 2.153052 1.526896 0.0 0.998387 2.040324 3.093651 6.243472
taxi.plot_zones_and_trips(
    ['Astoria'],
    pd.concat([trips[i] for i in ae_layout.index[ae_layout['zone_label'] == 'Astoria']]),
)

2.2 Comparing zones

The preceding table shows that ride aggregation is a much more efficient process in the Upper East Side (UES) than in Astoria. Adding 1 to each mean so that the trip itself is counted (an unaggregable trip then corresponds to a single vehicle trip), UES is about

\begin{equation*} \frac{1 + 2.44}{1 + 0.14} \approx 3.02 \end{equation*}

times more efficient when comparing means.

A histogram shows that ride aggregation efficiency has a much greater range in UES.

ae_astoria = np.concatenate(
    [zae[i] for i in ae_layout.index[ae_layout['zone_label'] == 'Astoria']]
)
ae_ues = np.concatenate(
    [zae[i] for i in ae_layout.index[ae_layout['zone_label'] == 'UES']]
)

plt.hist(ae_astoria, alpha=0.5, label='Astoria')
plt.hist(ae_ues, alpha=0.5, label='UES')
plt.legend(loc='upper left')
plt.show()
taxi.plot_zones_and_trips(
    ['Upper East Side', taxi.get_manhattan_neighborhoods()],
    pd.concat([trips[i] for i in ae_layout.index[ae_layout['zone_label'] == 'UES']]),
)

2.3 Scope of service

Ride aggregation efficiency can function as a decision variable for determining scope of service. For example, ride aggregation efficiency is generally higher between Astoria and Manhattan than within Astoria, suggesting that interzone service could be feasible in this case.

taxi.plot_zones_and_trips(
    ['Astoria', 'Midtown', taxi.get_manhattan_neighborhoods()],
    pd.concat([trips[i] for i in ae_layout.index[ae_layout['zone_label'] == 'Astoria-Midtown']]),
)

2.4 Service scheduling

The following table of statistics suggests that ride aggregation efficiencies in the area of study are comparable between weekdays and weekends and, moreover, that efficiencies tend to increase over the course of the day. Since optimal service scheduling depends more heavily on ride demand than on ride aggregability, a scheduling solution would most likely involve modulating driver supply according to diurnal demand variation.

def filter_stats_by_time(times, zone):
    d = {
	t: np.concatenate([
	    zae[i]
	    for i in ae_layout.index[
		(ae_layout['time_label'] == t)
		& (ae_layout['zone_label'] == zone)
	    ]
	])
	for t in times
    }
    return pd.DataFrame([
	pd.concat([pd.Series([t], index=['time']), pd.Series(d[t]).describe()])
	for t in times
    ])


filter_stats_by_time(
    [
	'Weekday morning',
	'Weekday afternoon',
	'Weekday evening',
	'Weekend morning',
	'Weekend afternoon',
	'Weekend evening',
    ],
    'Astoria',
)
time count mean std min 25% 50% 75% max
0 Weekday morning 30.0 0.081083 0.175967 0.0 0.000000 0.000000 0.022029 0.692394
1 Weekday afternoon 30.0 0.127128 0.261474 0.0 0.000000 0.000019 0.153857 0.989229
2 Weekday evening 30.0 0.180803 0.205144 0.0 0.000000 0.139620 0.292863 0.781462
3 Weekend morning 19.0 0.063788 0.159064 0.0 0.000000 0.000000 0.023260 0.551952
4 Weekend afternoon 30.0 0.116931 0.231197 0.0 0.000000 0.005462 0.064922 0.882757
5 Weekend evening 30.0 0.252794 0.354266 0.0 0.005164 0.053167 0.440431 1.200064

Source code