1 Introduction
Self-driving has the potential to revolutionize transportation and is a major field of AI applications. Even though prototypes capable of driving on highways existed already in 1990 [13], the technology is still not widespread, especially in the context of urban driving. In the past decade, the availability of large datasets and high-capacity neural networks has enabled significant progress in perception [26, 12] and in the vehicles' ability to understand their surrounding environment. Self-driving decision making, however, has seen very little benefit from machine learning or large datasets. State-of-the-art planning systems used in industry [15] still rely heavily on trajectory optimisation techniques with expert-defined cost functions. These cost functions capture desirable properties of the future vehicle path, but engineering them scales poorly with the complexity of driving situations and the long tail of rare events. Learning a driving policy directly from expert demonstrations is therefore appealing, since performance scales to new domains by adding data rather than additional human engineering effort. In this paper we focus specifically on learning rich driving policies for urban driving from large amounts of real-world collected data. Unlike highway driving [19], urban driving requires performing a variety of maneuvers and interactions with, e.g., traffic lights, other cars and pedestrians.
Recently, rich mid-level representations powered by large-scale datasets [21, 11], HD maps and high-performance perception systems have made it possible to capture the nuances of urban driving. This led to new methods achieving high performance for motion prediction [16, 29]. Furthermore, [3] demonstrated that leveraging these representations and behavioral cloning with state perturbations leads to learning robust driving policies. While promising, the difficulty of this approach lies in engineering the perturbation noise mechanism required to avoid covariate shift between the training and testing distributions.
Inspired by this approach, we present the first results on offline learning of imitative driving policies using mid-level representations, a closed-loop simulator and a policy gradient method. This formulation has several benefits: it can learn high-complexity maneuvers without the need for perturbations, implicitly avoids the problem of covariate shift, and directly optimizes imitation as well as auxiliary costs. The proposed simulator is constructed directly from collected logs of real-world demonstrations and HD maps of the area, and can synthesize new realistic driving episodes from past experiences (see Figure 1 for an overview of our method). Furthermore, when training on large datasets, reducing computational complexity is paramount. We leverage vectorized representations and show how they allow computing policy gradients quickly using backpropagation through time. We demonstrate how these choices lead to superior performance of our method over the existing state of the art in imitation learning for real-world self-driving planning in urban areas.
Our contributions are fourfold:

The first demonstration of policy gradient learning of imitative driving policies for complex urban driving from a large corpus of real-world demonstrations. We leverage a closed-loop simulator and rich, mid-level vectorized representations to learn policies capable of performing a variety of maneuvers.

A new differentiable simulator that enables efficient closed-loop simulation of realistic driving experiences based on past demonstrations, and that computes policy gradients quickly via backpropagation through time, allowing fast learning.

A comprehensive qualitative and quantitative evaluation of the method and its performance compared to the existing state of the art. We show that our approach, trained purely in simulation, can control a real-world self-driving vehicle, outperforms other methods, generalizes well, and can effectively optimize both imitation and auxiliary costs.

Source code and data are made available to the public (code and video at https://planning.l5kit.org).
2 Related Work
In this section we summarize different approaches for solving self-driving vehicle (SDV) decision making in both academia and industry. In particular, we focus on both optimisation-based and ML-based systems. Furthermore, we discuss the role of representations and datasets as enablers for tackling progressively more complex Autonomous Driving (AD) scenarios in recent years.
Trajectory-based optimization is still a dominant approach used in industry for both highway and urban-driving scenarios. It relies on manually defined cost and reward functions that describe good driving behavior. Such costs can then be optimized using classical optimization techniques (A* [47], RRTs [22], POMDPs with solvers [2], or dynamic programming [31]). Appealing properties of these approaches are their interpretability and functional guarantees, which are important for AD safety. These methods, however, are very difficult to scale, as they rely on human engineering rather than on data to specify functionality. This becomes especially apparent when tackling complex urban driving scenarios, which we address in this work.
Reinforcement learning (RL) [40] removes some of the complexity of human engineering by providing a reward (cost) signal and using ML to learn an optimal policy that maximizes it. Directly providing the reward through real-time disengagements [23], however, is impractical due to the low sample efficiency of RL and the risks involved. Therefore, most approaches [24] rely on constructing a simulator and explicitly encoding and optimising a reward signal [37]. A limiting factor of these approaches is that the simulator is often hand-engineered [14, 30], limiting its ability to capture long-tail real-world scenarios. Recent examples of sim-to-real policy transfer (e.g. [43], [33], [1]) were not focused on evaluating scenarios typical of urban driving, in particular interactions with other agents. In our work, we construct the simulator directly from real-world logs through mid-level representations. This allows training in a variety of real-world scenarios with other agents present, while employing efficient policy-gradient learning.
Imitation learning (IL) and Inverse Reinforcement Learning (IRL) [35, 46] are more scalable ML approaches that leverage expert demonstrations. Instead of learning from negative events, the aim is to directly copy expert behavior or recover the underlying expert costs. Simple behavioral cloning was applied as early as 1989 [34] on rural roads, and more recently by [8] on highways and [17] in urban driving. Naive behavioral cloning, however, suffers from covariate shift [35]. This issue has been successfully tackled for highway lane-following scenarios by reframing the problem as a classification task [7] or by employing a simple simulator constructed from highway cameras [19]. We take inspiration from these approaches but focus on the significantly more complex task of urban driving. Theoretically, our work is motivated by [42], as we employ a similar principle of generating synthetic corrections to simulate querying an expert. Because of this, the same proven guarantees hold for our method, namely the ideal linear regret bound, mitigating the problem of covariate shift. Adversarial Imitation Learning comprises another important field [4, 20, 6] but, to the best of our knowledge, has seen little application to autonomous driving and no actual SDV deployment yet.
Neural Motion Planners are another approach used for autonomous driving. In [45], raw sensory input and HD maps are used to estimate cost volumes capturing the goodness of possible future SDV positions. Based on these cost volumes, trajectories can be sampled and the lowest-cost one is selected for execution. This was further improved in [10], where the dependency on HD maps was dropped. To the best of our knowledge, however, these promising methods have not yet been demonstrated to drive a car in the real world.
Mid-level representations and the availability of large-scale real-world AD datasets [21, 11] have been major enablers in recent years for tackling complex urban scenarios. Instead of learning policies directly from sensor data, the input of the model comprises the output of the perception system as well as an HD map of the area. This representation compactly captures the nuances of urban scenarios and allows large-scale training on hundreds or thousands of hours of real driving situations. This led to new state-of-the-art solutions for motion forecasting [16, 29]. Moreover, [3] demonstrated that using mid-level representations, large-scale datasets and simple behavioral cloning with perturbations [27] can scale and learn robust planning policies. The difficulty of this approach, however, lies in engineering the noise model to overcome the covariate shift problem. In our work we are inspired by this approach, but learn robust policies using policy gradient optimisation [18], unrolling and evaluating the policy during training. This implicitly avoids the problem of covariate shift and leads to superior results. It is, however, more computationally expensive and requires a simulator. To address this, we show how a fast and powerful simulator can be constructed directly from real-world logs, enabling the scalability of this approach.
Data-driven simulation. A realistic simulator is useful for both training and validation of ML models. However, many current simulators (e.g. [14, 28]) depend on heuristics for vehicle control and do not capture the diversity of real-world behaviours. Data-driven simulators are designed to alleviate this problem. [1] created a photorealistic simulator for training an end-to-end RL policy. [19] simulated a bird's-eye view of dense traffic on a highway. Finally, two recent works [5, 39] developed data-driven simulators and showed their usefulness for training and validating ML planners. In this work we show that a simpler, differentiable simulator based on replaying logs is effective for training.
3 Differentiable Traffic Simulator from Real-world Driving Data
In this section we describe a differentiable simulator that approximates new driving experiences based on an experience collected in the real world. This simulator is used during policy learning for closed-loop evaluation of the current policy's performance and for computing the policy gradient. As shown in Section 5, differentiability is an important building block for achieving good results, especially when employing auxiliary costs.
We represent the real-world experience as a sequence of state observations $s_t$ around the vehicle over time:
(1) $\tau = (s_1, s_2, \ldots, s_T)$
We use a vectorized representation based on [16], in which each state observation $s_t$ consists of a collection of static and dynamic elements around the vehicle pose $p_t \in SE(2)$, with $SE(2)$ denoting the special Euclidean group. Static elements include traffic lanes, stop signs and pedestrian crossings. These elements are extracted from the underlying HD semantic map of the area using the localisation system. The dynamic elements include traffic light status and traffic participants (other cars, buses, pedestrians and cyclists). These are detected in real time by the onboard perception system. Each element includes a pose relative to the SDV pose $p_t$, as well as additional features, such as the element type, time of observation, and other optional attributes, e.g. the color of associated traffic lights or the recent history of moving vehicles. The full details of this representation are provided in Appendix C.
The goal of the simulation is to iteratively generate a sequence of state observations $\hat{s}_1, \ldots, \hat{s}_T$ that corresponds to a different sequence of driver actions $a_1, \ldots, a_T$ in the scenario. This is done by first computing the corresponding SDV trajectory $\hat{p}_1, \ldots, \hat{p}_T$ and then locally transforming the recorded states $s_t$ into $\hat{s}_t$.
Updated poses of the SDV are determined by a kinematic model $\hat{p}_{t+1} = f(\hat{p}_t, a_t)$, which is assumed to be differentiable. The state observation $\hat{s}_t$ is then obtained by computing the new position of each state element $s_t^i$ using a transformation along the difference between the original and updated pose:
(2) $\hat{s}_t^i = \hat{p}_t \, p_t^{-1} \, s_t^i$
See Figure 2 for an illustrative example. It is worth noting that this approximation is effective if the distance between the original and generated SDV pose is not too large.
We denote performing these steps in sequence by the step-by-step simulation transition function $\hat{s}_{t+1} = g(\hat{s}_t, a_t)$. Moreover, since both Equation (2) and the vehicle dynamics $f$ are fully differentiable, we can compute gradients of $g$ with respect to both the state ($g_s$) and the action ($g_a$). This is critical for the efficient computation of policy gradients using backpropagation through time as described in the next section.
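To make these mechanics concrete, the following is a minimal PyTorch sketch of one simulator step. It assumes poses are (x, y, yaw) vectors, element coordinates are stored relative to the current SDV pose, and the unconstrained kinematic model of Section 5.3; all names are illustrative rather than the actual l5kit implementation.

    import torch

    def kinematic_model(pose, action):
        # Unconstrained kinematic model (assumption, cf. Section 5.3):
        # the action is a pose delta (dx, dy, dyaw).
        return pose + action

    def transform_elements(elements, old_pose, new_pose):
        # Re-express element coordinates, given relative to old_pose,
        # in the frame of new_pose (Equation (2)).
        dyaw = old_pose[2] - new_pose[2]
        c, s = torch.cos(dyaw), torch.sin(dyaw)
        rot = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
        cn, sn = torch.cos(new_pose[2]), torch.sin(new_pose[2])
        rot_new_inv = torch.stack([torch.stack([cn, sn]), torch.stack([-sn, cn])])
        offset = rot_new_inv @ (old_pose[:2] - new_pose[:2])
        return elements @ rot.T + offset

    def simulator_step(elements, pose, action):
        # Differentiable transition g: new observation and new SDV pose.
        new_pose = kinematic_model(pose, action)
        return transform_elements(elements, pose, new_pose), new_pose

Because every operation above is differentiable, gradients flow from the transformed elements back into the action, which is what enables the backpropagation through time used in Section 4.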
4 Imitation Learning Using a Differentiable Simulator
In this part, we detail how we use the simulator described in the previous section to learn a deterministic policy $a_t = \pi_\theta(\hat{s}_t)$ that drives a car using closed-loop policy learning.
We frame the imitation learning problem as minimisation of the L1 pose distance between the expert and the learner on a set of collected real-world demonstrations $\mathcal{D}$. Note that, with a slight abuse of notation, we use poses here to refer to 3D vectors $(x, y, \varphi)$ instead of roto-translation matrices in $SE(2)$ – yielding the common L1 norm and loss. This can be expressed as a discounted cumulative expected loss [32] on the set of collected expert scenarios:
(3) $L(\theta) = \mathbb{E}_{\tau \sim \mathcal{D}} \left[ \sum_{t=1}^{T} \gamma^t \, \| \hat{p}_t - p_t \|_1 \right]$
where $p_t$ are the expert poses and $\hat{p}_t$ the poses obtained by unrolling the policy $\pi_\theta$ in the simulator.
Optimising this objective pushes the trajectory taken by the learned policy as close as possible to that of the expert, while also limiting the trajectory to the region where the approximation given by the simulator is effective. In Appendix B we further extend this objective to include auxiliary cost functions with the aim of optimising additional objectives.
We can use any policy optimisation method [36, 25] to optimize Equation (3). However, given that the transition function $g$ is differentiable, we can exploit it for more effective training that does not require a separate estimation of a value function. As shown in [4, 18, 38], this results in an order-of-magnitude more efficient training. The optimisation process consists of repeatedly sampling pairs of expert and policy trajectories $(\tau, \hat{\tau})$ and computing the policy gradient for these samples to minimize Equation (3). We describe both steps in detail in the following subsections.
4.1 Sampling from a Policy Distribution
In this subsection we detail sampling pairs of an expert trajectory ($\tau$) and a corresponding policy trajectory ($\hat{\tau}$) drawn from the policy $\pi_\theta$.
Sampling expert trajectories consists of simply sampling from the collected dataset of expert demonstrations. To generate the policy sample we take the expert's initial state $s_1$ and then unroll the current policy for $T$ steps using the simulator $g$.
This naive method, however, introduces bias, as the initial state of the trajectory is always drawn from the expert and not from the state distribution induced by the policy $\pi_\theta$. As shown in Appendix B, this results in underperformance of the method. To remove this bias we discard the first $T_s$ timesteps from both trajectories and use only the remaining $T - T_s$ timesteps to estimate the policy gradient as described next (see Figure 3 for a visualization).
4.2 Computing Policy Gradient
Here we describe the computation of the policy gradient around the rollout trajectory given by the current policy. This gradient can be computed for deterministic policies using backpropagation through time, leveraging the differentiability of the simulator $g$. Note that we denote partial differentiation with subscripts, i.e. $g_s = \partial g / \partial s$. We follow the formulation in [18] and express the gradient by a pair of recursive formulas:
(4) $V_s = \ell_s + \gamma \, V'_s \left( g_s + g_a \pi_s \right)$
(5) $V_\theta = \gamma \left( V'_s \, g_a \pi_\theta + V'_\theta \right)$
Here $V$ denotes the cumulative loss from the current state onward, $\ell$ the per-timestep imitation loss of Equation (3), and primed quantities are evaluated at the next timestep; since $\ell$ depends only on the state, its action derivatives vanish.
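In practice the recursions in Equations (4) and (5) need not be implemented by hand: unrolling the policy through the differentiable simulator and invoking automatic differentiation yields the same gradient. Below is a minimal sketch, reusing the hypothetical simulator_step from Section 3 and an assumed expert_scene interface for logged poses (both illustrative, not the released code):

    import torch

    def policy_gradient_step(policy, optimizer, expert_scene,
                             T=32, T_s=20, gamma=0.8):
        elements, pose = expert_scene.initial_observation()  # expert state s_1
        loss = torch.tensor(0.0)
        for t in range(T):
            action = policy(elements)          # deterministic policy action
            elements, pose = simulator_step(elements, pose, action)
            if t >= T_s:
                # Discard the first T_s steps (Section 4.1); accumulate the
                # discounted L1 pose loss of Equation (3) on the rest.
                loss = loss + gamma ** t * (pose - expert_scene.pose(t)).abs().sum()
        optimizer.zero_grad()
        loss.backward()    # backpropagation through time over the full unroll
        optimizer.step()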
5 Experiments
In this section we evaluate our proposed method and benchmark it against existing state-of-the-art systems. In particular, we are interested in: its ability to learn robust policies dealing with various situations observed in the real world; its ability to tailor performance using auxiliary costs; its sensitivity to key hyperparameters; and how performance changes with increasing amounts of training data. Additional results can be found in the appendix and the accompanying video.
5.1 Dataset
For training and testing our models we use the Lyft Motion Prediction Dataset [21]. This dataset was recorded by a fleet of autonomous vehicles and contains samples of real-world driving on a complex, urban route in Palo Alto, California. The dataset captures various real-world situations, such as driving in multi-lane traffic, taking turns, and interacting with vehicles at intersections. The data was preprocessed by a perception system, yielding the precise positions of nearby vehicles, cyclists and pedestrians over time. In addition, a high-definition map provides the locations of lane markings, crosswalks and traffic lights. All models are trained on a 100h subset and tested on 25h. The training dataset is identical to the publicly available one, whereas for the sake of execution speed we test on a random, but fixed, subset of the listed test dataset.
5.2 Baselines
We compare our proposed algorithm against three state-of-the-art baselines:

Naive Behavioral Cloning (BC): we implement standard behavioral cloning using our vectorized backbone architecture. We do not use the SDV's history as an input to the model, to avoid causal confusion (cf. [3]).

Behavioral Cloning + Perturbations (BC-perturb): we re-implement a vectorized version of ChauffeurNet [3] using our backbone network. As in the original paper, we add noise in the form of perturbations during training, but do not employ any auxiliary losses. We test two versions: one without the SDV's history, and one using the SDV's history equipped with history dropout.

Multistep Prediction (MS Prediction): we apply the meta-learning framework proposed in [42] to train our vectorized network. We observe that a version of this algorithm can conveniently be expressed within our framework; we obtain it by explicitly detaching gradients between steps (i.e. ignoring the full differentiability of our simulation environment). Unlike the original work [42], we do not save past unrolls as new dataset samples over time.
5.3 Implementation
Inspired by [16, 41], we use a graph neural network to parametrize our policy. It combines a PointNet-like architecture for local input processing with an attention mechanism for global reasoning. In contrast to [16], we use points instead of vectors. Given the set of points corresponding to each input element, we employ 3 PointNet layers to calculate a 128-dimensional feature descriptor. Subsequently, a single layer of scaled dot-product attention performs global feature aggregation, yielding the predicted trajectory. We found $T_s = 20$ and $T = 32$ to work well, i.e. we use 20 timesteps for the initial sampling and effectively predict 12 trajectory steps. $\gamma$ is set to 0.8. In total, our model contains around 3.5 million trainable parameters, and training takes 30h on 32 Tesla V100 GPUs. For more details we refer to Appendix C.
For the vehicle kinematics model we use an unconstrained model, $f(\hat{p}_t, a_t) = \hat{p}_t + a_t$, i.e. the action is an arbitrary pose delta. This allows for a fair comparison with the baselines, as both BC-perturb and MS Prediction assume the possibility of arbitrary pose corrections. Other kinematics models, such as unicycle or bicycle models, could be used with our method as well.
All baseline methods share the same network backbone as ours, with model-specific differences as described above – BC and BC-perturb predict a full $T$-step trajectory with a single forward pass, while MS Prediction and our method call the model once per step. To ensure a fair comparison, we also use our proposed sampling procedure for MS Prediction, i.e. the first $T_s$ steps are used for sampling only. We train all models for 61 epochs, dropping the learning rate after 54 epochs. We note that we achieve the best results for our proposed method by disabling dropout, and hypothesize this is related to similar issues observed for RNNs [44]. We refer the reader to Appendix B for ablations on the influence of data and sampling.
5.4 Metrics
We implement the metrics described below to evaluate planning performance. These capture key imitation performance, safety and comfort. In particular, we report the following metrics, normalized – where applicable – per 1000 miles driven by the respective planner:

L2: L2 distance to the underlying expert position in the driving log in meters.

Offroad events: we report a failure if the planner deviates more than 2m laterally from the reference trajectory – this captures events such as running off-road and into opposing traffic.

Collisions: collisions of the SDV with any other agent, broken down into front, side and rear collisions w.r.t. the SDV.

Comfort: we monitor the absolute value of acceleration, and raise a failure should it exceed 3 m/s².

I1K: we accumulate safety-critical failures (collisions and offroad events) into one key metric for ease of comparison, namely Interventions per 1000 Miles (I1K).
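As a concrete illustration of this aggregation, a sketch of the I1K computation under the definitions above (function and argument names are illustrative):

    def i1k(num_collisions, num_offroad_events, miles_driven):
        # Interventions per 1000 Miles: safety-critical failures
        # (collisions and off-road events) normalized by distance driven.
        return 1000.0 * (num_collisions + num_offroad_events) / miles_driven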
Model  SDV history  Front  Side  Rear  Offroad  L2  Comfort  I1K
BC  ✗  79 ± 23  395 ± 170  997 ± 74  1,618 ± 459  1.57 ± 0.27  93K ± 3K  3,091 ± 601
BC-perturb  ✗  16 ± 2  56 ± 6  411 ± 146  82 ± 11  0.74 ± 0.01  203K ± 6K  567 ± 128
BC-perturb  ✓  14 ± 4  73 ± 7  678 ± 11  77 ± 6  0.77 ± 0.01  636K ± 22K  843 ± 6
MS Prediction  ✗  18 ± 6  55 ± 4  125 ± 14  141 ± 31  0.46 ± 0.02  595K ± 49K  341 ± 39
Ours  ✗  15 ± 7  46 ± 5  101 ± 13  97 ± 6  0.42 ± 0.00  637K ± 41K  260 ± 9
Table 1: Normalized metrics for all baselines and our method, reporting mean and standard deviation over 3 runs (Front/Side/Rear are collision counts). For all metrics, lower is better. Our method yields the best overall performance and the lowest I1K.
5.5 Imitation Results
We evaluate our method and all baselines by unrolling the policy on 3,600 sequences of 25 seconds each from the test set and measuring the metrics above.
Table 1 reports performance when all methods are trained to optimize the imitation loss alone. Behavioral cloning yields a high number of trajectory errors and collisions. This is expected, as this approach is known to suffer from covariate shift [35]. Including perturbations during training dramatically improves performance, as it forces the method to learn how to recover from drifting. We further observe that MS Prediction yields comparable results in many categories while producing fewer rear collisions. We attribute this to a further reduction of covariate shift compared to the previous methods: the training distribution is generated on-policy instead of being synthesized by adding noise. Finally, our method yields the best results overall. It is worth noting that all models share a high number of comfort failures, as they are all trained for imitation performance alone, which optimizes only the positional accuracy of the driven vehicle and not comfort – we address this in the appendix.
5.6 Incar Testing
In addition to above stated simulation results, we further deployed our planner on SDVs in the real. For this, a Ford Fusion equipped with 7 camera, 3 LiDAR and 11 Radar sensors was employed. The sensor setup thus equals the one used for data collection, and during roadtesting our perception and dataprocessing stack is run in realtime to generate the desired scene representation on the fly. For this, vehicles are equipped with 8 Nvidia 2080 TIs. Experiments were conducted on a private test track, including other traffic participants and reproducing challenging driving scenarios. Furthermore, this track was never shown to the network before, and thus offers valuable insights into generalization performance. Figure
5 shows our model successfully crossing a signaled intersection, for more results we refer to the appendix and our supplementary video.6 Conclusion
In this work we have introduced a method for learning an autonomous driving policy in an urban setting, using closed-loop training, mid-level representations, a data-driven simulator and a large corpus of real-world demonstrations. We show this yields good generalization and performance for complex, urban driving. In particular, it can control a real-world self-driving vehicle, yielding better driving performance than other state-of-the-art ML methods.
We believe this approach can be further extended towards the production-grade requirements of real-world L4 and L5 driving systems – in particular, by improving performance in novel or rarely seen scenarios and by increasing sample efficiency, allowing further scaling to millions of hours of driving.
We would like to thank everyone at Level 5 working on data-driven planning, in particular Sergey Zagoruyko, Yawei Ye, Moritz Niendorf, Jasper Friedrichs, Li Huang, Qiangui Huang, Jared Wood, Yilun Chen, Ana Ferreira, Matt Vitelli, Christian Perone, Hugo Grimmett, Parth Kothari, Stefano Pini, Valentin Irimia and Ashesh Jain. Further, we would like to thank Alex Ozark, Bernard Barcela, Alex Oh, Ervin Vugdalic, Kimlyn Huynh and Faraz Abdul Shaikh for deploying our planner to SDVs in the real world.
References
 [1] (2020) Learning robust control policies for end-to-end autonomous driving from data-driven simulation. Robotics and Automation Letters. Cited by: §2, §2.
 [2] (2013) Intention-aware motion planning. In Algorithmic Foundations of Robotics X, E. Frazzoli, T. Lozano-Perez, N. Roy, and D. Rus (Eds.), Cited by: §2.
 [3] (2018) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. Cited by: §1, §2, 1st item, 2nd item.
 [4] (2017) End-to-end differentiable adversarial imitation learning. In Int. Conf. on Machine Learning, Cited by: §2, §4.
 [5] (2021) SimNet: learning reactive self-driving simulations from real-world observations. Int. Conf. on Robotics and Automation. Cited by: §2.
 [6] (2020) Modeling human driving behavior through generative adversarial imitation learning. ArXiv. Cited by: §2.
 [7] (2020) The NVIDIA PilotNet experiments. arXiv preprint arXiv:2010.08776. Cited by: §2.
 [8] (2016) End to end learning for self-driving cars. ArXiv. Cited by: §2.
 [9] (2020) End-to-end object detection with transformers. Cited by: Appendix C: Policy architecture and state representation.

 [10] (2021) MP3: a unified model to map, perceive, predict and plan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14403–14412. Cited by: §2.
 [11] Argoverse: 3D tracking and forecasting with rich maps. Int. Conf. on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.

 [12] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Int. Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
 [13] (2010) Dynamic vision for perception and control of motion. Cited by: §1.
 [14] (2017) CARLA: An open urban driving simulator. In 1st Annual Conference on Robot Learning, Cited by: §2, §2.
 [15] (2018) Baidu Apollo EM motion planner. ArXiv. Cited by: §1.
 [16] (2020) VectorNet: encoding hd maps and agent dynamics from vectorized representation. In Int. Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3, §5.3.
 [17] (2020) Urban driving with conditional imitation learning. In Int. Conf. on Robotics and Automation (ICRA), Cited by: §2.
 [18] (2015) Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, Cited by: §2, §4.2, §4.
 [19] (2019) Model-predictive policy learning with uncertainty regularization for driving in dense traffic. ArXiv. Cited by: §1, §2, §2.
 [20] (2016) Generative adversarial imitation learning. arXiv preprint arXiv:1606.03476. Cited by: §2.
 [21] (2020) One thousand and one hours: self-driving motion prediction dataset. Conference on Robot Learning (CoRL). Cited by: §1, §2, §5.1.
 [22] (2000) RRT-connect: an efficient approach to single-query path planning. In Int. Conf. on Robotics and Automation, Cited by: §2.
 [23] (2019) Learning to drive in a day. In Int. Conf. on Robotics and Automation (ICRA), Cited by: §2.
 [24] (2021) Deep reinforcement learning for autonomous driving: a survey. Transactions on Intelligent Transportation Systems. Cited by: §2.
 [25] (2000) Actor-critic algorithms. In SIAM Journal on Control and Optimization, Cited by: §4.
 [26] (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, Cited by: §1.
 [27] (2017) DART: noise injection for robust imitation learning. In 1st Annual Conference on Robot Learning, Cited by: §2.
 [28] (2018) An environment for autonomous driving decision-making. Cited by: §2.
 [29] (2020) Learning lane graph representations for motion forecasting. Cited by: §1, §2.
 [30] (2018) Microscopic traffic simulation using sumo. In Int. Conf. on Intelligent Transportation Systems (ITSC), Cited by: §2.
 [31] (2009) Junior: the stanford entry in the urban challenge. In The DARPA Urban Challenge: Autonomous Vehicles in City Traffic, Cited by: §2.
 [32] (2000) Algorithms for inverse reinforcement learning.. In Icml, Cited by: §4.
 [33] (2020) Simulation-based reinforcement learning for real-world autonomous driving. In Int. Conf. on Robotics and Automation (ICRA), Cited by: §2.
 [34] (1989) ALVINN: an autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, Cited by: §2.

 [35] (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Fourteenth Int. Conf. on Artificial Intelligence and Statistics, Cited by: §2, §5.5.
 [36] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.
 [37] (2016) Safe, multi-agent, reinforcement learning for autonomous driving. ArXiv. Cited by: §2.
 [38] Deeply AggreVaTeD: differentiable imitation learning for sequential prediction. In Int. Conf. on Machine Learning, Cited by: §4.
 [39] (2021) TrafficSim: learning to simulate realistic multi-agent behaviors. Cited by: §2, Results for Optimizing Auxiliary Costs, Appendix D: Differentiable Collision Loss.
 [40] (2018) Reinforcement learning: an introduction. MIT press. Cited by: §2.
 [41] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Cited by: §5.3.
 [42] (2015) Improving multistep prediction of learned time series models. In AAAI, Cited by: §2, 3rd item.
 [43] (2017) Virtual to real reinforcement learning for autonomous driving. Cited by: §2.
 [44] (2014) Recurrent neural network regularization. Cited by: §5.3.
 [45] (2019) End-to-end interpretable neural motion planner. Int. Conf. on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
 [46] (2008) Maximum entropy inverse reinforcement learning. In National Conference on Artificial Intelligence, Cited by: §2.
 [47] (2008) Navigating carlike robots in unstructured environments using an obstacle sensitive cost function. In Intelligent Vehicles Symposium, Cited by: §2.
Appendix A: Qualitative results
Figure 6 shows our method handling diverse, complex traffic situations well; it is identical to Figure 4 of the paper, but enlarged. For more qualitative results we refer to the supplementary video.
Incar testing
In this section we report additional results of deploying our trained policy to SDVs. Figure 7 shows our planner navigating a multitude of challenging scenarios. For more results we refer to the supplementary video, where we show additional results in video form, containing further information such as different camera angles, the resulting scene understanding, and the planned trajectory of the SDV.
Appendix B: Additional Quantitative Results
Results for Optimizing Auxiliary Costs
In this section we investigate the ability not only to imitate expert behavior, but also to directly optimize metrics of interest. This mode blends pure imitation learning with reinforcement learning and allows tailoring certain aspects of the behavior, e.g. to optimize comfort or safety. To illustrate this, we consider a mixed cost function combining the L1 imitation loss with auxiliary losses:
(6) $\ell_t = \| \hat{p}_t - p_t \|_1 + \alpha \, \ddot{d}_t + \beta \, C(\hat{p}_t, \mathcal{V})$
Here $\ddot{d}_t$ is the magnitude of the acceleration at time $t$ and $C$ is a differentiable collision indicator, with $\mathcal{V}$ denoting the set of other vehicles. This loss is based on [39]; more details can be found in Appendix D. The weights $\alpha$ and $\beta$ control the influence of the different losses.
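A sketch of the per-timestep mixed cost in Equation (6), under the notation assumed above (names are illustrative; the collision term is detailed in Appendix D):

    def mixed_loss(pred_pose, expert_pose, accel, collision_term, alpha, beta):
        # L1 imitation loss plus weighted comfort and collision terms.
        imitation = (pred_pose - expert_pose).abs().sum()
        return imitation + alpha * accel.abs() + beta * collision_term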
Succeeding at this task requires optimally trading off short- and long-term performance between pure imitation and the other goals. Tables 2 and 3 summarize performance when including the acceleration and collision loss, respectively. When including the acceleration term, we note our method is the only one to successfully trade off performance between imitation and comfort cost, thanks to its capability to directly optimize over the full distribution: while I1K slightly increases with growing $\alpha$ – which is expected – we can push comfort failures down to arbitrary levels. All other models fail on at least one of these metrics and/or are insensitive to $\alpha$. When including the collision loss, results are closer together. We hypothesize this is due to the unconstrained kinematic model $f$, which allows one-step corrections and thus requires less reasoning over the full time horizon.
Ablation Studies
Figure 8 shows the impact of training dataset size on performance. We see the performance of the method improving with more data. Figure 9 demonstrates the effect of different values of $T_s$ on the performance of closed-loop training, demonstrating the importance of proper sampling.
Discussion on Used Metrics
Metrics and their definitions are naturally crucial for evaluating experiments, so in the remainder of this section we list additional results using different thresholds and metrics. As reported in the paper, our default threshold for capturing deviations from the expert trajectory is 2m, which is based on average lane widths. Still, one can imagine wider lanes and less regulated traffic scenarios. Because of this, Table 4 shows the results of all examined methods using a threshold of 4m. Naturally, offroad failures increase, while other metrics improve due to our process of resetting after interventions. Still, one can observe that the reported results are relatively robust against such changes, i.e. the differences are small and the relative trends still hold.
In the paper, for simplicity, we measure comfort with a single value, namely acceleration, which itself is based on differentiating speed, i.e. the travelled lateral and longitudinal distance divided by time. However, to reflect actually felt driving comfort, (longitudinal) jerk and lateral acceleration are better suited and more common in industry. Therefore, Table 5 contains these additional values and is otherwise identical to Table 1 of the original paper. These values yield more interesting insights into the obtained driving behaviour, for example indicating that most discomfort is caused by longitudinal acceleration and jerk, while the lateral movement of all methods is much smoother. We further observe a similar theme as reported in the paper, namely that our method is the only one able to jointly optimize for performance and comfort, and that larger $\alpha$ yields smoother driving. Still, we note that the number of jerk failures is higher than the number of acceleration failures, which suggests promising future experiments explicitly penalizing jerk instead of, or in addition to, acceleration.
To complete this excursion on metrics, we briefly discuss rear collisions. Often, they can be attributed to mistakes of other traffic participants or to the non-reactive simulation (consider choosing a slightly different velocity profile, resulting in a rear collision over time). Still, rear collisions can indicate severe misbehavior, such as not starting at green traffic lights or sudden, unsafe braking maneuvers; see Figure 10 for an example. Because of this, we do include rear collisions in our aggregate metric I1K; however, we also report all metrics separately to allow a detailed, customized performance analysis.
Model  SDV history  Front  Side  Rear  Offroad  L2  Comfort  I1K
BC  ✗  153 ± 42  482 ± 203  1,043 ± 67  974 ± 298  8.27 ± 1.75  102K ± 1K  2,653 ± 483
BC-perturb  ✗  22 ± 4  57 ± 8  414 ± 142  27 ± 5  3.06 ± 0.06  204K ± 6K  512 ± 127
BC-perturb  ✓  14 ± 6  74 ± 10  680 ± 12  27 ± 6  3.18 ± 0.02  629K ± 23K  796 ± 12
MS Prediction  ✗  22 ± 3  55 ± 3  125 ± 12  60 ± 13  2.07 ± 0.14  598K ± 49K  265 ± 17
Ours  ✗  17 ± 7  51 ± 5  102 ± 12  40 ± 6  1.83 ± 0.04  638K ± 41K  210 ± 9
Table 4: Normalized metrics when using a lateral deviation threshold of 4m instead of 2m (see the discussion above).
Model  SDV history  Front  Side  Rear  Offroad  L2  Jerk  Lat. Acc.  I1K
BC  ✗  79 ± 23  395 ± 170  997 ± 74  1,618 ± 459  1.57 ± 0.27  958K ± 46K  71 ± 23  3,091 ± 601
BC-perturb  ✗  16 ± 2  56 ± 6  411 ± 146  82 ± 11  0.74 ± 0.01  1,156K ± 672K  1,115 ± 278  567 ± 128
BC-perturb  ✓  14 ± 4  73 ± 7  678 ± 11  77 ± 6  0.77 ± 0.01  1,862K ± 46K  7,285 ± 593  843 ± 6
MS Prediction  ✗  18 ± 6  55 ± 4  125 ± 14  141 ± 31  0.46 ± 0.02  1,600K ± 14K  211 ± 21  341 ± 39
Ours  ✗  15 ± 7  46 ± 5  101 ± 13  97 ± 6  0.42 ± 0.00  1,750K ± 196K  507 ± 321  260 ± 9
Table 5: Normalized metrics as in Table 1, with comfort broken down into longitudinal jerk and lateral acceleration failures.
Appendix C: Policy architecture and state representation
In this section we disclose full details of the proposed network architecture, shown in Figure 11. Each high-level object (such as an agent, lane or crosswalk) is comprised of a certain number of points of fixed feature dimension. All points are individually embedded into a 128-dimensional space. We then add a sinusoidal embedding to the points of each object to give the model an understanding of order, and feed this to our PointNet implementation. This consists of 3 PointNet layers, producing a descriptor of size 128 for each object. We follow this with one layer of scaled dot-product attention: the feature descriptor corresponding to the ego vehicle is used as the query, while all feature descriptors are taken as keys and values. We add an additional type embedding to the keys, so that the model can attend to the values using also the object types – inspired by [9]. A final MLP projects the output to the desired shape, i.e. a trajectory of length $T$, in which each step is described by a position and a yaw angle.
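The following is a simplified PyTorch sketch of this architecture; the sinusoidal and type embeddings are omitted for brevity, and all module names are illustrative rather than the released implementation:

    import torch
    import torch.nn as nn

    class VectorPolicy(nn.Module):
        # Per-point embedding, PointNet-style aggregation per element,
        # scaled dot-product attention with the ego descriptor as query,
        # and an MLP head predicting T steps of (x, y, yaw).
        def __init__(self, feat_dim, T, d=128):
            super().__init__()
            self.embed = nn.Linear(feat_dim, d)
            self.pointnet = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                          nn.Linear(d, d), nn.ReLU(),
                                          nn.Linear(d, d))
            self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
            self.head = nn.Linear(d, T * 3)
            self.T = T

        def forward(self, points):
            # points: (batch, n_elements, n_points, feat_dim); element 0 is the SDV
            x = self.pointnet(self.embed(points))
            desc = x.max(dim=2).values        # pool points into element descriptors
            query = desc[:, :1]               # ego descriptor acts as the query
            out, _ = self.attn(query, desc, desc)
            return self.head(out).view(-1, self.T, 3)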
A full description of our model input state is included in Table 6. We define the state as the whole set of static and dynamic elements the model receives as input. Each element is composed of a variable number of points, which can represent either time (e.g. for agents) or space (e.g. for lanes). The number of features per point depends on the element type; we pad all features to a fixed size to ensure they can share the first fully connected layer. We include all elements up to the listed maximal number within a circular field of view of radius 35m around the SDV. Note that for performance and simplicity we only execute this query once, and then unroll within this world state.
State element(s)  Elements per state  Points per element  Point features description
SDV  1  4  SDV X, Y, yaw pose of the current time step and in recent past 
Agents  up to 30  4  other agents X, Y, yaw poses of the current time step and in recent past 
Lanes mid  up to 30  20  interpolated X,Y points of the mid lanes with optional traffic light signal 
Lanes left  up to 30  20  interpolated X,Y points of the left lanes boundaries 
Lanes right  up to 30  20  interpolated X,Y points of the right lanes boundaries 
Crosswalks  up to 20  up to 20  crosswalks polygons boundaries X,Y points 
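A sketch of how such a padded state tensor could be assembled; the maximum of 141 elements follows from summing the table above, while the point feature dimension here is illustrative:

    import torch

    def pad_state(elements, max_elements=141, max_points=20, feat_dim=8):
        # Pad a variable set of elements, each a (n_points, n_features) tensor,
        # to a fixed-size tensor plus an availability mask, so that all element
        # types can share the first fully connected layer.
        state = torch.zeros(max_elements, max_points, feat_dim)
        mask = torch.zeros(max_elements, max_points, dtype=torch.bool)
        for i, el in enumerate(elements[:max_elements]):
            n = min(el.shape[0], max_points)
            state[i, :n, :el.shape[1]] = el[:n]
            mask[i, :n] = True
        return state, mask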
Appendix D: Differentiable Collision Loss
We use a differentiable collision loss similar to the one proposed in [39]: the idea is to approximate each vehicle by a set of circles and to check these for intersections. Assuming loss calculation for timesteps $t_1$ to $t_2$, we define our collision loss as:
(7) $C = \sum_{t=t_1}^{t_2} \sum_{v \in \mathcal{V}} c(\hat{p}_t, v)$
Here, $c(\hat{p}_t, v)$ describes a pairwise collision term between our SDV and vehicle $v$ at timestep $t$. Assuming $v$ and the SDV (given by pose $\hat{p}_t$) are represented by circles $c_1^v, \ldots, c_m^v$ and $c_1^{SDV}, \ldots, c_n^{SDV}$, then $c(\hat{p}_t, v)$ is calculated as the maximum intersection of any two such circles:
(8) $c(\hat{p}_t, v) = \max_{i, j} \; \mathrm{ci}(c_i^{SDV}, c_j^v)$
with
(9) $\mathrm{ci}(c_i, c_j) = \max\!\left(1 - \frac{d_{ij}}{r_i + r_j}, \; 0\right)$
in which $d_{ij}$ denotes the distance between the respective circles' centers and $r_i, r_j$ their radii. Thus, this term is 0 when the circles do not intersect, and otherwise grows linearly to a maximum value of 1.
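A direct PyTorch sketch of Equations (8) and (9), vectorized over circle pairs (function and argument names are illustrative):

    import torch

    def circle_intersection(centers_a, radii_a, centers_b, radii_b):
        # Equation (9): 0 if the circles do not intersect, growing
        # linearly to 1 as their centers coincide.
        d = torch.cdist(centers_a, centers_b)          # pairwise center distances
        r = radii_a[:, None] + radii_b[None, :]
        return (1.0 - d / r).clamp(min=0.0, max=1.0)

    def collision_term(sdv_centers, sdv_radii, agent_centers, agent_radii):
        # Equation (8): maximum intersection over all SDV/agent circle pairs;
        # differentiable, so it can serve as an auxiliary loss in Equation (6).
        return circle_intersection(sdv_centers, sdv_radii,
                                   agent_centers, agent_radii).max()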