The electric car from IIIT Hyderabad is capable of point-to-point autonomous driving over large distances while avoiding collisions. The vehicle is outfitted with a 3D LIDAR, depth cameras, GPS, and an AHRS (Attitude and Heading Reference System), which uses sensors on three axes to determine the vehicle's orientation in space. It can also take open-set natural language commands and follow them to reach its destination. The campus environment is mapped using SLAM-based point-cloud mapping, and localization while driving is handled by LIDAR-guided real-time state estimation.
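To make the localization idea concrete, here is a minimal sketch (not IIITH's actual code) of aligning a live LIDAR sweep against a prior SLAM-built point-cloud map with ICP, seeded by a coarse pose from GPS/AHRS. The file names and the identity initial guess are placeholders for illustration.

```python
# Sketch: LIDAR localization against a prior point-cloud map using Open3D ICP.
import numpy as np
import open3d as o3d

map_cloud = o3d.io.read_point_cloud("campus_map.pcd")     # prior SLAM map (hypothetical file)
scan = o3d.io.read_point_cloud("current_scan.pcd")        # latest LIDAR sweep (hypothetical file)

# Coarse initial pose, e.g. from GPS position + AHRS heading (identity used as a placeholder).
init_pose = np.eye(4)

result = o3d.pipelines.registration.registration_icp(
    scan.voxel_down_sample(0.2),          # downsample both clouds for speed
    map_cloud.voxel_down_sample(0.2),
    max_correspondence_distance=1.0,      # metres
    init=init_pose,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())

vehicle_pose = result.transformation      # 4x4 pose of the vehicle in the map frame
print("alignment fitness:", result.fitness)
print(vehicle_pose)
```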
Navigate to the Open Set
To better understand open-set navigation commands, consider how people typically navigate: they rely heavily on verbal instructions and contextual cues rather than maps. Many navigational instructions, such as “Take a right next to the white building” or “Drop off near the entrance,” depend on recognizing particular environmental landmarks or features. Autonomous driving agents, on the other hand, need accurate knowledge of their position (localization) in the environment in order to navigate well. This is usually accomplished with high-resolution GPS or with pre-built, memory- and computation-intensive High-Definition (HD) maps such as point clouds.
IIITH’s Work
To tackle this, the IIITH team takes advantage of foundation models whose general semantic grasp of the world can later be exploited for tasks like navigation and localization. This is accomplished by adding language landmarks, such as “a bench,” “an underpass,” or “football field,” to open-source topological maps (like OSM), mimicking the cognitive localization strategy people use. These landmarks make it possible for the model to navigate to locations for which it has not been explicitly trained, an “open-vocabulary” capability that yields zero-shot generalization to novel scenarios.
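The sketch below illustrates the idea of a language-tagged topological map, not IIITH's implementation: OSM-style nodes carry free-text landmark tags, and a command phrase is matched to the best node. A simple word-overlap score stands in for the learned vision-language embeddings a real open-vocabulary system would use; the node names, coordinates, and tags are made up.

```python
# Toy example: grounding a command phrase onto a landmark-tagged topological map.
from collections import Counter
import math

topo_map = {
    "n1": {"landmark": "a bench near the library",   "xy": (10.0, 4.0)},
    "n2": {"landmark": "an underpass",                "xy": (42.0, 7.5)},
    "n3": {"landmark": "football field entrance",     "xy": (80.0, 1.0)},
}

STOPWORDS = {"a", "an", "the", "and", "to", "near", "take", "stop", "right", "left"}

def bag_of_words(text):
    """Crude stand-in for a learned text encoder."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def ground_command(command):
    """Pick the map node whose landmark text best matches the command."""
    cmd = bag_of_words(command)
    scores = {nid: cosine(cmd, bag_of_words(d["landmark"])) for nid, d in topo_map.items()}
    best = max(scores, key=scores.get)
    return best, topo_map[best]["xy"]

print(ground_command("take a right and stop near the football field"))
# -> ('n3', (80.0, 1.0))
```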
Planning Differentiably
The main constituents of an autonomous navigation stack are mapping, localization, and planning. While modular pipelines and end-to-end designs have been the two conventional driving paradigms, incorporating a language modality is gradually becoming a de facto strategy for improving the explainability of autonomous driving systems. The ability to follow navigational instructions in natural language, such as “Take a right turn and stop near the food stall,” is a logical extension of these systems into the vision-language setting. The main goal is to ensure reliable, collision-free planning; a schematic of how a language instruction slots into the stack follows.
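The following is a schematic sketch of that structure in my own framing, not IIITH's code: the planner consumes the usual pose from the localizer plus a grounded instruction, and is expected to return a collision-free path. All class and field names are illustrative.

```python
# Sketch: how a language instruction plugs into a mapping -> localization -> planning stack.
from dataclasses import dataclass
from typing import Protocol, Sequence, Tuple

Pose = Tuple[float, float, float]            # x, y, heading in the map frame
Path = Sequence[Tuple[float, float]]         # sequence of 2-D waypoints

@dataclass
class Instruction:
    raw: str                                 # e.g. "Take a right turn and stop near the food stall"
    landmark: str                            # landmark phrase grounded in the map or image
    goal_xy: Tuple[float, float]             # goal location inferred from the instruction

class Localizer(Protocol):
    def pose(self) -> Pose: ...

class Planner(Protocol):
    def plan(self, pose: Pose, instruction: Instruction) -> Path:
        """Return a path to the instruction's goal with no predicted collisions."""
        ...

def navigation_step(localizer: Localizer, planner: Planner, instruction: Instruction) -> Path:
    # One cycle of the stack: localize, then plan toward the language-specified goal.
    return planner.plan(localizer.pose(), instruction)
```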
USP: VLM + NLP
To accomplish this goal, IIITH has created a lightweight vision-language model that integrates natural language processing with visual scene perception. The model processes the vehicle’s perspective view together with the encoded verbal command to estimate goal locations in one shot. These predictions, however, can violate practical constraints. When told to “park behind the red car,” for instance, the model may propose a spot that overlaps with the red car or lies in a non-road area. To overcome this difficulty, they augment the perception module with a bespoke planner embedded inside the neural network framework. This requires the planner to be differentiable so that gradients can flow through the entire architecture during training, improving both planning quality and prediction accuracy; the result is a differentiable planner trained end-to-end with the perception model.
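Below is a minimal, illustrative sketch of that pattern in PyTorch, not IIITH's actual model or planner: a tiny vision-language network predicts a 2-D goal from an image and a tokenized command, and a simple "planner" (unrolled gradient steps on a cost with smoothness, obstacle-clearance, and goal terms) refines a path. Because every planner step is an ordinary tensor operation, the training loss backpropagates through the planner into the perception network, which is the point the text makes. All shapes, costs, and dummy tensors are assumptions for illustration.

```python
# Sketch: end-to-end training of a goal-predicting VLM through a differentiable planner.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vocab_size=1000, txt_dim=64, img_dim=64):
        super().__init__()
        self.img_enc = nn.Sequential(                      # toy CNN for the camera view
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, img_dim))
        self.txt_emb = nn.Embedding(vocab_size, txt_dim)   # toy command encoder
        self.txt_enc = nn.GRU(txt_dim, txt_dim, batch_first=True)
        self.head = nn.Linear(img_dim + txt_dim, 2)        # predicted (x, y) goal in vehicle frame

    def forward(self, image, tokens):
        v = self.img_enc(image)
        _, h = self.txt_enc(self.txt_emb(tokens))
        return self.head(torch.cat([v, h[-1]], dim=-1))

def differentiable_plan(goal, obstacle, n_pts=10, iters=20, lr=0.1):
    """Unrolled optimisation of a straight-line seed into a smooth, obstacle-avoiding
    path; every step is differentiable with respect to the predicted goal."""
    alphas = torch.linspace(0, 1, n_pts).unsqueeze(-1)               # (n_pts, 1)
    traj = alphas * goal.unsqueeze(1)                                 # (B, n_pts, 2) seed path
    for _ in range(iters):
        smooth = ((traj[:, 1:] - traj[:, :-1]) ** 2).sum()            # keep the path smooth
        clearance = torch.relu(1.0 - (traj - obstacle.unsqueeze(1)).norm(dim=-1)).sum()
        endpoint = ((traj[:, -1] - goal) ** 2).sum()                  # reach the predicted goal
        cost = smooth + 5.0 * clearance + endpoint
        grad, = torch.autograd.grad(cost, traj, create_graph=True)    # keep graph for end-to-end training
        traj = traj - lr * grad                                       # one unrolled planner step
    return traj

# One training step: supervision on the planned endpoint flows back into TinyVLM.
model = TinyVLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
image = torch.randn(1, 3, 128, 128)                # dummy camera frame
tokens = torch.randint(0, 1000, (1, 8))            # dummy tokenised command
obstacle = torch.tensor([[2.0, 1.0]])              # e.g. position of "the red car"
target_goal = torch.tensor([[4.0, 2.0]])           # labelled feasible parking spot

opt.zero_grad()
goal = model(image, tokens)
traj = differentiable_plan(goal, obstacle)
loss = ((traj[:, -1] - target_goal) ** 2).mean()
loss.backward()                                    # gradients reach the VLM through the planner
opt.step()
```

The design choice worth noting is that the planner is just unrolled tensor arithmetic rather than an external black-box solver, which is what lets the goal predictor receive a training signal about whether its outputs lead to feasible, collision-free plans.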