Publication

Road Network Detection and Route Travel Time Estimation from Satellite Imagery

Bibliographic details
Abstract:	Accurately and quickly extracting road networks from high-resolution satellite images is essential for urban planning, disaster response, and autonomous navigation. Standard Convolutional Neural Networks (CNNs) perform well at semantic segmentation, but their focus on local features makes it hard for them to keep roads connected when there are obstacles like trees or building shadows. To address this, this thesis examines Hybrid Vision Transformers, focusing on SegFormer with the MiT-B3 encoder. SegFormer uses self-attention to capture global context and is a leading model in computer vision, though it has not been widely tested for mapping road networks. This transformer model was compared to a state-of-the-art CNN with dense dilated convolutions (DeepLabV3+ D3S2PP), which is designed for multi-scale context, and to the widely used ResNet50 U-Net, which serves as the main benchmark in this field for the chosen dataset. Most current evaluation methods use pixel-based metrics like Intersection over Union (IoU). However, IoU only measures how much area overlaps and does not account for road connectivity. For example, missing just one pixel can break a major road, making a route unusable, but this barely affects the IoU score. Because effective routing is fundamental in GIS, this thesis uses a complete evaluation framework that goes beyond pixel accuracy. Graph-based metrics, including Average Path Length Similarity (APLS) and the Weisfeiler-Lehman (WL) kernel, were used to directly measure how well each model preserves road structure and connectivity. In addition, a new width-based travel time metric was introduced to measure the real-world impact of topological mistakes. Experiments on the SpaceNet 3 dataset show that the Hybrid Transformer achieves superior connectivity, significantly outperforming the ResNet baseline in structured cities like Las Vegas (APLS of 0.78 vs. 0.59).
However, Transformers sometimes make confident mistakes, predicting false road connections in the background. To address gaps in road connections, a VGG19-based TopologyAware perceptual loss was added to the training process for all the evaluated models. This helped recover more road pixels (raising IoU by about 0.10 for all models), but it did not improve actual routing: APLS decreased by less than 0.05 on average. This shows that recovering missing road areas and fixing key connection gaps are separate challenges. Post-processing strategies were also tested; they improved connectivity but at times created false connections and deleted real ones. A multi-city domain generalization analysis also found a major drop in performance when models trained on structured environments were tested on new, high-density urban areas. For example, in the dense and unstructured city of Mumbai, pixel-level detection was moderate (IoU about 0.40), but graph connectivity failed completely (APLS less than 0.01). This highlights the serious impact of domain gaps caused by vertical obstructions and different spectral signatures, exposing key limitations in current transfer learning methods and pointing to the need for future research in adaptive topological road extraction.
Main authors:	Malki, El Mehdi Gassa
Subject:	Artificial Neural Network; Deep Learning; Satellite Imagery; Road Network Extraction; Remote Sensing; Semantic Segmentation; Graph Topology
Year:	2026
Country:	Portugal
Document type:	master's thesis
Access type:	open access
Associated institution:	Universidade Nova de Lisboa
Language:	English
Source:	Repositório Institucional da UNL
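
The abstract's point that a single missing pixel can break a road while barely changing IoU can be illustrated with a toy example. The sketch below is not the thesis's evaluation code: the grid size, the helper names `iou` and `count_components`, and the use of 8-connected components as a stand-in for connectivity are all assumptions made for illustration.

```python
from collections import deque


def iou(pred, truth):
    """Intersection over Union of two binary masks given as lists of lists."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    inter = sum(p & t for p, t in pairs)
    union = sum(p | t for p, t in pairs)
    return inter / union if union else 1.0


def count_components(mask):
    """Count 8-connected groups of road pixels using breadth-first search."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            components += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < h and 0 <= nx < w
                        if inside and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return components


# Hypothetical ground truth: a one-pixel-wide road crossing a 5 x 100 tile.
truth = [[1 if r == 2 else 0 for _ in range(100)] for r in range(5)]

# Hypothetical prediction: identical, except one road pixel is missed.
pred = [row[:] for row in truth]
pred[2][50] = 0

score = iou(pred, truth)               # 99 shared pixels out of 100
truth_parts = count_components(truth)  # the true road is one piece
pred_parts = count_components(pred)    # the predicted road is split in two

print(f"IoU = {score:.2f}, components: truth = {truth_parts}, pred = {pred_parts}")
```

In this toy case the IoU stays at 0.99 while the predicted road splits into two disconnected pieces, so no route can cross the gap. Graph-based metrics such as APLS are designed to penalize this kind of break, which an overlap score alone does not reveal.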
