Publication

Transformer-Based Visual Odometry and Depth Estimation for Wireless Capsule Endoscopy

Bibliographic details
Abstract:	Accurate pose and depth estimation for Wireless Capsule Endoscopy (WCE) remains a significant challenge due to the unstructured and texture-poor nature of the gastrointestinal (GI) tract. Unlike traditional endoscopy, WCE is a minimally invasive procedure that provides a comprehensive view of the GI tract through a small, ingestible capsule equipped with a camera. However, this technological advancement introduces unique challenges in accurately determining the capsule’s pose and depth, due to the absence of external localization mechanisms such as GPS or magnetic tracking. These limitations make visual odometry (VO) techniques critical for enabling autonomous navigation, 3D mapping, and localization in WCE applications. In this thesis, we investigate the use of transformer-based architectures for self-supervised monocular depth and pose estimation in WCE. Traditional visual odometry methods, which rely on feature-based techniques such as keypoint detection, optical flow, or feature matching, often struggle in the complex and deformable GI environment. These methods face significant challenges due to the low-texture surfaces, dynamic tissue deformations, specular reflections, and poor illumination conditions inherent to endoscopic imagery. To address these limitations, we use the strengths of transformer-based architectures, which have demonstrated remarkable success in various computer vision tasks due to their ability to model global and local dependencies effectively. Our first proposed framework, EndoTranSfm, employs the Pyramid Vision Transformer v2 (PVTv2) as the backbone for both pose and depth estimation. The PVTv2 architecture introduces a hierarchical structure, enabling the multi-scale feature extraction that is critical for dense prediction tasks like depth and pose estimation. EndoTranSfm utilizes a combination of losses, including a brightness-aware photometric loss, a geometry consistency loss, and a smoothness loss, to train the model in a self-supervised manner without requiring labelled ground truth data. This framework demonstrates competitive performance compared to the existing state-of-the-art EndoSfMLearner, particularly in scenarios involving repetitive textures and variable lighting conditions, where traditional methods typically fail. In the second approach, we explored the integration of the Swin Transformer, which introduces several architectural innovations. The Swin Transformer uses a shifted window-based self-attention mechanism, enabling it to efficiently handle high-resolution images while capturing both local and global contextual information. We integrated a spatial attention module into enhanced versions of the Swin-based model, which allowed the network to focus on anatomically significant regions, thereby improving the accuracy of depth and pose estimation. The frameworks are evaluated on benchmark datasets, including EndoSLAM, SimCol, and ColonDepth, which provide synthetic and real-world data for WCE applications. Comparative evaluations demonstrate the strengths of transformer-based approaches in generalizing across different datasets. Notably, these models are compared against state-of-the-art performance on various metrics, including rotational accuracy, relative translation error, and depth prediction. The results of this research reveal the potential of transformer architectures in overcoming the challenges of WCE localization. Transformer-based methods are inherently robust to complex lighting conditions, repetitive textures, and large-scale variations in the GI tract. Additionally, the use of self-supervised learning eliminates the dependency on extensive labelled datasets, which is one of the major problems in medical imaging research.
Main authors:	Nazifi, Nahid
Subject:	Transformers; Visual Odometry; Self-Supervised; Deep Learning; Pose Estimation; Depth Estimation
Year:	2025
Country:	Portugal
Document type:	doctoral thesis
Access type:	open access
Associated institution:	Universidade de Coimbra
Language:	English
Source:	Estudo Geral - Universidade de Coimbra

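The abstract names three self-supervised loss terms (photometric, geometry consistency, smoothness), but this record does not give their formulas. The sketch below shows one generic way such terms are often combined in self-supervised depth estimation. It is not the thesis's implementation: the function names, the global brightness-alignment step standing in for "brightness-aware", the single-channel image assumption, and the loss weights are all assumptions made for illustration.

```python
import numpy as np

def photometric_loss(target, warped, eps=1e-6):
    """Mean absolute error after matching the warped image's global
    brightness (mean and std) to the target. The alignment step is an
    assumed stand-in for a 'brightness-aware' term."""
    scale = target.std() / (warped.std() + eps)
    aligned = (warped - warped.mean()) * scale + target.mean()
    return float(np.abs(target - aligned).mean())

def smoothness_loss(depth, image, eps=1e-6):
    """Edge-aware smoothness: penalize depth gradients, down-weighted
    where the image itself has strong gradients."""
    norm = depth / (depth.mean() + eps)  # remove the global depth scale
    d_dx = np.abs(norm[:, 1:] - norm[:, :-1])
    d_dy = np.abs(norm[1:, :] - norm[:-1, :])
    i_dx = np.abs(image[:, 1:] - image[:, :-1])
    i_dy = np.abs(image[1:, :] - image[:-1, :])
    return float((d_dx * np.exp(-i_dx)).mean() + (d_dy * np.exp(-i_dy)).mean())

def geometry_consistency_loss(projected_depth, interpolated_depth):
    """Normalized difference between two depth estimates of the same
    scene points seen from adjacent frames (depths assumed positive)."""
    diff = np.abs(projected_depth - interpolated_depth)
    return float((diff / (projected_depth + interpolated_depth)).mean())

def total_loss(target, warped, depth, projected_depth, interpolated_depth,
               w_photo=1.0, w_geo=0.5, w_smooth=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_photo * photometric_loss(target, warped)
            + w_geo * geometry_consistency_loss(projected_depth, interpolated_depth)
            + w_smooth * smoothness_loss(depth, image=target))
```

In a real training loop, `warped` would come from reprojecting a neighbouring frame with the predicted depth and pose, and the same terms would be written in an autodiff framework so that gradients reach both networks. NumPy is used here only to keep the arithmetic easy to read.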