Publication

From source code identifiers to natural language terms

Bibliographic details
Abstract:	Program comprehension techniques often explore program identifiers to infer knowledge about programs. The relevance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white spaces or commas are not allowed). Also, programmers often use word combinations and abbreviations to devise strings that represent single or multiple domain concepts, in order to increase programming linguistic efficiency (conveying more semantics while writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces Lingua::IdSplitter, a dictionary-based algorithm for splitting and expanding strings that compose multi-term identifiers. It explores the use of general programming and abbreviation dictionaries, but also a custom dictionary automatically generated from software natural language content, which is likely to include application domain terms and specific abbreviations. This approach was applied to two software packages, written in C, achieving an f-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented.
Main author:	Carvalho, Nuno Ramos
Other authors:	Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João
Subject:	Program comprehension; Natural language processing; Identifier splitting
Year:	2015
Country:	Portugal
Document type:	article
Access type:	open access
Associated institution:	Instituto Politécnico de Bragança
Language:	English
Source:	Biblioteca Digital do IPB
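
Illustrative note: the abstract contrasts hard splitting (relying on explicit marks such as CamelCase or underscores) with dictionary-based splitting and expansion. The sketch below is a minimal Python illustration of that contrast, not the algorithm implemented by Lingua::IdSplitter. The toy dictionaries, the function names (hard_split, soft_split, split_identifier), and the "fewest terms wins" rule are all assumptions made for this example.

```python
import re

# Hypothetical toy dictionaries (assumptions for this sketch only).
WORDS = {"time", "sort", "file", "name", "get", "user"}
ABBREVIATIONS = {"str": "string", "cmp": "compare", "buf": "buffer"}


def hard_split(identifier):
    """Split on explicit marks only: underscores and case changes."""
    parts = []
    for chunk in re.split(r"_+", identifier):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts if p]


def soft_split(token):
    """Segment a mark-free token into dictionary terms, expanding abbreviations.

    Returns the segmentation with the fewest terms, or [token] if the
    dictionaries cannot cover the whole token.
    """
    known = WORDS | set(ABBREVIATIONS)
    best = [None] * (len(token) + 1)  # best[i]: fewest-term split of token[:i]
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in known:
                candidate = best[start] + [ABBREVIATIONS.get(piece, piece)]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [token]


def split_identifier(identifier):
    """Hard split first, then dictionary-split and expand each token."""
    terms = []
    for token in hard_split(identifier):
        terms.extend(soft_split(token))
    return terms
```

With these toy dictionaries, hard_split("timesort") leaves the identifier as a single token because it carries no explicit marks, while split_identifier("timesort") yields the terms "time" and "sort", and split_identifier("strcmp") expands to "string" and "compare". A custom dictionary built from a project's own documentation, as the abstract describes, would play the role of WORDS and ABBREVIATIONS here.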

_version_	1867172878714142720
author	Carvalho, Nuno Ramos
author2	Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João
author2_role	author; author; author
author_facet	Carvalho, Nuno Ramos; Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João
author_role	author
contributor_name_str_mv	Biblioteca Digital do IPB
country_str	PT
creators_json_txt	[{"Person.name":"Carvalho, Nuno Ramos"},{"Person.name":"Almeida, José João"},{"Person.name":"Henriques, Pedro Rangel"},{"Person.name":"Pereira, Maria João","Person.identifier.orcid":"0000-0001-6323-0071"}]
datacite.contributors.contributor.contributorName.fl_str_mv	Biblioteca Digital do IPB
datacite.creators.creator.creatorName.fl_str_mv	Carvalho, Nuno Ramos; Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João
datacite.date.Accepted.fl_str_mv	2015-01-01T00:00:00Z
datacite.date.available.fl_str_mv	2015-01-15T12:46:09Z
datacite.date.embargoed.fl_str_mv	2015-01-15T12:46:09Z
datacite.rights.fl_str_mv	http://purl.org/coar/access_right/c_abf2
datacite.subjects.subject.fl_str_mv	Program comprehension; Natural language processing; Identifier splitting
datacite.titles.title.fl_str_mv	From source code identifiers to natural language terms
dc.contributor.none.fl_str_mv	Biblioteca Digital do IPB
dc.creator.none.fl_str_mv	Carvalho, Nuno Ramos; Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João
dc.date.Accepted.fl_str_mv	2015-01-01T00:00:00Z
dc.date.available.fl_str_mv	2015-01-15T12:46:09Z
dc.date.embargoed.fl_str_mv	2015-01-15T12:46:09Z
dc.format.none.fl_str_mv	application/pdf
dc.identifier.none.fl_str_mv	http://hdl.handle.net/10198/11577
dc.language.none.fl_str_mv	eng
dc.publisher.none.fl_str_mv	Elsevier
dc.rights.none.fl_str_mv	http://purl.org/coar/access_right/c_abf2
dc.subject.none.fl_str_mv	Program comprehension; Natural language processing; Identifier splitting
dc.title.fl_str_mv	From source code identifiers to natural language terms
dc.type.none.fl_str_mv	http://purl.org/coar/resource_type/c_6501
dirty	0
eu_rights_str_mv	openAccess
format	article
fulltext.url.fl_str_mv	https://bibliotecadigital.ipb.pt/bitstreams/978e6fe7-2c39-4cb2-bac1-8294d07c0041/download
funding.funder.alternateName_str_mv	FCT
funding.funder.identifier_str_mv	http://doi.org/10.13039/501100001871
funding.funder.name_str_mv	Fundação para a Ciência e a Tecnologia
funding.identifier_str_mv	PEst-OE/EEI/UI0752/2014
funding.name_str_mv	6817 - DCRRNI ID
funding_str_mv	PEst-OE/EEI/UI0752/2014 info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/PEst-OE%2FEEI%2FUI0752%2F2014/PT
id	ipb_c2d1c8e32e05bc2db2ad41256d33a9d9
identifier.url.fl_str_mv	http://hdl.handle.net/10198/11577
instacron_str	ipb
institution	Instituto Politécnico de Bragança
instname_str	Instituto Politécnico de Bragança
language	eng
network_acronym_str	ipb
network_name_str	Biblioteca Digital do IPB
oai_identifier_str	oai:bibliotecadigital.ipb.pt:10198/11577
organization_str_mv	urn:organizationAcronym:ipb
person_str_mv	Carvalho, Nuno Ramos; Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João (Ciência ID: https://www.ciencia-id.pt/C912-4A49-A3B3; ORCID: http://orcid.org/0000-0001-6323-0071)
publishDate	2015
publisher.none.fl_str_mv	Elsevier
reponame_str	Biblioteca Digital do IPB
repository_id_str	urn:repositoryAcronym:ipb
service_str_mv	urn:repositoryAcronym:ipb
journal	Journal of Systems and Software
issn	0164-1212
doi	10.1016/j.jss.2014.10.013
pages	117-128
file_size	1290046 bytes
funding_project	Strategic Project - UI 752 - 2014
researcher_ids	Pereira, Maria João: Researcher ID G-5999-2011; Scopus Author ID 13907870300
contact	dspace@ipb.pt
status	SINGLETON
subject.fl_str_mv	Program comprehension; Natural language processing; Identifier splitting
title	From source code identifiers to natural language terms
topic	Program comprehension; Natural language processing; Identifier splitting
topic_facet	Program comprehension; Natural language processing; Identifier splitting
url	http://hdl.handle.net/10198/11577
visible	1
