Publication
From source code identifiers to natural language terms
| Abstract: | Program comprehension techniques often explore program identifiers to infer knowledge about programs. The importance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white spaces or commas are not allowed). Also, programmers often use word combinations and abbreviations to devise strings that represent single or multiple domain concepts, in order to increase programming linguistic efficiency (conveying more semantics while writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces Lingua::IdSplitter, a dictionary-based algorithm for splitting and expanding strings that compose multi-term identifiers. It explores the use of general programming and abbreviation dictionaries, but also a custom dictionary automatically generated from software natural language content, which is likely to include application domain terms and specific abbreviations. This approach was applied to two software packages written in C, achieving an F-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented. |
|---|---|
| Main authors: | Carvalho, Nuno Ramos |
| Other authors: | Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João |
| Subject: | Program comprehension; Natural language processing; Identifier splitting |
| Year: | 2015 |
| Country: | Portugal |
| Document type: | article |
| Access type: | open access |
| Associated institution: | Instituto Politécnico de Bragança |
| Language: | English |
| Source: | Biblioteca Digital do IPB |
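
The abstract contrasts hard splitting (cutting at explicit marks such as underscores or CamelCase boundaries) with dictionary-based splitting and abbreviation expansion. The sketch below illustrates that general idea only; it is not the Lingua::IdSplitter implementation. The function names, the tiny `WORDS` and `ABBREVIATIONS` dictionaries, and the "fewest terms" tie-breaking rule are all assumptions made for illustration.

```python
import re

# Hypothetical toy dictionaries; real ones would be far larger and could
# include a custom dictionary mined from the project's documentation.
WORDS = {"time", "sort", "list", "file", "name", "user"}
ABBREVIATIONS = {"str": "string", "cmp": "compare", "len": "length", "buf": "buffer"}


def hard_split(identifier):
    """Split on explicit marks: underscores and lower-to-upper case changes."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    return [part.lower() for part in re.split(r"[_\s]+", spaced) if part]


def soft_split(token):
    """Segment a token into the fewest dictionary terms, or return None."""
    known = WORDS | set(ABBREVIATIONS)
    best = [None] * (len(token) + 1)  # best[i]: shortest split of token[:i]
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in known:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)]


def split_and_expand(identifier):
    """Hard split first, then soft split each token and expand abbreviations."""
    terms = []
    for token in hard_split(identifier):
        pieces = soft_split(token) or [token]  # keep unknown tokens whole
        terms.extend(ABBREVIATIONS.get(piece, piece) for piece in pieces)
    return terms
```

With these toy dictionaries, `split_and_expand("timesort")` returns `["time", "sort"]` even though the identifier carries no explicit separator, and `split_and_expand("strcmp")` returns `["string", "compare"]`. A token that cannot be segmented with the available dictionaries is kept unchanged.
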
| _version_ | 1867172878714142720 |
|---|---|
| author | Carvalho, Nuno Ramos |
| author2 | Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João |
| author2_role | author; author; author |
| author_facet | Carvalho, Nuno Ramos; Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João |
| author_role | author |
| contributor_name_str_mv | Biblioteca Digital do IPB |
| country_str | PT |
| creators_json_txt | [{"Person.name":"Carvalho, Nuno Ramos"},{"Person.name":"Almeida, José João"},{"Person.name":"Henriques, Pedro Rangel"},{"Person.name":"Pereira, Maria João","Person.identifier.orcid":"0000-0001-6323-0071"}] |
| datacite.contributors.contributor.contributorName.fl_str_mv | Biblioteca Digital do IPB |
| datacite.creators.creator.creatorName.fl_str_mv | Carvalho, Nuno Ramos; Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João |
| datacite.date.Accepted.fl_str_mv | 2015-01-01T00:00:00Z |
| datacite.date.available.fl_str_mv | 2015-01-15T12:46:09Z |
| datacite.date.embargoed.fl_str_mv | 2015-01-15T12:46:09Z |
| datacite.rights.fl_str_mv | http://purl.org/coar/access_right/c_abf2 |
| datacite.subjects.subject.fl_str_mv | Program comprehension; Natural language processing; Identifier splitting |
| datacite.titles.title.fl_str_mv | From source code identifiers to natural language terms |
| dc.contributor.none.fl_str_mv | Biblioteca Digital do IPB |
| dc.creator.none.fl_str_mv | Carvalho, Nuno Ramos; Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João |
| dc.date.Accepted.fl_str_mv | 2015-01-01T00:00:00Z |
| dc.date.available.fl_str_mv | 2015-01-15T12:46:09Z |
| dc.date.embargoed.fl_str_mv | 2015-01-15T12:46:09Z |
| dc.format.none.fl_str_mv | application/pdf |
| dc.identifier.none.fl_str_mv | http://hdl.handle.net/10198/11577 |
| dc.language.none.fl_str_mv | eng |
| dc.publisher.none.fl_str_mv | Elsevier |
| dc.rights.none.fl_str_mv | http://purl.org/coar/access_right/c_abf2 |
| dc.subject.none.fl_str_mv | Program comprehension; Natural language processing; Identifier splitting |
| dc.title.fl_str_mv | From source code identifiers to natural language terms |
| dc.type.none.fl_str_mv | http://purl.org/coar/resource_type/c_6501 |
| dirty | 0 |
| eu_rights_str_mv | openAccess |
| format | article |
| fulltext.url.fl_str_mv | https://bibliotecadigital.ipb.pt/bitstreams/978e6fe7-2c39-4cb2-bac1-8294d07c0041/download |
| funding.funder.alternateName_str_mv | FCT |
| funding.funder.identifier_str_mv | http://doi.org/10.13039/501100001871 |
| funding.funder.name_str_mv | Fundação para a Ciência e a Tecnologia |
| funding.identifier_str_mv | PEst-OE/EEI/UI0752/2014 |
| funding.name_str_mv | 6817 - DCRRNI ID |
| funding_str_mv | PEst-OE/EEI/UI0752/2014 info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/PEst-OE%2FEEI%2FUI0752%2F2014/PT |
| id | ipb_c2d1c8e32e05bc2db2ad41256d33a9d9 |
| identifier.url.fl_str_mv | http://hdl.handle.net/10198/11577 |
| instacron_str | ipb |
| institution | Instituto Politécnico de Bragança |
| instname_str | Instituto Politécnico de Bragança |
| language | eng |
| network_acronym_str | ipb |
| network_name_str | Biblioteca Digital do IPB |
| oai_identifier_str | oai:bibliotecadigital.ipb.pt:10198/11577 |
| organization_str_mv | urn:organizationAcronym:ipb |
| person_str_mv | Carvalho, Nuno Ramos; Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João; Pereira, Maria João; https://www.ciencia-id.pt/C912-4A49-A3B3; C912-4A49-A3B3; http://orcid.org/0000-0001-6323-0071; 0000-0001-6323-0071 |
| publishDate | 2015 |
| publisher.none.fl_str_mv | Elsevier |
| reponame_str | Biblioteca Digital do IPB |
| repository_id_str | urn:repositoryAcronym:ipb |
| service_str_mv | urn:repositoryAcronym:ipb |
| spelling | Language: eng. Publisher: Elsevier. Format: application/pdf (1290046 bytes). Title: From source code identifiers to natural language terms. Authors: Carvalho, Nuno Ramos; Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João (DSpace: http://dspace.org/items/a20ccfa6-4e84-4c25-ab0d-8d6ba196ffc2; Ciência ID: C912-4A49-A3B3; ORCID: 0000-0001-6323-0071; Researcher ID: G-5999-2011; Scopus Author ID: 13907870300). Hosting institution: Biblioteca Digital do IPB (mailto:dspace@ipb.pt). Is part of: ISSN 0164-1212; DOI 10.1016/j.jss.2014.10.013. Dates: 2015-01-15T12:46:09Z; 2015; 2015-01-01T00:00:00Z. Handle: http://hdl.handle.net/10198/11577. Rights: http://purl.org/coar/access_right/c_abf2 (open access). Subjects: Program comprehension; Natural language processing; Identifier splitting. Funding: Fundação para a Ciência e a Tecnologia (Crossref Funder ID: http://doi.org/10.13039/501100001871); Strategic Project - UI 752 - 2014; info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/PEst-OE%2FEEI%2FUI0752%2F2014/PT; PEst-OE/EEI/UI0752/2014; 6817 - DCRRNI ID. Type: literature; http://purl.org/coar/resource_type/c_6501 (journal article). Full text: https://bibliotecadigital.ipb.pt/bitstreams/978e6fe7-2c39-4cb2-bac1-8294d07c0041/download. Journal: Journal of Systems and Software, pages 117-128. |
| spellingShingle | From source code identifiers to natural language terms; Carvalho, Nuno Ramos; Program comprehension; Natural language processing; Identifier splitting |
| status | SINGLETON |
| subject.fl_str_mv | Program comprehension; Natural language processing; Identifier splitting |
| title | From source code identifiers to natural language terms |
| title_full | From source code identifiers to natural language terms |
| title_fullStr | From source code identifiers to natural language terms |
| title_full_unstemmed | From source code identifiers to natural language terms |
| title_short | From source code identifiers to natural language terms |
| title_sort | From source code identifiers to natural language terms |
| topic | Program comprehension; Natural language processing; Identifier splitting |
| topic_facet | Program comprehension; Natural language processing; Identifier splitting |
| url | http://hdl.handle.net/10198/11577 |
| visible | 1 |