Publication

Synthetic data generation from JSON/XML schemas

Bibliographic details
Abstract: This dissertation aims to develop an application that automatically generates representative, and potentially very large, synthetic datasets directly from JSON and XML schemas, in order to facilitate software testing and scientific work in areas such as Data Science and Application Development. To this end, it develops a new version of DataGen, an online open-source application for quickly prototyping datasets through its own Domain-Specific Language (DSL) for specifying data models. DataGen parses these models and generates synthetic datasets that satisfy the stipulated structural and semantic restrictions, automating the whole data generation process with values created at runtime and/or drawn from a library of support datasets. The new product, DataGen From Schemas, is meant to expand DataGen’s use cases and raise the abstraction level of dataset specification by making it possible to generate synthetic datasets directly from schemas. The new platform builds upon the prior version and complements it: the two operate jointly and share the same data layer, which ensures that both platforms are compatible and that the DSL models created are portable between them. It parses schema files and generates the corresponding DSL models, effectively translating the JSON or XML specification into a DataGen model, and then uses the original application as middleware to generate the final datasets.
The dissertation details the entire process behind the development of this application. First, it frames the topic of study and the initial research phase, discussing relevant technologies and related work. Next, it addresses the ideation phase, presenting a suitable architecture and the reasoning behind its design choices, and surveying the technical requirements for DataGen From Schemas in light of the conclusions of the prior research. It then covers the development phase, explaining the components built, their properties and the data flow between them, for both the JSON and XML modules. Finally, it presents the conclusions drawn from the project and possible future work to improve the current solution.
Main authors: Cardoso, Hugo André Coelho
Subject: Schemas; JSON; XML; Data generation; Synthetic data; DataGen; DSL; Dataset; Grammar; Randomization; Open source; Data science; REST API; PEG.js
Year: 2022
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade do Minho
Language: English
Source: RepositóriUM - Universidade do Minho
_version_ 1866877484487671808
author Cardoso, Hugo André Coelho
author_facet Cardoso, Hugo André Coelho
author_role author
contributor_name_str_mv Ramalho, José Carlos
Universidade do Minho
country_str PT
creators_json_txt [{"Person.name":"Cardoso, Hugo André Coelho"}]
datacite.contributors.contributor.contributorName.fl_str_mv Ramalho, José Carlos
Universidade do Minho
datacite.creators.creator.creatorName.fl_str_mv Cardoso, Hugo André Coelho
datacite.date.Accepted.fl_str_mv 2022-12-19T00:00:00Z
datacite.date.available.fl_str_mv 2023-05-16T11:14:56Z
datacite.date.embargoed.fl_str_mv 2023-05-16T11:14:56Z
datacite.rights.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.format.none.fl_str_mv application/pdf
dc.identifier.none.fl_str_mv https://hdl.handle.net/1822/84498
dc.language.none.fl_str_mv eng
dc.rights.cclincense.fl_str_mv http://creativecommons.org/licenses/by-nc/4.0/
dc.rights.none.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.rights.copyright.fl_str_mv openAccess
dc.type.none.fl_str_mv http://purl.org/coar/resource_type/c_bdcc
dirty 0
eu_rights_str_mv openAccess
format masterThesis
fulltext.url.fl_str_mv https://prod-dspace.uminho.pt/bitstreams/612613a2-c689-49d2-8205-b4a238edd5b1/download
id rum_4b4bd3979aafa17cd0cdc8ec28dc747f
identifier.url.fl_str_mv https://hdl.handle.net/1822/84498
instacron_str repositorium
institution Universidade do Minho
instname_str Universidade do Minho
language eng
network_acronym_str rum
network_name_str RepositóriUM - Universidade do Minho
oai_identifier_str oai:repositorium.uminho.pt:1822/84498
organization_str_mv urn:organizationAcronym:repositorium
person_str_mv Cardoso, Hugo André Coelho
publishDate 2022
reponame_str RepositóriUM - Universidade do Minho
repository_id_str urn:repositoryAcronym:rum
service_str_mv urn:repositoryAcronym:rum
spelling Hosting institution: Universidade do Minho; e-mail: repositorium@usdb.uminho.pt; URN: urn:tid:203262549; dates: 2023-05-16T11:14:56Z, 2022-12-19, 2022-10, 2022-12-19T00:00:00Z; size: 4721790 bytes; type: literature, master thesis
status SINGLETON
url https://hdl.handle.net/1822/84498
visible 1