

Synthetic data can be artificially-generated text. Today, machine learning models allow the conception of remarkably performant natural language generation systems: you can build and train a model to generate text. In the field of natural language processing, Amazon's Alexa AI team uses synthetic data to complete the training data of its natural language understanding (NLU) system.

Synthetic data can also be synthetic video, image, or sound: you artificially render media with properties close enough to real-life data. This similarity allows using the synthetic media as a drop-in replacement for the original data. It can turn out particularly helpful if you need to augment the database of a vision recognition system, for example. For over a year now, the Waymo team has been generating realistic driving datasets from synthetic data, and the Alphabet subsidiary uses these datasets to train its self-driving vehicle systems. This way, the team can include more complex and varied scenarios instead of spending significant time and resources to obtain real observations.

Tabular synthetic data refers to artificially generated data that mimics real-life data stored in tables. It could be anything ranging from a patient database to users' analytical behavior information or financial logs. Synthetic data can function as a drop-in replacement for any type of behavioral, predictive, or transactional analysis. In the field of insurance, the Swiss company La Mobilière used synthetic data to train churn prediction models: its data science team modeled tabular synthetic data after real-life customer data that was too sensitive to use, and trained its machine learning models with the synthetic data.

Generating synthetic data comes down to learning the joint probability distribution of an original dataset in order to generate a new dataset with the same distribution. Theoretically, with a simple table and very few columns, a simplistic model of the joint distribution can be a fast and easy way to get synthetic data. However, the more complex the dataset, the more difficult it is to map dependencies correctly: the more columns you add, the more combinations appear, and at some point you simply lack the data points to learn the distribution properly. That is why we need more robust models to tackle the complexity of the data.
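To make the "simplistic model" idea concrete, here is a minimal sketch for a purely categorical table: estimate the empirical joint distribution over the observed column combinations, then resample rows from it. The column names and toy data are hypothetical, not taken from any of the examples above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Toy "original" table with two categorical columns (hypothetical data).
original = pd.DataFrame({
    "region": rng.choice(["north", "south"], size=1_000, p=[0.6, 0.4]),
    "churned": rng.choice(["yes", "no"], size=1_000, p=[0.3, 0.7]),
})

# "Learn" the joint distribution: one probability per observed combination.
joint = original.value_counts(normalize=True)

# Sample new rows from that distribution to build the synthetic table.
picks = rng.choice(len(joint), size=500, p=joint.to_numpy())
synthetic = pd.DataFrame(list(joint.index[picks]), columns=joint.index.names)

print(synthetic.value_counts(normalize=True))  # close to the original's
```

The catch is exactly the combinatorial one described above: with k columns of c categories each, there are up to c^k combinations to estimate, so this probability table quickly outgrows the available data.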
In the last few years, advancements in machine learning have put a variety of deep models in our hands that can learn a wide range of data types. Through prediction and correction, a neural network learns to reproduce the data and to generalize beyond it, producing a representation that could have originated the data. This makes neural networks particularly well-suited for synthetic data generation. Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) are two commonly used architectures in the field.

VAEs come from the field of unsupervised learning and the autoencoder family. As generative models, they are designed to learn the underlying distribution of the original data and are very efficient at generating complex data. First, an encoder network transforms the original complex distribution into a latent distribution. A decoder network then transforms the latent distribution back into the original space. This double transformation, encode then decode, appears cumbersome at first glance but is necessary to formulate a quantifiable reconstruction error. Minimizing this error is the objective of VAE training and is what turns the network into the desired transformation function, while an additional regularization objective controls the shape of the latent distribution.

[Figure: Process of generating synthetic data with VAEs]
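As an illustration, here is a minimal VAE sketch in PyTorch following the encode-sample-decode structure just described. The layer sizes, the MSE reconstruction term, and the unit KL weight are illustrative assumptions, not a prescribed implementation, and the input is assumed to be already-scaled numeric rows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 8):
        super().__init__()
        # Encoder: maps the original space to the parameters of a latent Gaussian.
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        # Decoder: maps latent samples back to the original space.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients flowing.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction error: how far the decoded rows are from the originals.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Regularization: pulls the latent distribution towards a standard normal.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Once trained, synthetic rows come from decoding draws from the latent prior.
model = VAE(n_features=10)
with torch.no_grad():
    synthetic_rows = model.decoder(torch.randn(500, 8))
```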
VAEs are a straightforward approach to the transformation problem. They are relatively easy to implement and to train. Their weak point, however, lies in their training objective: as your original data becomes more heterogeneous (e.g., a mix of categorical, binary, and continuous columns), it becomes more difficult to formulate a reconstruction error that works well on all data components. If, for example, the reconstruction error puts too much emphasis on getting the continuous parts of the data right, the quality of the categorical parts might suffer.
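A hypothetical sketch of that balancing act: a reconstruction error for mixed data that sums an MSE term for the continuous columns and a cross-entropy term for a categorical column. The weight `w_cat` is an assumed knob for illustration; tuning it is precisely what gets harder as the data grows more heterogeneous.

```python
import torch.nn.functional as F

def mixed_reconstruction_loss(cont_hat, cont_true, cat_logits, cat_true,
                              w_cat: float = 1.0):
    # Continuous columns: squared error on the reconstructed values.
    cont_term = F.mse_loss(cont_hat, cont_true, reduction="sum")
    # Categorical column: cross-entropy on the reconstructed class logits.
    cat_term = F.cross_entropy(cat_logits, cat_true, reduction="sum")
    # w_cat is the hard-to-set trade-off: too small, and the categorical
    # parts are effectively ignored in favor of the continuous ones.
    return cont_term + w_cat * cat_term
```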
GANs come from the field of unsupervised learning and the generative family. In terms of architecture, they simultaneously train two neural networks in an adversarial fashion: a generator, which produces candidate data, and a discriminator, which tries to tell generated data from real data, each trying to outperform the other.
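To make the adversarial setup concrete, here is a minimal GAN training loop sketch in PyTorch. The network sizes, optimizer settings, and the random stand-in for a real dataset are assumptions for illustration only.

```python
import torch
import torch.nn as nn

n_features, noise_dim = 10, 16
# Generator: maps random noise to candidate rows in the data space.
generator = nn.Sequential(
    nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, n_features)
)
# Discriminator: scores rows as real (1) or generated (0).
discriminator = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1)
)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_data = torch.randn(512, n_features)  # stand-in for a real dataset

for step in range(1_000):
    real = real_data[torch.randint(0, 512, (64,))]
    fake = generator(torch.randn(64, noise_dim))

    # Discriminator step: learn to tell real rows from generated ones.
    d_loss = (bce(discriminator(real), torch.ones(64, 1))
              + bce(discriminator(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: fool the discriminator into scoring fakes as real.
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

Note the two alternating objectives: the discriminator minimizes its classification error while the generator minimizes the discriminator's confidence on fakes, which is the "both trying to outperform each other" dynamic described above.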
