Update: SDV changed its license model in 2023 and is no longer open source.
As businesses attempt to extract relevant insights and build powerful machine-learning models, the need for high-quality, accurate synthetic datasets has grown. MOSTLY AI is excited to present our latest findings. In this blog post, we present the results of an experiment comparing synthetic data generated by MOSTLY AI with data generated by one of the most popular open-source synthetic data generators, SDV, and evaluate the quality of each by training a machine learning model on the resulting synthetic data.
What sets MOSTLY AI apart?
Our synthetic data generation method combines the most recent advances in Generative AI with a thorough grasp of data protection and compliance. By leveraging state-of-the-art algorithms and models, we ensure that every synthetic dataset created by MOSTLY AI maintains the statistical properties of the original data, preserving its authenticity while securing sensitive information.
During our search for ways to improve our synthetic data generation, we came across a post written by Sean Owen on the Databricks blog. The post described the use of Synthetic Data Vault (SDV) to generate synthetic datasets. We were curious to see how MOSTLY AI compares to SDV and decided to conduct a study to compare the performance of our solution to SDV’s.
The sample data
In our evaluation process, we followed a systematic approach. First, we acquired the dataset mentioned in the article, ensuring we had a reliable benchmark for comparison. The data is available in Databricks at ‘/databricks-datasets/nyctaxi/tables/nyctaxi_yellow’. It is the well-known NYC Taxi dataset, which contains over a decade of basic information about taxi trips in New York City, such as pickup and drop-off locations, distances, fares, tolls, and tips.
Next, we employed both Synthetic Data Vault (SDV) and MOSTLY AI’s synthetic data generator, training each on 80% of the dataset so as to capture its characteristics and patterns accurately. To establish a fair evaluation, we set aside the remaining 20% as a holdout for testing and validation purposes. This step allowed us to thoroughly assess the performance of our synthetic dataset against SDV’s results.
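The train/holdout split described above can be sketched in a few lines. The snippet below is an illustrative pure-Python version, not the exact pipeline used in the experiment; the dummy trip records and column names are assumptions for demonstration only (the real data lives in ‘/databricks-datasets/nyctaxi/tables/nyctaxi_yellow’):

```python
import random

def train_holdout_split(rows, train_frac=0.8, seed=42):
    """Shuffle the rows and split them into a training portion
    (used to fit the synthesizers) and a holdout portion
    (reserved strictly for evaluation)."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the input list stays untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Dummy taxi-trip records standing in for the real NYC Taxi data
trips = [{"trip_distance": d, "tip_amount": d * 0.2} for d in range(100)]
train, holdout = train_holdout_split(trips)
print(len(train), len(holdout))  # 80 20
```

Keeping the holdout completely out of the synthesizers’ reach is what makes the downstream comparison fair: both generators are evaluated on data neither of them has ever seen.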
Synthetic data – Quality evaluation
To evaluate the quality and accuracy of the synthetic data generated by both MOSTLY AI and SDV, we employed two different measurement metrics. According to the MOSTLY AI QA report, our synthetic dataset achieved an accuracy of 96%. In contrast, SDV’s performance was measured at 40% accuracy, highlighting a significant disparity in the results. Additionally, when assessing the quality scores using SDV’s Quality Report, MOSTLY AI’s synthetic dataset received a rating of 97%, which indicates high adherence to real-world distributions and statistical characteristics. SDV achieved a quality score of 77%.
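Accuracy scores of this kind are typically derived by comparing the empirical distributions of the synthetic columns against the original ones. As a simplified, hypothetical illustration (this is not the exact formula behind either QA report), a univariate version can be computed as one minus the total variation distance between binned histograms:

```python
from collections import Counter

def univariate_accuracy(real, synthetic, bins=10, lo=0.0, hi=10.0):
    """Simplified distributional accuracy: bin both columns into
    equal-width histograms, then report 1 - total variation distance.
    A score of 1.0 means the two distributions match exactly."""
    def histogram(values):
        width = (hi - lo) / bins
        counts = Counter()
        for v in values:
            b = min(int((v - lo) / width), bins - 1)  # clamp v == hi into last bin
            counts[b] += 1
        total = len(values)
        return {b: c / total for b, c in counts.items()}

    h_real, h_syn = histogram(real), histogram(synthetic)
    tvd = 0.5 * sum(abs(h_real.get(b, 0.0) - h_syn.get(b, 0.0)) for b in range(bins))
    return 1.0 - tvd

real_tips = [1.0, 2.0, 2.5, 3.0, 7.0]
print(round(univariate_accuracy(real_tips, real_tips), 2))  # 1.0
```

A full QA report aggregates such per-column (and per-column-pair) scores across the whole dataset, which is what the 96% vs. 40% figures above summarize.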
Evaluation by building an ML model
In the final stage of our evaluation, we constructed a regression model using LightGBM, mirroring the methodology employed in the referenced blog post. Essentially, the aim is to build a regression model that predicts the tip amount a customer is likely to give the taxi driver. The holdout set served as the test bed for assessing the predictive performance of the models trained on the original dataset, as well as on the synthetic datasets generated by MOSTLY AI and SDV. Notably, the original data achieved an RMSE (Root Mean Square Error) of 0.99, demonstrating its strong predictive capability. The synthetic dataset produced by MOSTLY AI followed closely with an RMSE of 1.00, which affirms its ability to approximate the original data distribution accurately. In contrast, the SDV synthetic dataset yielded a higher RMSE of 1.64, indicating a larger deviation from the original dataset’s predictive performance.
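For readers unfamiliar with the metric, RMSE is the square root of the average squared prediction error; an RMSE on synthetic-trained models close to the original-trained RMSE means the synthetic data preserved the signal the model needs. Here is a minimal sketch of the computation (the toy tip values are invented for illustration; the actual experiment trained a LightGBM regressor on the taxi features):

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error: sqrt of the mean squared difference
    between actual and predicted values. Lower is better."""
    assert len(y_true) == len(y_pred), "inputs must be the same length"
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

# Toy example: every tip prediction is off by exactly 1.0
actual_tips = [2.0, 3.5, 0.0, 5.0]
predicted   = [3.0, 4.5, 1.0, 6.0]
print(rmse(actual_tips, predicted))  # 1.0
```

Comparing the three RMSE values on the same holdout set (0.99 original, 1.00 MOSTLY AI, 1.64 SDV) is what makes the comparison apples-to-apples: the only variable that changes between runs is the training data.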
In comparison to the results reported in the blog post, where an RMSE of 1.52 was achieved, our evaluation showcases significant improvement. With an RMSE of 1.00, the synthetic dataset generated by MOSTLY AI demonstrates much better performance and comes remarkably close to the accuracy of the original data. We also conducted experiments using SDV’s more advanced algorithm, TVAE, which resulted in an RMSE of 1.06. Although SDV’s TVAE algorithm performed competitively, our synthetic data outperformed it.
In our evaluation comparing synthetic datasets generated by MOSTLY AI and SDV, it is evident that MOSTLY AI’s solution surpasses the competition in terms of accuracy and quality. With our synthetic dataset achieving an RMSE of 1.00, closely approaching the performance of the original data, we have demonstrated the high precision and fidelity of our synthetic data generation capabilities. Notably, our synthetic data outperformed both SDV’s standard algorithm and its more advanced TVAE algorithm.
By leveraging synthetic data, organizations can benefit from a multitude of advantages. Firstly, the high accuracy and quality of our synthetic datasets ensure reliable model training and testing, enabling data scientists to develop robust machine-learning models without relying solely on the original data. Secondly, synthetic data minimizes privacy concerns, as sensitive information is replaced with synthesized yet statistically representative values. This enables organizations to comply with stringent data privacy regulations while still harnessing the power of data-driven insights.
As always, we are more than happy to introduce you to our platform. Get hands-on with synthetic data generation and register an account to generate 100K rows of synthetic data daily for free. If you would like to use MOSTLY AI’s synthetic data generator in an enterprise environment, get in touch and we’ll be happy to help!