Comparing Synthetic Data Quality: MOSTLY AI vs. Synthetic Data Vault (SDV)
As organizations strive to derive valuable insights and build robust machine learning models, the demand for high-quality synthetic datasets continues to rise. MOSTLY AI is excited to share our latest findings in this blog post, where we’ll compare the synthetic data generated by MOSTLY AI against that produced by the popular open-source generator, Synthetic Data Vault (SDV). We will evaluate the quality of synthetic data by building machine learning models using both datasets.
What Sets MOSTLY AI Apart?
Our approach to synthetic data generation integrates cutting-edge Generative AI techniques with a deep understanding of data privacy and compliance regulations. We ensure that every synthetic dataset produced maintains the statistical integrity of the original data, thereby protecting sensitive information. By utilizing state-of-the-art algorithms and models, MOSTLY AI delivers high-quality synthetic data.
Recently, we explored enhancements to our synthetic data generation system after reading a post by Sean Owen on the Databricks blog about SDV. Curious about how MOSTLY AI compared to SDV, we decided to perform a study to evaluate the effectiveness of our solution.
The Sample Data
To conduct our evaluation, we needed a reliable benchmark. We used the NYC Taxi dataset, which is publicly available through Databricks at /databricks-datasets/nyctaxi/tables/nyctaxi_yellow. This dataset comprises basic information on taxi rides in New York City over a decade, including pickup and drop-off locations, distances, fares, tolls, and tips.
Both MOSTLY AI and SDV generated synthetic data from 80% of this dataset, with the aim of preserving its characteristics and patterns. The remaining 20% served as a holdout for testing and validation, allowing us to assess the performance of the two synthetic datasets against each other.
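The split described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the column names are hypothetical stand-ins for the taxi table's fields, and a toy DataFrame replaces the real data at the Databricks path.

```python
import numpy as np
import pandas as pd

def train_holdout_split(df, holdout_frac=0.2, seed=42):
    """Shuffle row indices and carve off a holdout fraction.

    The study synthesized from 80% of the data and kept 20% as a
    holdout; this helper reproduces that split on any DataFrame.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))
    n_holdout = int(len(df) * holdout_frac)
    holdout = df.iloc[idx[:n_holdout]].reset_index(drop=True)
    train = df.iloc[idx[n_holdout:]].reset_index(drop=True)
    return train, holdout

# Toy stand-in for the taxi table (illustrative columns only).
df = pd.DataFrame({
    "trip_distance": np.arange(100, dtype=float),
    "tip_amount": np.arange(100, dtype=float) * 0.15,
})
train, holdout = train_holdout_split(df)
print(len(train), len(holdout))  # 80 20
```

The training portion is what each generator would see; the holdout never touches the synthesis step, so it can serve as an unbiased test set later.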
Evaluating Synthetic Data Quality
To gauge the quality of the synthetic data produced by both MOSTLY AI and SDV, we employed two different metrics. According to our quality assurance report, the synthetic dataset from MOSTLY AI achieved an impressive accuracy of 96%, while SDV performed at only 40%. This stark contrast highlights the superior results obtained with MOSTLY AI’s approach.
Furthermore, using SDV’s Quality Report, we discovered that our synthetic dataset received a quality score of 97%, indicating a high degree of fidelity to real-world distributions and statistical characteristics, whereas SDV achieved a score of 77%.
Model Evaluation
In the final phase of our analysis, we constructed a regression model using LightGBM, similar to the methodology referenced in the initial blog post. The goal was to predict the tip amount likely to be given by a passenger to the taxi driver. The holdout dataset was used to evaluate the predictive performance of models trained on the original dataset alongside those trained on the synthetic datasets generated by both MOSTLY AI and SDV.
The original dataset produced a Root Mean Square Error (RMSE) of 0.99, demonstrating strong predictive capabilities. The synthetic dataset generated by MOSTLY AI closely followed, achieving an RMSE of 1.00 and showcasing its ability to accurately mirror the original dataset. In contrast, the SDV synthetic dataset yielded an RMSE of 1.64, indicating a significant deviation from the original dataset’s predictive performance.
The earlier blog post reported an RMSE of 1.52 for SDV's synthetic data, so our evaluation indicates considerable improvement. With an RMSE of 1.00, MOSTLY AI's synthetic dataset closely approached the accuracy of the original data and also outperformed SDV's more advanced algorithm, TVAE, which achieved an RMSE of 1.06 in that post.
Conclusion
In our analysis comparing synthetic datasets generated by MOSTLY AI and SDV, it is clear that MOSTLY AI’s solution excels in both accuracy and quality. Our synthetic dataset, with an RMSE of 1.00, closely approximates the original data, demonstrating our high precision and fidelity in synthetic data generation. Notably, our output outperformed both SDV’s standard and advanced TVAE algorithms.
The advantages of synthetic data are numerous. Not only does our high-quality synthetic dataset ensure reliable training and testing for machine learning models, but it also alleviates privacy concerns. By replacing sensitive information with statistically representative values, organizations can comply with stringent data privacy regulations while leveraging the benefits of data-driven insights.
We invite you to explore our platform further. Register today to generate up to 100,000 rows of synthetic data daily for free. Join our upcoming webinar to see the platform in action and gain access to our free Colab notebook.