#syntheticdata

I don't think people understand synthetic data. Sure, some people get that with human-generated data and imitative models you can only asymptotically approach human-level performance.

And natural data is not necessarily the best data to train AIs with, if you consider the density and fidelity of the knowledge and task-related skills it presents.

What if you train your model with natural data, and it still makes errors when deployed? What lever do you pull? Collect more natural data and hope for the best? There has never been a satisfying and scalable answer to this.

What if you add a small indirection, and use synthetic data instead? You have instructions or conditioning data that you use to produce your synthetic training corpora. You can trivially incorporate these error cases into your synthetic data generator!

You then actually have the levers you need to make the errors disappear, without having to hit your head against an immovable object, real data, repeatedly.
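A minimal Python sketch of that lever, under loud assumptions: `SyntheticGenerator`, `add_error_case`, and `build_prompts` are hypothetical names invented for illustration, and the prompts are only built here, not sent to any model. The point is just that a deployment error becomes one more generation instruction.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SyntheticGenerator:
    """Hypothetical generator: holds the instructions that drive synthetic data."""

    task: str
    instructions: list[str] = field(default_factory=list)

    def add_error_case(self, bad_input: str, expected: str) -> None:
        # The lever: an error seen in deployment becomes a new instruction.
        self.instructions.append(
            f"Include examples like {bad_input!r}, "
            f"where the correct output is {expected!r}."
        )

    def build_prompts(self, per_instruction: int = 2) -> list[str]:
        # One prompt per (instruction, variant). A real system would pass
        # these to a data-writing model; that step is left out on purpose.
        return [
            f"Task: {self.task}\nInstruction: {inst}\nVariant: {i + 1}"
            for inst in self.instructions
            for i in range(per_instruction)
        ]


gen = SyntheticGenerator(
    task="Extract the invoice total",
    instructions=["Cover common invoice layouts."],
)
# Illustrative error case (made up): a European number format the model got wrong.
gen.add_error_case("Total due: 1.234,50 EUR", "1234.50")
prompts = gen.build_prompts()
```

The next generation run then covers the failure explicitly, with no new real-world data collection needed.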

On top of this, you can produce your synthetic data generation instructions from real data while sidestepping the whole personally identifiable data issue: you only extract the meaningful knowledge from the real data in an enriched form, instead of blindly doing the censor work of a last-century East German bureaucrat on massive volumes of irrelevant data.

Make your AIs write textbooks on the tasks you want them to master. Make them synthesize training data based on these textbooks. You can then handle the errors better and you don't need to worry about leaking personal data. After all, that is how humans master skills as well.

Can AI Be Trained on Data Generated by Other AI? Exploring the Potential and Pitfalls of Synthetic Training Data
AI-generated training data is revolutionizing AI model training! Synthetic data simulates real-world scenarios, offering a more efficient approach. Companies like Anthropic are already using it. Learn more about this exciting new frontier! #SyntheticData #AIGeneration #AItraining #DataScience #MachineLearning #FutureofAI
tech-champion.com/data-science...

A Field Guide to Rapidly Improving AI Products – O’Reilly

This article subverts traditional tools-centric AI development by revealing how a focus on qualitative error analysis can uncover actionable, domain-specific weaknesses.

Its analysis addresses both strategic and operational challenges while acknowledging the evolution of evaluation criteria in AI systems.

oreilly.com/radar/a-field-guid

O’Reilly Media · A Field Guide to Rapidly Improving AI Products: Evaluation Methods, Data-Driven Improvement, and Experimentation Techniques from 30+ Production Implementations

Can anyone advise on something #ai? We are looking for a way to generate synthetic image data from existing images, on the order of a few tens of thousands of iterations. Any suggestions for a product / service or small model that might work? Thank you! #research #syntheticdata

Synthetic data generation with GPT-4o was a game changer for us. By creating datasets with common misspellings and syntactic variations, we were able to enhance the robustness of our search models significantly. This crucial step ensured that our AI models could handle a variety of real-world inputs seamlessly. #SyntheticData #Innovation
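For a rough idea of what such a dataset looks like, here is a minimal Python sketch. It is not the GPT-4o approach from the post: it swaps in simple rule-based typos (swap, drop, double a letter), and `misspell` and `make_pairs` are hypothetical helper names. It assumes non-empty queries.

```python
from __future__ import annotations

import random


def misspell(word: str, rng: random.Random) -> str:
    """Apply one random typo: swap neighbours, drop a letter, or double a letter."""
    if len(word) < 3:
        return word  # too short to perturb meaningfully
    i = rng.randrange(len(word) - 1)
    kind = rng.choice(["swap", "drop", "double"])
    if kind == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if kind == "drop":
        return word[:i] + word[i + 1:]
    return word[:i] + word[i] + word[i:]  # double the letter at position i


def make_pairs(
    queries: list[str], variants: int = 3, seed: int = 0
) -> list[tuple[str, str]]:
    """Return (noisy_query, clean_query) training pairs."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    pairs = []
    for query in queries:
        for _ in range(variants):
            words = query.split()
            j = rng.randrange(len(words))
            words[j] = misspell(words[j], rng)
            pairs.append((" ".join(words), query))
    return pairs


pairs = make_pairs(["wireless headphones", "running shoes"])
for noisy, clean in pairs:
    print(f"{noisy!r} -> {clean!r}")
```

Swapping two identical neighbouring letters leaves a word unchanged, so a few pairs can come out identical; filter those out if every example must contain a typo.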