Ask HN: Is synthetic data generation practical outside academia?

4 points | by cpard 14 hours ago

5 comments

publicdaniel an hour ago

I’m currently working on a document parsing engine for a specific type of document. The inputs are usually PDFs. I’m able to get great structured output from both the latest Gemini Flash models and the latest Llama Scout models. The best latency I get with Gemini is about 5 seconds end to end. With Llama hosted on Groq it’s about 3 seconds.
My use case is latency-constrained, so I’m exploring fine-tuning / distilling to see if I can get latency under a second. I imagine these are the kinds of scenarios where it’s still worth it to fine-tune and distill.
My plan is to generate a lot of synthetic training data using more capable but slower foundation models, and use that data to train the smaller model.
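A rough sketch of what that plan could look like, assuming the teacher returns JSON as a string. The names `build_distillation_set` and `fake_teacher` and the field names `doc_type` and `number` are made up for illustration; `fake_teacher` stands in for a real call to a slow, capable model. The idea is to keep only teacher outputs that parse and contain the expected keys, so the smaller model is not trained on malformed labels.

```python
import json
from typing import Callable, Iterable


def build_distillation_set(
    documents: Iterable[str],
    teacher: Callable[[str], str],
    required_keys: set,
) -> list:
    """Label raw documents with a teacher model, keeping only well-formed outputs."""
    records = []
    for text in documents:
        raw = teacher(text)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # teacher did not return valid JSON; skip this sample
        if not isinstance(parsed, dict) or not required_keys.issubset(parsed):
            continue  # valid JSON, but missing fields the student must learn
        records.append(
            {"prompt": text, "completion": json.dumps(parsed, sort_keys=True)}
        )
    return records


def fake_teacher(text: str) -> str:
    # Hypothetical stand-in for the slow foundation model (assumption, not a real API).
    if "Invoice" not in text:
        return "not json"
    number = text.split("#")[1].split()[0]
    return json.dumps({"doc_type": "invoice", "number": number})


docs = ["Invoice #123 total 40 USD", "garbled scan", "Invoice #77 total 9 USD"]
dataset = build_distillation_set(docs, fake_teacher, {"doc_type", "number"})

# One JSON object per line is a common format for fine-tuning data.
for record in dataset:
    print(json.dumps(record))
```

Filtering at generation time is a design choice: it trades a smaller dataset for cleaner labels, which usually matters more when the student is much smaller than the teacher.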
publicdaniel 2 hours ago

It’s really useful for generating synthetic data for search and recommendations that you can use to train a smaller / faster model. This is especially useful if you don’t have lots of click-through data, or in cold-start scenarios. There are some good articles that cover this; if you’re interested, I’ll try to find them and share.
sargstuff 12 hours ago

Non-AI-specific 'synthetic data generation':
1) Historically used for processes which make use of time series / simulations & modeling / forecasting, e.g. weather forecasting; see related points in [0].
2) a) Testing with actual 'sensitive' data may not be possible for security reasons (e.g. payroll information, stock market price influences)[1]. b) Insufficient/incomplete information, i.e. figure out how well what's known matches 'reality', and/or suggest areas to look for 'missing' pieces in the model.
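As a minimal illustration of point 1, a simulated series can be as simple as trend plus seasonality plus Gaussian noise. The function name `synthetic_series` and every parameter value below are assumptions picked for the example, not anything taken from [0]. The same kind of stand-in series can also be used for testing when the real data is off limits, as in point 2a.

```python
import math
import random


def synthetic_series(n, period=24, trend=0.05, amplitude=3.0, noise_sd=0.5, seed=0):
    """Return n points of linear trend + sinusoidal seasonality + Gaussian noise."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    return [
        trend * t
        + amplitude * math.sin(2 * math.pi * t / period)
        + rng.gauss(0.0, noise_sd)
        for t in range(n)
    ]


# Three "days" of hourly readings with a daily cycle.
series = synthetic_series(72)
print(len(series), min(series), max(series))
```

Because the generating process is known, a forecasting model fitted to this series can be checked against the true trend and period, which is not possible with real observations.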
-----
[0] : https://www.oreilly.com/library/view/practical-time-series/9...
[1] : https://www.k2view.com/what-is-synthetic-data-generation/

- cpard 12 hours ago
  
  This is great. Synthetic data has been around for a long time. I think the difference with LLM-related cases is that in the past it was primarily structured data, which was a bit easier to approximate with some distribution or some grammar.
  With synthetic data for large language models, it’s more about QA pairs and reasoning traces for solving complicated problems.
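To make the "some grammar" case concrete, here is a small sketch of grammar-based generation for structured records. The grammar, the field values, and the names `GRAMMAR` and `expand` are all invented for illustration.

```python
import random

# Toy context-free grammar: each non-terminal maps to a list of productions,
# and each production is a list of symbols. Anything not in the dict is a terminal.
GRAMMAR = {
    "<record>": [["<name>", ",", "<dept>", ",", "<salary>"]],
    "<name>": [["alice"], ["bob"], ["carol"]],
    "<dept>": [["eng"], ["ops"]],
    "<salary>": [["<digit>", "<digit>", "000"]],
    "<digit>": [[d] for d in "123456789"],
}


def expand(symbol, rng):
    """Recursively expand a symbol by picking one production at random."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    production = rng.choice(GRAMMAR[symbol])
    return "".join(expand(s, rng) for s in production)


rng = random.Random(42)  # seeded for reproducibility
rows = [expand("<record>", rng) for _ in range(5)]
for row in rows:
    print(row)
```

Every output is valid by construction, which is what made structured data comparatively easy to fake. QA pairs and reasoning traces have no such compact grammar, so they typically need a model to generate them.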
  
  - sargstuff 4 hours ago
    
    Non-physics Much Ado about Schrödinger's Cat. Just tool(s) for quickly building higher-order associations/abstractions from 'base term information' [1][2][3]. aka dynamically generate unique Catalan number(s) for a given Tromp lambda calculus as a way of reducing tree height / Lisp parentheses down to a single pair, while dynamically computing/recomputing the determinant (appropriate base / number-symbols ratio) to minimize the length between parentheses.
    ----------------
    [1] : I told AI to make me a protein. Here's what it came up with : https://www.nature.com/articles/d41586-025-01586-y
    [2] : AI Models for Protein Structure Prediction : https://frontlinegenomics.com/ai-models-for-protein-structur...
    [3] : AI model deciphers the code in proteins that tells them where to go : https://news.mit.edu/2025/ai-model-deciphers-code-proteins-t...