I’ve seen more companies claim “revolutionary AI” than I can count. As an editor who reads product launches, investor decks, and sometimes the source code when possible, I’ve learned the same pattern repeats: shiny demos, bold numbers, and—too often—missing rigor. If you’re a journalist or an investor evaluating a startup’s AI claim, you don’t need a PhD or access to proprietary datasets. You need a set of repeatable, practical experiments that reveal truth behind the marketing.
Why five simple experiments?
I’ve boiled down what I do into five experiments that balance speed, cost, and diagnostic power. Each one flags different dimensions: accuracy, robustness, transparency, cost/scale, and safety. Run them with a few hours of work and a handful of prompts or scripts, and you’ll be in a much better position to ask hard questions or to decide whether to dig deeper.
Experiment 1 — Reproducibility and cherry-picking check
Claim: “Model X achieves 98% accuracy on Y.” My first question is: can you reproduce it with the information provided?
What I track: the delta between reported and reproduced numbers, confidence intervals, and whether the company’s demo uses the same pipeline. If they cite a “confidential” dataset, ask for a stratified sample to verify performance across slices.
Experiment 2 — Edge-case and adversarial prompts
Models often look great on curated examples but fail on real-world inputs. I create an adversarial suite to probe brittleness.
What I track: percent of hallucinated facts, incoherent or logically inconsistent answers, and how often the model refuses to answer vs. fabricates. A truly robust system should degrade gracefully rather than oscillate wildly.
Experiment 3 — Freshness, data provenance, and memorization
Startups sometimes train on scraped web data and claim the model “knows” facts. I test whether the model memorizes or cites sources.
What I track: proportion of answers with verifiable citations, whether citations are fabricated, and whether the model acknowledges uncertainty. If the model fabricates plausible-sounding sources, that’s a major integrity problem.
Experiment 4 — Scale, latency, and cost per query
“Production-ready” isn’t just accuracy—it’s latency and cost under load. I simulate real-world usage to evaluate scalability.
What I track: latency distribution, error rates when scaling, and the infrastructure assumptions (cloud GPUs, specialized chips). A “revolutionary” model that only runs on an army of A100s at zero-latency claims needs scrutiny—what’s the TCO for customers?
Experiment 5 — Safety, bias, and red-team probes
Safety and bias are non-negotiable. I run a focused red-team assessment tailored to the startup’s domain.
What I track: percentage of prompts that produce unsafe content, success rate of jailbreaks, and whether documented safety mitigations exist and are effective.
Quick checklist and metrics table
| Experiment | Key metric(s) | Red flag |
| Reproducibility | Delta vs. reported accuracy, variance | No access to evaluation pipeline or huge unreproducible gaps |
| Adversarial prompts | Hallucination rate, failure modes | High brittleness on small perturbations |
| Provenance | Verifiable citations, memorization tests | Fabricated sources, unclear data provenance |
| Scale & cost | p95/p99 latency, cost per 1k queries | Unrealistic infra claims or unscalable costs |
| Safety | Jailbreak success rate, unsafe outputs | No documented mitigations or frequent unsafe outputs |
After running these experiments, I usually have a list of targeted questions to send back: Can you reproduce these numbers under these conditions? What portion of customer queries fall into the edge cases we found? How do you log and correct model errors in production? Investors and journalists should expect concrete answers, not marketing speak.
Finally, use public benchmarks and community tools as complements—not replacements—for these tests. Sites and tools like Hugging Face model cards, the BigScience evaluation suites, or independent audits can add context, but nothing replaces a few hands-on experiments tailored to the startup’s claims.
If you want, I can draft a short evaluation script (Python + requests) you can run against an API endpoint to automate reproducibility and latency tests. Tell me what platform you’ll use (OpenAI, Anthropic, self-hosted) and I’ll prepare something practical you can execute in under an hour.