Tech

How to test a startup's claim of 'revolutionary' ai with five simple experiments any journalist or investor can run

How to test a startup's claim of 'revolutionary' ai with five simple experiments any journalist or investor can run

I’ve seen more companies claim “revolutionary AI” than I can count. As an editor who reads product launches, investor decks, and sometimes the source code when possible, I’ve learned the same pattern repeats: shiny demos, bold numbers, and—too often—missing rigor. If you’re a journalist or an investor evaluating a startup’s AI claim, you don’t need a PhD or access to proprietary datasets. You need a set of repeatable, practical experiments that reveal truth behind the marketing.

Why five simple experiments?

I’ve boiled down what I do into five experiments that balance speed, cost, and diagnostic power. Each one flags different dimensions: accuracy, robustness, transparency, cost/scale, and safety. Run them with a few hours of work and a handful of prompts or scripts, and you’ll be in a much better position to ask hard questions or to decide whether to dig deeper.

Experiment 1 — Reproducibility and cherry-picking check

Claim: “Model X achieves 98% accuracy on Y.” My first question is: can you reproduce it with the information provided?

  • Ask the company for the exact evaluation dataset, preprocessing steps, model checkpoints, random seeds, and evaluation scripts.
  • If they won’t share, try to reproduce their claim using public datasets that are commonly used for the task (e.g., SQuAD, MNLI, ImageNet variants). Note differences in preprocessing—tokenization, truncation, sampling—that can swing results.
  • Run the same evaluation multiple times with different random seeds and compute variance. High claimed accuracy with huge variance is a red flag.
  • What I track: the delta between reported and reproduced numbers, confidence intervals, and whether the company’s demo uses the same pipeline. If they cite a “confidential” dataset, ask for a stratified sample to verify performance across slices.

    Experiment 2 — Edge-case and adversarial prompts

    Models often look great on curated examples but fail on real-world inputs. I create an adversarial suite to probe brittleness.

  • Design a short list (50–200) of edge cases: out-of-distribution inputs, misspellings, code-switching, rare names, and intentionally ambiguous queries.
  • Include adversarial prompts that have fooled other models—prompt injection attempts, contradictory instructions, or tasks that require long-range reasoning.
  • Compare outputs to a baseline model such as GPT-4, Claude, or an open model (Llama 2). Note differences in hallucination rate, confidence, and failure modes.
  • What I track: percent of hallucinated facts, incoherent or logically inconsistent answers, and how often the model refuses to answer vs. fabricates. A truly robust system should degrade gracefully rather than oscillate wildly.

    Experiment 3 — Freshness, data provenance, and memorization

    Startups sometimes train on scraped web data and claim the model “knows” facts. I test whether the model memorizes or cites sources.

  • Ask for a model card, data sources, and any copyright/licensing controls used during training.
  • Pose questions about recent events (e.g., events that occurred within the last 6–12 months) and ask the model explicitly to cite sources, including URLs and publication names.
  • Check for memorization by querying for unusual strings or sequences commonly found in dataset leaks (unique sentences from public datasets).
  • What I track: proportion of answers with verifiable citations, whether citations are fabricated, and whether the model acknowledges uncertainty. If the model fabricates plausible-sounding sources, that’s a major integrity problem.

    Experiment 4 — Scale, latency, and cost per query

    “Production-ready” isn’t just accuracy—it’s latency and cost under load. I simulate real-world usage to evaluate scalability.

  • Run a load test: send concurrent queries at the rate realistic for their claimed use case (e.g., 10–100 QPS for a customer-facing assistant).
  • Measure median and tail latencies (p50, p95, p99) and track error rates under stress.
  • Estimate cost per 1,000 queries using their pricing or, for on-prem models, estimate GPU hours and memory footprint. Ask how the cost scales with model size and throughput.
  • What I track: latency distribution, error rates when scaling, and the infrastructure assumptions (cloud GPUs, specialized chips). A “revolutionary” model that only runs on an army of A100s at zero-latency claims needs scrutiny—what’s the TCO for customers?

    Experiment 5 — Safety, bias, and red-team probes

    Safety and bias are non-negotiable. I run a focused red-team assessment tailored to the startup’s domain.

  • Create prompts that elicit toxic, biased, or unsafe responses relevant to the product (e.g., medical, legal, political scenarios).
  • Check mitigation: does the model decline, provide safe alternatives, or give misleading/unsafe answers? Test adversarial jailbreak prompts to see how quickly safety measures fail.
  • If the model makes sensitive claims (medical diagnosis, legal advice), verify whether the company has human-in-the-loop safeguards and documented incident response policies.
  • What I track: percentage of prompts that produce unsafe content, success rate of jailbreaks, and whether documented safety mitigations exist and are effective.

    Quick checklist and metrics table

    Experiment Key metric(s) Red flag
    Reproducibility Delta vs. reported accuracy, variance No access to evaluation pipeline or huge unreproducible gaps
    Adversarial prompts Hallucination rate, failure modes High brittleness on small perturbations
    Provenance Verifiable citations, memorization tests Fabricated sources, unclear data provenance
    Scale & cost p95/p99 latency, cost per 1k queries Unrealistic infra claims or unscalable costs
    Safety Jailbreak success rate, unsafe outputs No documented mitigations or frequent unsafe outputs

    After running these experiments, I usually have a list of targeted questions to send back: Can you reproduce these numbers under these conditions? What portion of customer queries fall into the edge cases we found? How do you log and correct model errors in production? Investors and journalists should expect concrete answers, not marketing speak.

    Finally, use public benchmarks and community tools as complements—not replacements—for these tests. Sites and tools like Hugging Face model cards, the BigScience evaluation suites, or independent audits can add context, but nothing replaces a few hands-on experiments tailored to the startup’s claims.

    If you want, I can draft a short evaluation script (Python + requests) you can run against an API endpoint to automate reproducibility and latency tests. Tell me what platform you’ll use (OpenAI, Anthropic, self-hosted) and I’ll prepare something practical you can execute in under an hour.

    You should also check the following news:

    What to ask your council this month to stop a planned library closure and win community support
    Politics

    What to ask your council this month to stop a planned library closure and win community support

    I found out my local council was planning to close our public library the way most of us do now: a...

    Which home ev charger grant applies to your uk postcode and how to claim it without costly mistakes
    Lifestyle

    Which home ev charger grant applies to your uk postcode and how to claim it without costly mistakes

    I’m often asked by friends, neighbours and readers which home EV charger grant applies to their...