
Kiran Shahid reported on May 29, 2026, that hours of hands-on testing revealed major inconsistencies among popular AI content detectors, a finding that undermines their reliability for judging authorship and could have real consequences for educators, editors and clients. Her experience, along with a small controlled experiment, showed that the same text can be judged very differently depending on the tool used. This matters because overreliance on automated scores risks mislabeling human writing as machine-generated when it comes from non-native speakers or is stylistically predictable.
Shahid says she ran countless writing samples through detectors while trying to achieve a “human” rating for a client writing test; a LinkedIn post about that effort went viral. At one point she spent 45 minutes trying to produce a 100% human‑written rating and, by the tenth hour of testing, realized she had been chasing a metric instead of focusing on the writing itself. The anecdote highlights how pursuit of a score can distort practical judgment about authorship and quality.
She summarizes how detectors work: most analyze sentence structure, vocabulary choices and syntax patterns, flagging low variation or unusually uniform constructions as likely AI-generated. Shahid also notes that detectors and generators share modeling principles, since both rely on patterns of syntax, token probabilities and style features. That overlap makes detectors prone to false positives when human writing is predictable or stylistically narrow.
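The "low variation" signal described above can be illustrated with a toy metric. The sketch below is not the method of any detector named in this article; it is a hypothetical stand-in that measures only how much sentence lengths vary, on the assumption that uniform sentence lengths are one kind of "low variation" a tool might flag. The function names and the sample strings are invented for illustration.

```python
import re
import statistics


def sentence_lengths(text):
    """Split on ., ! or ? and return the word count of each sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def length_variation(text):
    """Coefficient of variation of sentence lengths (0.0 means perfectly uniform).

    Hypothetical toy signal only: real detectors are not documented here
    and may use entirely different features.
    """
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    mean = statistics.mean(lengths)
    return statistics.pstdev(lengths) / mean if mean else 0.0


# Invented sample texts: one with identical sentence lengths, one with mixed lengths.
uniform = "The cat sat down. The dog ran off. The bird flew away."
varied = "Stop. The dog, startled by the noise from the street, bolted across the yard. Why?"

print(length_variation(uniform))  # 0.0, every sentence has four words
print(length_variation(varied))   # greater than 1, lengths are 1, 13 and 1
```

A human writer with a plain, regular style would score low on a metric like this, which is the false-positive risk Shahid describes.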
To test detector behavior, Shahid prepared four short pieces: a poorly written AI sample, a poorly written human sample, a well-written AI sample and a well-written human sample. She fed those pieces into three commonly used detectors (ZeroGPT, Copyleaks and TraceGPT) to see whether the tools would consistently separate AI from human authorship across different quality levels.
The tools disagreed markedly on at least one clear case. Using what Shahid calls an ineffective Claude prompt to produce a deliberately poor AI sample, ZeroGPT labeled the output 100% AI‑generated, TraceGPT rated it roughly 75% likely AI, and Copyleaks reported an AI presence but did not provide a percentage. Shahid emphasizes that even blatantly bad AI output can be classified differently across detectors, underlining inconsistent scoring and limited comparability between tools.
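To make the disagreement concrete, the short sketch below tabulates the three reported results for the poor AI sample and computes the gap between the numeric scores. The figures come from the paragraph above, with TraceGPT's "roughly 75%" treated as 75 for the arithmetic. The `score_spread` helper is a hypothetical name, and representing Copyleaks's missing percentage as `None` is an assumption about how to record it.

```python
from typing import Dict, Optional

# Reported AI-likelihood scores (percent) for the deliberately poor AI sample.
# None marks a tool that flagged AI presence without giving a percentage.
reported_scores: Dict[str, Optional[float]] = {
    "ZeroGPT": 100.0,
    "TraceGPT": 75.0,   # reported as "roughly 75%"
    "Copyleaks": None,  # AI presence reported, no percentage
}


def score_spread(scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Return max minus min over the tools that gave a numeric score."""
    numeric = [value for value in scores.values() if value is not None]
    if len(numeric) < 2:
        return None  # not enough numeric scores to compare
    return max(numeric) - min(numeric)


spread = score_spread(reported_scores)
print(spread)  # 25.0 percentage points between ZeroGPT and TraceGPT
```

Because one of the three tools reports no number at all, only two of the results can be compared directly, which is part of the comparability problem the paragraph describes.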
She supports her hands-on findings with wider evidence: in a pilot study, seven GPT detectors misclassified an average of 61.22% of TOEFL essays written by non-native English speakers as AI-generated. Taken together, these results lead Shahid to warn that automated detectors should not be used as the sole arbiter of authorship. She points readers to manual ways of spotting AI-like signals and urges writers to focus on making text authentically their own, cautioning that overreliance on detectors risks unfairly penalizing non-native authors and well-intentioned human writers.