Using Gzip To Identify Authors Of Text

from the odd-uses dept

An article at ABCNews covers some researchers who have figured out a way to use Gzip to identify the authors of text files. They point out that in compressing a document, Gzip has to learn about it to figure out what it can compress, and it can then use what it has learned to identify similar documents. The researchers ran a test in which the process correctly identified the authors of sample texts 93% of the time, but it only had to choose from 11 different authors. Something about it sounds fishy to me, though. It's also unclear from the description of the study whether there was any sort of control group. Even the guy who wrote Gzip is skeptical.
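
For anyone wondering what such a test could look like, here is a minimal sketch in Python. It is a guess at the general idea, not the researchers' actual procedure: append the unknown sample to a body of text from each known author, and see which author's text makes the sample cheapest to compress. It uses Python's zlib module, which implements the same DEFLATE compression that Gzip uses. The function names extra_bytes and guess_author are made up for illustration.

```python
import zlib


def extra_bytes(reference: bytes, sample: bytes) -> int:
    """Extra compressed bytes needed to store `sample` after `reference`.

    Assumption: a small value means the compressor could reuse patterns
    it already saw in `reference`, which we read as "similar text".
    """
    base = len(zlib.compress(reference, 9))
    combined = len(zlib.compress(reference + sample, 9))
    return combined - base


def guess_author(corpora: dict[str, bytes], sample: bytes) -> str:
    """Return the author whose reference text makes `sample` cheapest."""
    if not corpora:
        raise ValueError("need at least one reference corpus")
    return min(corpora, key=lambda name: extra_bytes(corpora[name], sample))
```

Note that DEFLATE only looks back over a 32 KB window, so in this sketch only the tail end of each reference text can influence the result. With a single short sample and a handful of candidates, a high hit rate would not say much about how it scales.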


Reader Comments


  1. mhh5, 30 Jan 2002 @ 5:59pm

    utter crap.

    Correct me if I'm wrong, but gzip doesn't "learn" anything. While it may do some simple pattern matching, I don't think it has anywhere near the "learning capabilities" to distinguish more than a few "trained" samples.

    This is just a statistical anomaly. Gzip is not a magic alternative to artificial intelligence. Shame on masnick for not slamming this article harder. Didn't you TA a stat class? Isn't this just a case of poor sampling size?


  2. Ed, 30 Jan 2002 @ 7:01pm

    Re: utter crap.

    The capabilities sound a bit oversold to me, but there's probably something to it. Unfortunately trying to connect to the web servers at a university in Italy hasn't been very fruitful, but I can surmise that the gzip compression ratio is used as a measure of the entropy in the text. With only a single number to go on, you couldn't pick the author out of a large population, but it might be useful in deciding whether something was written by Author X or Author Y.


