Google To Newspapers: Here, Let Me Introduce You To Robots.txt

from the snappy dept

Thu, Jul 16th 2009 8:31am — Mike Masnick

With the silly introduction last week of the AP's attempt to create a weird and totally unnecessary new data feed to keep out aggregators and search engines, it seems that Google has gotten fed up. Google execs and employees have made similar statements on various panels and discussions, but Senior Business Product Manager Josh Cohen put up a blog post directed at newspapers, that can be summarized as: Dear newspapers: let me introduce you to a tool that's been around forever. It's called robots.txt. If you don't like us indexing you, use it. Otherwise, shut up. In only slightly nicer language.

Filed Under: newspapers, robots.txt
Companies: google

Reader Comments

Petréa Mitchell, 16 Jul 2009 @ 9:17am

The response I expect to hear...
...is that robots.txt is only a request. A polite aggregator will respect it, but the wicked, devious, pirating bloggers and scrapers that the AP is fighting an urgent battle against won't.

And you have to acknowledge that there are occasional unscrupulous people who don't pay attention to robots.txt. So then the AP goes, "Ha, we were right!"

So I think Google allowing itself to be drawn into an argument about technical details is a mistake here. Better to keep the focus on the disproportionate harm to positive, legitimate use caused by an attempt to guard against a small number of true pirates.
- technomage (profile), 16 Jul 2009 @ 9:30am
  
  Re: The response I expect to hear...
  While I agree with your overall assumption, I know that those "unscrupulous" people will find ways around the paywalls as well. The newspapers are trying to make a blanket rule against everyone, but the problem is any blanket always has holes. The fact that these newspaper corps refuse to use the tools already at hand, but would rather force the government to get involved to make new rules and tools, shows just how out of touch they really are concerning technology. Creating new standards to benefit only one thing does not solve the underlying issue: People want news, they want it now, and they don't want to jump through hoops to get it.
- Yakko Warner, 16 Jul 2009 @ 9:45am
  
  Re: The response I expect to hear...
  That may be true, but their argument will be with the aggregators that do not respect robots.txt, not with Google (which, according to the point of the blog, is a "polite aggregator" and respects robots.txt).
  
  Not that you can trust the newspapers not to confuse "any other random (misbehaving, non-robots.txt-respecting) search engine" with "Google".
  
  I more expect to see one or more of the following:
  
  * Newspapers write an invalid robots.txt file that ends up allowing Google to index their site, and they blame Google for their own technical ineptitude.
  
  * Newspapers complain that they have to write a robots.txt file or META tag on their page, and demand Google adopt an "opt-in" policy for indexing (rather than this "opt-out" policy that is robots.txt).
  
  * Some newspapers do everything correctly, stop their content from being indexed, and then blame Google when their traffic goes down (especially compared to the sites that aren't blocking Google, which end up getting more traffic)
  
  * Newspapers block their content from being indexed, but other papers or sites take the same stories and republish them and allow them to be indexed, and the original papers blame Google for indexing the republishing sites.
  
  * Newspapers ignore this blog post completely and simply continue to blame Google for indexing content they don't want indexed.
- stat_insig (profile), 16 Jul 2009 @ 9:46am
  
  Re: The response I expect to hear...
  Well......... There is an easy solution if you don't want people to access your content..... keep it offline!
- MattP, 16 Jul 2009 @ 2:50pm
  
  Re: The response I expect to hear...
  FTA: "REP isn't specific to Google; all major search engines honor its commands."
Anonymous Coward, 16 Jul 2009 @ 9:34am

The reason the robots.txt argument is flawed is that the newspapers don't want Google to remove them from search. They want to show up in Google. They just want Google to pay them as well. It's nothing more than a money grab. Google says, "If you have good, useful content, we will rate you high in our index. You will get traffic and we will make a little money from ads on the search page." Newspapers say, "Sounds like a good deal. Except, we want you to pay us as well." Google says, "No, we don't think we should have to pay you. If you don't like the deal you can opt out." Newspapers say, "No, we like the deal, but you still have to pay us." Google says, "WTF."
- Ryan Z, 16 Jul 2009 @ 9:37am
  
  Re:
  Well, that doesn't make the argument flawed, does it? It just means the AP doesn't have a leg to stand on, except the politicians they've paid off, of course.
  - Hulser (profile), 16 Jul 2009 @ 10:34am
    
    Re: Re:
    Well, that doesn't make the argument flawed, does it? It just means the AP doesn't have a leg to stand on
    
    Exactly. Google knows that the AP knows about robots.txt. So the purpose of the Google blog post is not to let the AP know about robots.txt, it's to let everyone else know that the AP knows about robots.txt, which will result in undermining the AP's arguments for a legislative "solution".
- John Doe, 16 Jul 2009 @ 9:46am
  
  Re:
  This is exactly what is going on. Newspapers are trying to get the government to point their guns at Google to make them hand over bags full of money. It is an attempt at "legal" extortion.
- Anonymous Coward, 16 Jul 2009 @ 9:55am
  
  Re:
  I completely agree with google here, these newspaper companies are being evil and selfish. If they don't want Google linking to them then MAKE A ROBOTS.TXT file or TELL google to remove them from their index. I'm sure Google will be more than glad to remove them from their index. But you can't force someone to use your product and then force them to pay for it. That's like when the RIAA tried to force people to buy their music (and not boycott it) and then they tried to force them to pay for it ( http://www.techdirt.com/articles/20090616/1527385253.shtml ). Nonsense.
Ryan, 16 Jul 2009 @ 9:51am

yeah but
the problem with robots.txt is that nothing in the specification involves people paying the AP.

The AP doesn't want people to stop using their content - they want to change the way the web works so that they can be paid whenever they think they should be.

They know that blocking google would be devastating to their industry, so instead they bitch and whine hoping that somebody will pay them to shut up.
duderino, 16 Jul 2009 @ 9:54am

sad
It is just sad that Google has to keep making these kinds of public statements while the AP doesn't read them and keeps digging itself a bigger hole.
Ryan, 16 Jul 2009 @ 10:01am

i wish
I wish that Google would call the papers' bluff and completely remove them from search results, offering to re-include them only if and when they put up a robots.txt file.

They'll never do it, users would complain about not being able to find their news, but man would it be hilarious.
- Anonymous Coward, 16 Jul 2009 @ 10:30am
  
  Re: i wish
  I have a better idea. Remove them from the search results/news/everything and make them pay to come back in.
Anonymous Coward, 16 Jul 2009 @ 10:10am

Where can I write the politicians that these newspapers are lobbying? Everyone should write to them and explain the technology, and how absurd it is for these stupid newspapers to come crying to them for a money grab from Google.
Anonymous Coward, 16 Jul 2009 @ 10:17am

ACAP "protocol"
have you looked through the ACAP document? it's like 40 pages long, and all they do is explain how to use robots.txt for the first 35 pages or so. then they introduce a few new tags for inline meta types and markup for robots.txt... while disallowing *.

i encourage these guys to disallow * just so they can die faster and get replaced by better news outlets (newscientist, courthousenews, etc).

besides, nothing prevents someone from using a spider that sets the user agent as one of the standard IE/FF user agent strings. then you're stuck taking a javascript or IP address route which are also both unreliable.
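
The point about user agents can be sketched in a few lines of Python with the standard library's `urllib.robotparser`. Everything specific here is an assumption made for illustration: the robots.txt body, the crawler name `ExampleSpider`, the browser-style string, and the example.com URL. The sketch only shows that per-agent rules are matched against whatever name the client chooses to report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that singles out one crawler by name.
ROBOTS_TXT = """\
User-agent: ExampleSpider
Disallow: /
"""

# A browser-style user agent string (illustrative, not tied to any real client).
BROWSER_UA = "Mozilla/5.0 (Windows NT 6.0; rv:1.9) Gecko Firefox/3.5"

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what a compliant client would conclude from robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

URL = "http://example.com/news/story.html"
honest = allowed(ROBOTS_TXT, "ExampleSpider", URL)   # spider reports its real name
spoofed = allowed(ROBOTS_TXT, BROWSER_UA, URL)       # same spider, browser-style name
print(honest, spoofed)
```

The rule refuses the spider under its own name but does not match the browser-style string, which is why a name-based rule only works against crawlers that identify themselves honestly.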
Anonymous Coward, 16 Jul 2009 @ 10:18am

Fine, block your sites from Google, and I'll just find another. No big deal. That's the POINT of Google: finding another site.
Anonymous Coward, 16 Jul 2009 @ 10:31am

Google also does not respect robots.txt 100% of the time either, often indexing internal pages that are blocked by the robots file because of direct external links.

Do no evil. Right.
- Anonymous Coward, 16 Jul 2009 @ 10:43am
  
  Re:
  "Google also does not respect robots.txt 100% of the time either"
  
  Do you have any examples of this? Or are you just making things up?
- Hulser (profile), 16 Jul 2009 @ 10:44am
  
  Re:
  Google also does not respect robots.txt 100% of the time
  
  You tell me which is a more compelling argument...
  
  A) We're really pissed off that Google is linking to our web site but we can't be bothered to implement a simple technical solution that would stop this.
  
  B) We don't want Google linking to our web site, but they're ignoring our configuration and linking to it anyway.
  
  Because the AP is choosing option A, it's all but irrelevant whether Google respects robots.txt 100% of the time. Right now, the ball is in the AP's court.
  - Anonymous Coward, 16 Jul 2009 @ 10:46am
    
    Re: Re:
    It's time for everyone to boycott the AP.
- The Buzz Saw (profile), 16 Jul 2009 @ 10:47am
  
  Re:
  I'd be interested in seeing proof of this statement. I'm not trying to be obnoxious or anything; I'm just genuinely interested to see this happen. Several people have mentioned that Google does not always honor robots.txt, but I have never seen any matching evidence.
  
  Source?
  - Ryan, 16 Jul 2009 @ 10:56am
    
    Re: Re:
    Yeah, I don't see this...Google is going to code their bots the same way, so it'll treat every site the same way. Unless they added in exceptions to specific sites, although I don't know why they'd do that. Do they have a shit list of webmasters they don't like that they periodically update in their scrapers? Seems to me like an exception would be an improperly used robots.txt file.
Ryan, 16 Jul 2009 @ 11:22am

google DOES follow robots.txt
"Google also does not respect robots.txt 100% of the time either"

I think you misunderstand crawling vs indexing. Robots.txt says don't crawl my site. It doesn't mean Google can't index it - it just means they won't cache it, or visit it, or anything.

They will still show it in the search results, but only as a URL - with no snippet or text under it.

You're thinking of the noindex tag if you don't want to be listed.
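
As a rough illustration of the crawling side of this, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt body, the helper name `may_crawl`, and the example.com URL are assumptions for the example, not anything Google has published. It answers only the question "may a compliant crawler fetch this URL?"; keeping a page out of the listings is the separate noindex mechanism and is not covered here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body that asks every crawler to stay out.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-respecting crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

STORY = "http://example.com/story.html"
with_rules = may_crawl(ROBOTS_TXT, "Googlebot", STORY)  # blanket disallow in place
without_rules = may_crawl("", "Googlebot", STORY)       # empty robots.txt, no rules
print(with_rules, without_rules)
```

With the blanket disallow the fetch is refused; with an empty file nothing is restricted, which is the opt-out default the thread is arguing about.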
- Anonymous Coward, 16 Jul 2009 @ 11:53am
  
  Re: google DOES follow robots.txt
  Robots.txt has many commands that can say many different things, INCLUDING don't index my site. See the post on Google's blog.
  
  "Webmasters who do not wish their sites to be indexed can and do use the following two lines to deny permission:
  
  User-agent: *
  Disallow: /"
  
  http://googlepublicpolicy.blogspot.com/2009/07/working-with-news-publishers.html
  
  They can keep their website from being INDEXED on Google, if they so choose, with just a simple robots.txt file.
william, 16 Jul 2009 @ 1:08pm

Okay, let's put it this way. problem = opportunity = money.

The Internet and search engines are fine the way they are with REP, etc. However, newspapers want a share of THAT "internet money" without having to do any work or use their brains to come up with something new and novel.

What do they do? Create an artificial problem by pretending they know nothing about the current Internet technology. Create another standard that's inferior to what we have right now. Whine to create pressure to make people use them.

Then everyone will have to PAY THEM to NOT USE that sh*t standard.

Business model or extortion? You tell me.
MattP, 16 Jul 2009 @ 2:57pm

Opportunity Lost
"Today, more than 25,000 news organizations across the globe make their content available in Google News and other web search engines. They do so because they want their work to be found and read -- Google delivers more than a billion consumer visits to newspaper web sites each month." I'm sure there are plenty of sources wanting a share of 1 billion visitors a month. Let AP die and move on.