Google To Newspapers: Here, Let Me Introduce You To Robots.txt
from the snappy dept
With the silly introduction last week of the AP's attempt to create a weird and totally unnecessary new data feed to keep out aggregators and search engines, it seems that Google has gotten fed up. Google execs and employees have made similar statements on various panels and discussions, but Senior Business Product Manager Josh Cohen put up a blog post directed at newspapers, that can be summarized as: Dear newspapers: let me introduce you to a tool that's been around forever. It's called robots.txt. If you don't like us indexing you, use it. Otherwise, shut up. In only slightly nicer language.
Filed Under: newspapers, robots.txt
Companies: google
Reader Comments
The response I expect to hear...
...is that robots.txt is only a request. A polite aggregator will respect it, but the wicked, devious, pirating bloggers and scrapers that the AP is fighting an urgent battle against won't.
And you have to acknowledge that there are occasional unscrupulous people who don't pay attention to robots.txt. So then the AP goes, "Ha, we were right!"
So I think Google allowing itself to be drawn into an argument about technical details is a mistake here. Better to keep the focus on the disproportionate harm to positive, legitimate use caused by an attempt to guard against a small number of true pirates.
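The "only a request" point can be made concrete. Here is a minimal sketch in Python, assuming a hypothetical crawler name (`PoliteBot`), a made-up robots.txt, and example.com URLs. The check is done by the crawler itself, here with the standard library's `urllib.robotparser`, so a scraper that never makes the call is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve; the path is illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /news/
"""

def polite_may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit this agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

# A well-behaved crawler asks first and skips whatever is disallowed.
print(polite_may_fetch("PoliteBot", "http://example.com/news/story.html"))  # False
print(polite_may_fetch("PoliteBot", "http://example.com/about.html"))       # True
```

Nothing on the server side enforces the answer; the file only tells cooperative crawlers what the publisher wants.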
[ link to this | view in chronology ]
Re: The response I expect to hear...
Not that you can trust the newspapers not to confuse "any other random (misbehaving, non-robots.txt-respecting) search engine" with "Google".
I more expect to see one or more of the following:
* Newspapers write an invalid robots.txt file that ends up allowing Google to index their site, and they blame Google for their own technical ineptitude.
* Newspapers complain that they have to write a robots.txt file or META tag on their page, and demand Google adopt an "opt-in" policy for indexing (rather than this "opt-out" policy that is robots.txt).
* Some newspapers do everything correctly, stop their content from being indexed, and then blame Google when their traffic goes down (especially compared to the sites that aren't blocking Google, which end up getting more traffic)
* Newspapers block their content from being indexed, but other papers or sites take the same stories and republish them and allow them to be indexed, and the original papers blame Google for indexing the republishing sites.
* Newspapers ignore this blog post completely and simply continue to blame Google for indexing content they don't want indexed.
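The first scenario in that list is easy to stumble into. As a hedged illustration, using Python's standard `urllib.robotparser` and a made-up URL: under the original robots.txt convention an empty `Disallow:` value means nothing is disallowed, so leaving off the slash turns "block everything" into "allow everything". Other crawlers' parsers may differ in details, so treat this as a sketch of one parser's behavior.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse the given robots.txt text and ask whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

BLOCK_ALL = "User-agent: *\nDisallow: /\n"
MISSING_SLASH = "User-agent: *\nDisallow:\n"  # hypothetical mistake: empty value

URL = "http://example.com/story.html"  # illustrative URL
print(allowed(BLOCK_ALL, "Googlebot", URL))      # False: the whole site is off limits
print(allowed(MISSING_SLASH, "Googlebot", URL))  # True: an empty Disallow blocks nothing
```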
[ link to this | view in chronology ]
Re: Re:
Exactly. Google knows that the AP knows about robots.txt. So the purpose of the Google blog post is not to let the AP know about robots.txt, it's to let everyone else know that the AP knows about robots.txt, which will result in undermining the AP's arguments for a legislative "solution".
[ link to this | view in chronology ]
yeah but
The AP doesn't want people to stop using their content - they want to change the way the web works so that they can be paid whenever they think they should be.
They know that blocking google would be devastating to their industry, so instead they bitch and whine hoping that somebody will pay them to shut up.
[ link to this | view in chronology ]
sad
[ link to this | view in chronology ]
i wish
They'll never do it, users would complain about not being able to find their news, but man would it be hilarious.
[ link to this | view in chronology ]
ACAP "protocol"
i encourage these guys to disallow * just so they can die faster and get replaced by better news outlets (newscientist, courthousenews, etc).
besides, nothing prevents someone from using a spider that sets the user agent as one of the standard IE/FF user agent strings. then you're stuck taking a javascript or IP address route which are also both unreliable.
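To illustrate the user-agent point: robots.txt rules are keyed on the name a crawler reports about itself. A rough sketch, assuming a hypothetical scraper called `BadBot` and a made-up browser-style string, again using Python's standard `urllib.robotparser`:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set that singles out one named scraper.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "http://example.com/story.html"  # illustrative URL
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0"

# The same scraper, identifying itself honestly and then as a browser.
honest = parser.can_fetch("BadBot/1.0", URL)
disguised = parser.can_fetch(BROWSER_UA, URL)
print(honest)     # False: the rule matches the declared name
print(disguised)  # True: no rule mentions this agent string
```

The rule only matches whatever name the client chooses to send, which is why it cannot stop a spider that lies about who it is.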
[ link to this | view in chronology ]
Do no evil. Right.
[ link to this | view in chronology ]
Re:
Do you have any examples of this? Or are you just making things up?
[ link to this | view in chronology ]
Re:
You tell me which is a more compelling argument...
A) We're really pissed off that Google is linking to our web site but we can't be bothered to implement a simple technical solution that would stop this.
B) We don't want Google linking to our web site, but they're ignoring our configuration and linking to it anyway.
Because the AP is choosing option A, it's all but irrelevant whether Google respects robots.txt 100% of the time. Right now, the ball is in the AP's court.
[ link to this | view in chronology ]
Re:
Source?
[ link to this | view in chronology ]
google DOES follow robots.txt
I think you misunderstand crawling vs indexing. Robots.txt says don't crawl my site. It doesn't mean Google can't index it - it just means they won't cache it, or visit it, or anything.
They will still show it in the search results, but only as a URL - with no snippet or text under it.
You're thinking of the noindex tag if you don't want to be listed.
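For comparison, the noindex tag lives inside the page itself, as a `<meta name="robots">` element, which implies the page has to be fetched before a crawler can see it. A small sketch of how a crawler might detect it, using Python's standard `html.parser`; the class name, function name, and sample pages are made up for illustration:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())

def page_requests_noindex(html: str) -> bool:
    """Return True if the page carries a robots meta tag containing noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

TAGGED = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
PLAIN = "<html><head><title>Story</title></head></html>"
print(page_requests_noindex(TAGGED))  # True
print(page_requests_noindex(PLAIN))   # False
```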
[ link to this | view in chronology ]
Re: google DOES follow robots.txt
"Webmasters who do not wish their sites to be indexed can and do use the following two lines to deny permission:
User-agent: *
Disallow: /"
http://googlepublicpolicy.blogspot.com/2009/07/working-with-news-publishers.html
They can have their website not INDEXED on Google if they so choose just by a simple robots.txt file.
[ link to this | view in chronology ]
The Internet and search engines are fine the way they are with REP, etc. However, newspapers want a share of THAT "internet money" without having to do any work or use their brains to come up with something new and novel.
What do they do? Create an artificial problem by pretending they know nothing about the current Internet technology. Create another standard that's inferior to what we have right now. Whine to create pressure to make people use them.
Then everyone will have to PAY THEM to NOT USE that sh*t standard.
Business model or extortion? You tell me.
[ link to this | view in chronology ]
Opportunity Lost
[ link to this | view in chronology ]