European Newspapers Look To Reinvent Robots.txt

from the let's-try-this-again dept

Mon, Sep 25th 2006 3:09am — Mike Masnick

After losing its appeal on Friday, Google relented and has actually posted the entire text of the ruling against it on the Google.be site. It does look a bit odd to see so much text on a Google homepage, though Google put it in a tiny font and without any formatting at all, making it pretty difficult to read. It's still not at all clear what purpose this serves. Speaking of serving pretty much no purpose at all, it turns out that, in the wake of all this, a bunch of European newspapers are trying to create a new system by which they can tell Google not to index them. If you think this sounds suspiciously like the already available robots.txt file, you'd be right. In fact, in explaining the reason behind this, the publishers working on it state: "Since search engine operators rely on robotic 'spiders' to manage their automated processes, publishers' Web sites need to start speaking a language which the operators can teach their robots to understand. What is required is a standardized way of describing the permissions which apply to a Web site or Web page so that it can be decoded by a dumb machine without the help of an expensive lawyer." You know... if they only had a search engine, where they might do some searches to see if something like that already existed, it might help. But, I guess that according to many of those publishers, that would be copyright infringement.

To be fair, it does sound like these publishers are looking to put together something that goes a bit further than robots.txt, but it still looks like they're reinventing the wheel, rather than working on top of what's already there. Once again, though, this seems to all go back to the jealousy issue. This isn't about protecting content. It's about being jealous that Google has built a successful business making their content more valuable -- and they feel that the increased traffic and increased ad revenue aren't enough of a payment. It still takes quite a misunderstanding of the internet to complain that someone who gives you traffic isn't paying you enough for it. It wouldn't be unfair to then suggest that Google stop sending them traffic altogether. These publishers seem to assume they're in the power seat here, and that it's their content that makes Google valuable. That's not the case. The value in Google is its ability to make that content easier to search and find. If the publishers want to go back to the days when it was harder to find their content, that's their problem -- but it seems quite likely they'll regret it if that comes to pass.
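
For anyone who hasn't looked at how the existing mechanism works, here's a rough sketch. It assumes a made-up newspaper site (newspaper.example) with a made-up robots.txt, and uses Python's standard urllib.robotparser module to show how a well-behaved crawler would read it. The site name, the paths and the can_crawl helper are all invented for illustration; this isn't anyone's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary newspaper site.
# Googlebot is told to stay away from everything; every other
# crawler is only kept out of the (assumed) paid archive.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /archive/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "http://newspaper.example"  # illustrative domain only
    for agent in ("Googlebot", "SomeOtherBot"):
        for path in ("/news/today.html", "/archive/2006/story.html"):
            allowed = can_crawl(ROBOTS_TXT, agent, base + path)
            print(f"{agent:12} {path:28} allowed={allowed}")
```

With this file, Googlebot is refused everywhere while other crawlers are only kept out of /archive/. In other words, a per-crawler, per-path permission language that a dumb machine can decode already exists, though it admittedly can't express things like time limits or syndication terms.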

Reader Comments

Serria, 25 Sep 2006 @ 3:56am

Google not indexing the news sites.
Google has a simple solution here: just remove all references to the news sites in question from their database, both search and the news portal. I bet the newspapers will figure out in a hurry that Google is helping them and not the other way around.
Enrico Suarve, 25 Sep 2006 @ 4:00am

Still you gotta laugh...
The more this goes on, the more hilarious I find it - as a great man once said, 'sometimes the comedy just writes itself'

Possibly there are groups of users out there who just read the indexed copy and no further and never actually go onto the papers' websites, and I would assume this is what the papers are most worried about.

However.... I would also assume that the kind of people who just read the first few lines of an indexed or cached page, are probably the kind of people who wouldn't go to a newspaper site anyway

Seems to me the main interest is in securing revenue for their particular papers by ensuring that if you want to read the news you have to go there, however....

1) The less you advertise the less traffic you are going to get (in effect Google is giving them free advertising)
2) Unless you get a complete monopoly and every paper and news source joins in wholeheartedly, people are just going to go elsewhere cos it’s easier (and the internet is not an environment which looks kindly on monopolies)

As regards the new 'improved' robots.txt they seem to be suggesting: why not just use the one already there? If their websites are so unsecured that a simple spider can go to members-only pages, which they seem to suggest, I would suggest in return that the issue is with their own crap web programmers, not Google. If you want a new system to work, it most likely will have to be done in cooperation with the major search engines - taking them to court on a first date probably isn't the way to go about this

In a day and age when I can go to any book store and find at least 2 books devoted to how to get the googlebot to visit more often, I can't help but laugh at the ridiculousness of the order

If you want more people to visit your news site and possibly pay for it make the copy more engaging, more informed, better than the competition. Don't ban people from advertising it!

I can't wait for the next instalment
ScytheNoire, 25 Sep 2006 @ 5:37am

I have to agree that there should just be a ban altogether of these newspaper sites that don't want to be available in search engines.

They also obviously have very little clue about how websites work, or how search engines work.

And what about other search engines, such as Yahoo or MSN? Are they banned from doing the same thing Google apparently did that's so wrong? Or is it okay for them to do it?
Anonymous Coward, 25 Sep 2006 @ 6:25am

GREAT Idea, HORRIBLE reason to implement it
Robots.txt is a horridly crude method of speaking to search engines. It's time to move on.

We really do need an xml based "site directory" that includes different "content handling permissions" for different content. Permission levels, syndication rights, time limits, and all sorts of other wonderful things to allow the sites to better control how their content gets used. Right/wrong/indifferent, it should be up to the content owners how their content gets used.

HOWEVER, if a Belgian court (who CLEARLY doesn't understand what the hell they are doing) drives this technology improvement, it will ABSOLUTELY get fubar'd along the way. This is the type of thing that needs to be created as an industry standard by a working group, and done in such a way that everyone sees benefit from ease of management, and enhanced search engine visibility. A court having any say in the matter will ensure it's nothing but an example of waste and abuse. Some government WASTING their own money ABUSING a company providing a perfectly valid product.

(this was not intended as a troll btw, so if I look like an idiot saying this, then so be it, I'm probably an idiot.)
Spartikus, 25 Sep 2006 @ 6:59am

IEEE'd
It's true - a technology developed not for its purest form but to solve an impertinent (and frankly, ridiculous...) problem is going to be terribad; but I guess it's a step past robots.txt so it's not all bad. Maybe somebody who knows what they are doing can take this as a call to do the job properly.

Google takes a lot of crap from people (especially stupid ones) because of the spiders. However, if a site can't lock down their material for people who have a login or some sort of authentication then they deserve to get linked. This is clearly a case of somebody who probably got too drunk one night before work and due to a bad hangover forgot to set permissions on members only pages for members friggn only, and not wanting to lose his job, passed the buck on to Google.

Coward's right though - standardization is key here; an IEEE for Search Engine Optimization.
Kim Davenport, 25 Sep 2006 @ 8:01am

Google is stealing.....plain and simple
Imagine who would use Google if all you got on a search result page was sponsored links? Google's core product offering is composed of content publishers' intellectual property. Google is not sharing revenue; they aren't making the publishers owners in Google either. There is some publishers' consortium called Congoo that is doing this... we'll see how long Google can hold out without playing ball.
- Anonymous Coward, 25 Sep 2006 @ 8:03am
  
  Re: Google is stealing.....plain and simple
  That was the funniest troll I have read in a long time.
  
  You were joking, right?
Anonymous Coward, 25 Sep 2006 @ 8:04am

Google just needs to create a simple form that says "If you do not want us to index your page, enter the site URL to opt out of all our indexing."

I don't see that many sites going this way, and those that do, deserve what they (don't) get.
- Anonymous Coward, 25 Sep 2006 @ 8:14am
  
  Re:
  Google just needs to create a simple form that says "If you do not want us to index your page, enter the site URL to opt out of all our indexing."
  
  Woot! I love that idea. I'm going to buy some stock in NYT and then opt all other newspapers out of google news.
  
  But don't worry, I'll spoof your IP address when I do it.
Spartikus, 25 Sep 2006 @ 8:32am

Google Breaketh not the 8th Commandment
Yeah... You DO know how Google works, right? It doesn't make any money off publishing web content from sites; why should it pay the owners for advertising it is providing for free? As far as sponsored links go, companies pay Google to advertise for them because people know their search results are amazing (the government actually wanted the algorithms for their search engine because of the sheer power); and Google pays people who allow Google ads to appear on their sites (so don't say they're not paying any site owners out for those).

And if you're thinking of reaching beyond that into a more broad search engine blast, are you actually suggesting that search engines in their purest form are stealing? Do you realize how ridiculously impossible it would be to use something as vast as the Internet successfully without a search engine?

I'm not saying Google is perfect, but you have to respect the fact that they have made some amazing strides in SEO, advertising, and a myriad of other fields as well (go to Google and click on Other...).
Victor Atehortua, 25 Sep 2006 @ 8:51am

Companies and money
We all need to remember that most companies just want money and as stupid as it seems these companies see Google as a loss in that profit.
Anonymous Coward, 25 Sep 2006 @ 9:11am

how much stake does IEEE have in creating web standards. i thought that was for w3c?
- Anonymous Coward, 25 Sep 2006 @ 11:57am
  
  Re:
  how much stake does IEEE have in creating web standards. i thought that was for w3c?
  
  You know, you've identified (historically) the right body.
  
  Unfortunately, the W3C doesn't have a clue about getting standards adopted. They need to be disposed of and replaced with something useful.
  
  Who in their right fricking mind releases a standard KNOWING that not a single product currently conforms to it, and also fails to release ANYTHING that actually helps companies get their products to spec.
  
  Also, the w3c has made all sorts of asinine requirements to be up to spec, that have absolutely no benefit to the software makers. What good is a spec if you can't get anyone onboard with the implementation of it?
  
  At least IEEE can actually convince companies to follow the spec. W3C doesn't seem to want anyone to follow theirs.
  
  /offtopic-ness
Jo Mamma, 25 Sep 2006 @ 9:49am

Belgian court is so clueless...
Sen. Stevens needs to give this Belgian court a lesson. I think the Belgian tubes are really twisted up.

I realize they're a democracy, but I'm pretty proud to be an American because at least our retarded officials aren't retarded to THIS extreme... at least I hope they're not.
Annoying Bastard, 25 Sep 2006 @ 10:18am

What's next?
Belgian courts dictating OS requirements?

I think the court needs to forget about technology and focus on their best known export - WAFFLES. :-P

How many people want to bet the judge in charge of the case can't send an e-mail with an attachment, let alone properly format a search query?

SOLUTION: GIVE THE LUDDITES BACK THEIR TYPEWRITERS AND BANISH THEIR ASSES FROM THE NET!

Correction fluid, anyone?
Someone Somewhere, 25 Sep 2006 @ 10:26am

User Review
I always thought of Google as a "Review" if you will. Kinda like when the X review club sends you a product and asks you to review it. Sometimes you get to keep the product but you are required to share your opinion with other potential buyers. When you search on Google sometimes you click the link and go to the page and sometimes you skip it and go somewhere else. Is it really their intention to make us load the ENTIRE webpage just to see the headline and first sentence? I would think that is a waste of bandwidth. In essence Google is not only making the preview of content available to me but is also making the original content more available to others!

That leads to another question: What happens to the members of those news feeds that use Google to preview the content? Do they no longer get to see the content they pay for without logging into the site? That also seems like a waste of time for someone just wanting the headlines.
Spartikus, 25 Sep 2006 @ 11:39am

IEEE
They don't have any as far as I know. That's why I said we need "an IEEE" for SEO. =D

Sort of like saying "I am the god of counter-strike". I'm not actually God, just the god of counter-strike.
Ana, 25 Sep 2006 @ 12:37pm

theyarejealousofgoogle.com
"Once again, though, this seems to all go back to the jealousy issue. This isn't about protecting content. It's about being jealous that Google has built a successful business making their content more valuable -- and they feel that the increased traffic and increased ad revenue isn't enough of a payment".
I couldn't have said it in a more accurate way.
Ditto!
Anonymous Coward, 25 Sep 2006 @ 1:11pm

Mike, you might be surprised to learn how difficult Google can be to work with. They can be quite cut-and-dried and stubborn when refinements would improve outcomes for both.
Richard Ketchem, 25 Sep 2006 @ 2:23pm

Google is one big thief!
People keep saying that publishers get free traffic from Google... but the user isn't looking for content that Google created... they are looking for content the publisher created... Google is displaying stolen content alongside advertising. This is illegal, guys! Let's not make it complicated. It's illegal!
- Anonymous Coward, 25 Sep 2006 @ 3:15pm
  
  Re: Google is one big thief!
  Ugh... Another "google is stealing" troll. That's two in one day.
  
  For the love of all that is logical, GET A CLUE! What google is doing is NOT STEALING.
  
  If you are trying to spread the anti-google message, you should probably just silence yourself, as you are NOT doing your cause a favor by making all anti-google zealots look like driveling morons.
  
  There I have stooped low enough to respond to a troll. Now I need to go wash myself. I feel so dirty...
Paul Turner, 25 Sep 2006 @ 2:54pm

Belgians have Newspapers?
I use Google extensively every single day, I don't think I've ever noticed and certainly never used a link to ANY article in Belgium.
Maybe they view the internet as another Belgian Congo and are out to rape and pillage whatever they can in cyberspace too.
I don't know about waffles, but their chocolates and beer are great, they should stick to what they do well.
u no, 25 Sep 2006 @ 3:24pm

who knows
I did not really know what all of this meant the first time I read it, but I highlight the unknown, right-click, and then Google search for the term. All of this said, I do not think a Belgian court can stop Google. Even if it is illegal (I think it is not), Google is still the best search engine that I know of. Plain and simple: Google is the best.
Anonymous Coward, 25 Sep 2006 @ 7:42pm

Google is dead in the wrong
If Google were to simply present a webpage with a list of online newspaper websites (and accompanying links to those sites), *that* would be considered the *search* service that Google claims to be providing. Or, if Google were to present a review or critique of various online newspaper sites, offering insight into how great site X is or how mundane site Y is, and even posting some example headlines from the site(s) - *that* would be considered *fair use* of the content of those newspapers. In either of these cases, there would be no arguments that Google has done no wrong.

However, this is NOT what Google is doing. Google News visits the newspaper's website, pulls ALL of the copyrighted news headlines and story snippets, and sticks it all up on Google's own site. Google returns to the newspaper's website the next day and does it again. And the next day. And the next. Google repeatedly rips off the intellectual property of these other websites day in and day out, and puts that content on Google's own website. Google is NOT simply *providing a link* to those websites to send them traffic.

The fact of the matter is that *Google's value* is increased by all of the content that has been created by many other people. It creates value for Google because people are able to visit Google and see ALL of those headlines and news snippets and associated photographs (yes, Google even steals the photos that appear with the news headlines from those newspapers' sites) in one place - the Google News website. This in turn drives increased traffic to *Google's* news site - and thereby enables Google to generate millions of dollars selling advertising space.

The news headlines, story details, and so forth are the valuable commodity that brings Google its traffic. Without that content, Google's visitors would only see a list of links to online news sites, and the Google News site's value/usefulness would be greatly reduced. *THAT* is what's at the root of this issue - and Google is clearly in the wrong. Do you think those newspaper articles (headlines and all) just write themselves? Intellectual property is intellectual property, and the news sites have spent time and effort in preparing and presenting that content ON THEIR WEBSITES. It *IS* copyrighted material, just as any published paperback sold in Barnes & Noble, and it *IS* protected by copyright law.

If Google wishes to publish this material to their website, they must seek permission and pay any royalties which the original authors/publishers may request, just as Google would have to do if it decided it was going to post a new chapter of the latest "Harry Potter" novel on its website every day.
- Mike (profile), 25 Sep 2006 @ 10:38pm
  
  Re: Google is dead in the wrong
  However, this is NOT what Google is doing. Google News visits the newspaper's website, pulls ALL of the copyrighted news headlines and story snippets, and sticks it all up on Google's own site.
  
  This is false. Google does not show the entire article. It shows the headline and (sometimes) a small snippet which is clearly fair use, along with a link. So, your premise is wrong.
  
  The fact of the matter is that *Google's value* is increased by all of the content that has been created by many other people.
  
  Whether or not it increases Google's value is beside the point. The issue is whether or not the newspaper sites are harmed by the practice. If they are not, they have no complaint. The fact that Google's value is increased shouldn't make a difference.
  
  It creates value for Google because people are able to visit Google and see ALL of those headlines and news snippets and associated photographs (yes, Google even steals the photos that appear with the news headlines from those newspapers' sites) in one place - the Google News website.
  
  Note that the purpose of Google News is to drive people who visit it to those individual sites that originally published the news. So, it benefits them by giving them more traffic they might not have received otherwise.
  
  This in turn drives increased traffic to *Google's* news site - and thereby enables Google to generate millions of dollars selling advertising space.
  
  Again, this is false. Google places no ads on their news site.
  
  You seem to be under some false assumptions here. Google does not display full articles, but simply links and drives traffic. Also, Google does not put ads on its news site. If you don't know those two basic facts, it's hard to take the rest of your complaint seriously.
  
  Do you think those newspaper articles (headlines and all) just write themselves?
  
  Do you think that people magically find newspaper articles online by themselves? No, they need to generate traffic... which is exactly what Google News does.
  
  It *IS* copyrighted material, just as any published paperback sold in Barnes & Noble, and it *IS* protected by copyright law.
  
  And copyright law *DOES* have something called "fair use" which is what Google uses in showing snippets. Look it up.
Frank, 25 Sep 2006 @ 10:13pm

What's the actual original complaint about?
After reading a number of these teaser articles, I still haven't heard what the actual complaint was about. There is plenty of "feedback" that certainly sounds like straw men.
I doubt that the issue is the 1.5 lines of context that the search term was in. I doubt it was being searched. With the "portal" like Google, I can have headlines all over it, but I assume it wasn't related to that.
It might have been because of Google's cache -- where the person's content is now hosted on Google's servers. If this is the case, the newspapers could have a case. Google is [mis]appropriating their content. If they put a copyright on the page, Google is violating their copyright.
I know my local newspaper keeps the news online for 2 weeks and then charges for viewing back issues. If Google is going to cache the pages and then allow people to view their copyrighted material for free, I can see where it might cost the local paper a revenue stream.
- Mike (profile), 25 Sep 2006 @ 10:41pm
  
  Re: What's the actual original complaint about?
  After reading a number of these teaser articles, I still haven't heard what the actual complaint was about. There is plenty of "feedback" that certainly sounds like straw men.
  
  The text of the decision is on that Google site, and we wrote about the original case when it came out (that links to the actual court order as well). That has the details, which shows the court was quite confused.
  
  It might have been because of Google's cache -- where the person's content is now hosted on Google's servers. If this is the case, the newspapers could have a case. Google is [mis]appropriating their content. If they put a copyright on the page, Google is violating their copyright.
  
  If you read the decision, you see that the judge (and possibly the newspapers) were very confused. They continually switch back and forth between Google cache and Google News interchangeably, using each when it suits. That's just part of the problem with the case. It seems clear that the judge and the newspapers aren't even clear what the complaint is about.
  
  I know my local newspaper keeps the news online for 2 weeks and then charges for viewing back issues. If Google is going to cache the pages and then allow people to view their copyrighted material for free, I can see where it might cost the local paper a revenue stream.
  
  Again, there are very easy ways to opt out of the cache, so it's hardly a reasonable complaint to then ban all French and German papers from Google.be.
  - Frank, 25 Sep 2006 @ 11:03pm
    
    Re: Re: What's the actual original complaint about
    Thanks for the pointer, Mike. Following the links (okay, only about 2 deep from where you pointed), I found http://www.chillingeffects.org/international/notice.cgi?NoticeID=5133 which has the judgement. The [summary of the] judgement is pretty clear that the issue is the Google cache:
    "Find that the activities of Google News and the use of the 'Google cache' violate in particular the laws on copyright and ancillary rights (1994) and the law on data bases (1998);"
    Poking around Google (less than 5 minutes worth; 3 or 4 clicks from the main page), you can find: "The 'Cached' link will be missing for sites that have not been indexed, as well as for sites whose owners have requested we not cache their content." That indicates that Google will not cache your site if you request ... but it wasn't worth more than 5 minutes trying to figure out how to avoid being cached.
    - Enrico Suarve, 26 Sep 2006 @ 12:50am
      
      Re: Re: Re: What's the actual original complaint a
      I did a reasonable amount of reading around last night (sorry I don't have the links any more) but it seems the main cause for complaint was that Google, by creating an online cache is establishing itself as a news source, a 'one stop shop' for the headlines and using copyrighted materials to do this. Only part of the complaint refers to the caching of pages and this is more to do with the idea that by caching, the papers involved lose control of their content (if they change something online it will not be reflected in Google news)
      (NOTE I don't agree with this statement - this is just what the order states)
      
      Like you say the case jumps from Google news to Google search as it suits the lawyers so even then it’s not easy to follow
      
      The order doesn't actually say ALL German and French speaking papers - just those that are represented by the plaintiff
      (Unfortunately I couldn't find exactly which papers these are)
      
      Oddly enough German isn't a national language in Belgium - French and Flemish are. Flemish is basically the Belgian dialect of Dutch and not at all the same as German
andreww, 27 Sep 2006 @ 11:12pm

Google does deprive newspapers of revenue
I know that when I see something in a newspaper or magazine online which is subscription only or has been archived (e.g. when I revisit to look for it a few days later) and the archive is pay per view, I Google to see if I can get the Google cached version... 9 times out of 10 I can. This is because Google is illegally copying. It is depriving the IP owner of revenue (maybe a couple of bucks per article?). Separately, why should the IP owner have to "opt out"? Can a burglar rob my house and then say that I didn't specifically warn him off beforehand?