Distributed Search Engines, And Why We Need Them In The Post-Snowden World
from the easier-said-than-done dept
One of the many important lessons from Edward Snowden's leaks is that centralized services are particularly vulnerable to surveillance, because they offer a single point of weakness. The solution is obvious, in theory at least: move to decentralized systems where subversion of one node poses little or no threat to the others. Of course, putting this into practice is not so straightforward. That's especially true for search engines: creating distributed systems that are nonetheless capable of scaling so that they can index most of the Web is hard. Despite that challenge, distributed search engines do already exist, albeit in a fairly rudimentary state. Perhaps the best-known is YaCy:
YaCy is a free search engine that anyone can use to build a search portal for their intranet or to help search the public internet. When contributing to the world-wide peer network, the scale of YaCy is limited only by the number of users in the world and can index billions of web pages. It is fully decentralized, all users of the search engine network are equal, the network does not store user search requests and it is not possible for anyone to censor the content of the shared index. We want to achieve freedom of information through a free, distributed web search which is powered by the world's users.
The resulting decentralized web search currently has about 1.4 billion documents in its index (and growing -- download and install YaCy to help out!) and more than 600 peer operators contribute each month. About 130,000 search queries are performed with this network each day.
Another is Faroo, which has an interesting FAQ that includes this section explaining why even privacy-conscious non-distributed search engines are problematic:
Some search engines promise privacy, and while they look like real search engines, they are just proxies. Their results don't come from their own index, but from the big incumbents (Google, Bing, Yahoo) instead (the query is forwarded to the incumbent, and the results from the incumbent are relayed back to the user).
Not collecting logfiles (of your IP address and query) and using HTTPS encryption at the proxy search engine doesn't help if the search is forwarded to the incumbent. As revealed by Edward Snowden, the NSA has access to the US-based incumbents via PRISM. If the search is routed over a proxy (aka "search engine"), the IP address logged at the incumbent is that of the proxy and not of the user. So the incumbent doesn't have the user's IP address, the search engine proxy promises not to log/reveal the user's IP, and HTTPS prevents eavesdropping on the way from the user to the search engine proxy.
Sounds good? By observing the traffic between user and search engine proxy (IP, time and size are not protected by HTTPS) via PRISM, Tempora (GCHQ taps the world's communications) et al., and combining that with the traffic between the search engine proxy and the incumbent (query, time and size are accessible via PRISM), all that seemingly private and protected information can be revealed. This is a common method known as traffic analysis.
The NSA system XKeyscore makes it possible to recover search engine keywords and other communication just by observing connection data (metadata) and combining it with the backend data sourced from the incumbents. The system is also used by the German intelligence services BND and BfV. Neither encryption with HTTPS, nor the use of proxies, nor restricting the observation to metadata protects your search queries or other communication content.
Unfortunately, unlike YaCy, Faroo is not open source, which means that its code can't be audited -- an essential prerequisite in the post-Snowden world. Another distributed search engine that is fully open source is Scholar Ninja, a new project from Jure Triglav:
I've started building a distributed search engine for scholarly literature, which is completely contained within a browser extension: install it from the Chrome Web Store. It uses WebRTC and magic, and is currently, like, right now, used by 42 people. It's you who can be number 43. This project is 20 days old and early alpha software; it may not work at all.
As that indicates, Scholar Ninja is domain-specific at the moment, although presumably once the technology is more mature it could be adapted for other uses. It's also very new -- barely a month old at the time of writing -- and very small-scale, which shows that distributed search has a long way to go before it becomes mainstream. Given the serious vulnerabilities of traditional search engines, that's a pity. Let's hope more people wake up to the need for a completely new approach, and start to help create it.
Follow me @glynmoody on Twitter or identi.ca, and +glynmoody on Google+
Filed Under: distributed computing, privacy, search engines
Reader Comments
In the end, if billions of people index all pages, it could get better than Google, too. The power of the crowd vs a single entity.
Distributed solutions are the future.
Note that almost all file sharing systems rely on a centralized index to allow searching and finding peers. In essence, finding torrents is a smaller-scale search problem: although the actual file transfer is done on a decentralized basis, finding the file is usually centralized. File sharers are more aware than most people of the hazards of centralized systems, and include many programmers in their ranks, yet they are still struggling to come up with a way of decentralizing search to avoid the problems of trackers being blocked and domains being seized. And that is a significantly easier problem to solve than a full index of the publicly available Internet, as the indexes are much smaller.
Re:
A system where the ranking process is open source is pretty much doomed to an early death, as the results will be almost entirely spam within hours of it reaching a reasonable level of user searches.
Re: Re:
With YaCy the user controls the ranking, since it's done at the client. The user also controls their own blacklist of results. I've been running a YaCy node for over a year and really have had to blacklist only two entities - one was a porn link spammer and the other was an annoying link farm without any actual content.
A system where the ranking process is open source is pretty much doomed to an early death, as the results will be almost entirely spam within hours of it reaching a reasonable level of user searches.
Not at all. YaCy doesn't seem to be rendered useless by spam at all. I'm migrating to using YaCy almost exclusively now; since I have it set up to crawl based on what I search for, my results are very relevant to me.
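That client-side approach is simple enough to sketch. The snippet below is a minimal, hypothetical illustration (not YaCy's actual code, which is written in Java) of the idea that ranking and blacklisting both live on the user's machine: results arrive from peers, domains on a personal blacklist are dropped, and the rest are ordered by a locally computed score.
```python
# Hypothetical sketch of client-side ranking with a personal blacklist.
# Not YaCy's real implementation; it only illustrates the idea that peers
# supply candidate results and the *client* decides the order.

from urllib.parse import urlparse

# Domains this particular user never wants to see again.
BLACKLIST = {"porn-link-spammer.example", "annoying-link-farm.example"}

def score(result, query_terms):
    """Local ranking: count query terms in the title and snippet."""
    text = (result["title"] + " " + result["snippet"]).lower()
    return sum(text.count(term.lower()) for term in query_terms)

def rank(results, query):
    terms = query.split()
    kept = [r for r in results
            if urlparse(r["url"]).hostname not in BLACKLIST]
    return sorted(kept, key=lambda r: score(r, terms), reverse=True)

# Example: results as they might arrive from several peers.
peer_results = [
    {"url": "https://example.org/yacy", "title": "YaCy distributed search",
     "snippet": "YaCy is a peer-to-peer search engine."},
    {"url": "https://annoying-link-farm.example/x", "title": "search search",
     "snippet": "search engine search engine"},
]
for r in rank(peer_results, "distributed search engine"):
    print(r["url"])
```
Because both the scoring function and the blacklist are local, a spammer cannot game one global ranking for everyone, which is the point the comment above is making.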
Re:
You know, Google has the same problems. Did you really think Google's search engine is centralized? It's not: it's distributed across thousands of nodes, each one holding only part of the index. So problems like computing the ranking and distributing the queries are already known to be solvable.
What Google has that is centralized is trust. Google's nodes know they can trust other nodes, which simplifies things. File sharing systems usually do not have that trust. This leads to the most visible problem with decentralized search: nodes returning faked (usually spam) results.
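A rough way to picture this: both a trusted cluster and a peer-to-peer network answer a query by fanning it out to many nodes that each hold a slice of the index, then merging the partial results. The sketch below (all names are illustrative, not from any real system) shows that scatter-gather step; the hard part the comment points to is that in a P2P setting you cannot simply believe every shard's answer.
```python
# Minimal scatter-gather sketch: a query is sent to every index shard,
# each shard returns (url, score) pairs for the slice of the web it holds,
# and the coordinator merges them. Names are illustrative assumptions.

import heapq

class Shard:
    def __init__(self, postings):
        # postings: term -> list of (url, score)
        self.postings = postings

    def search(self, term):
        return self.postings.get(term, [])

def scatter_gather(shards, term, k=10):
    partials = []
    for shard in shards:               # in a real system: parallel RPCs
        partials.extend(shard.search(term))
    # Merge and keep the k best-scoring URLs overall.
    return heapq.nlargest(k, partials, key=lambda pair: pair[1])

shards = [
    Shard({"snowden": [("https://a.example/1", 0.9)]}),
    Shard({"snowden": [("https://b.example/2", 0.7),
                       ("https://b.example/3", 0.2)]}),
]
print(scatter_gather(shards, "snowden", k=2))

# In a trusted cluster the coordinator believes every shard; in a P2P
# network a malicious shard could return spam with inflated scores,
# which is exactly the trust problem described above.
```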
Re: Re:
The huge difference between a supercomputer or server farm and the Internet is the communications bandwidth available to the system, which is greater by several orders of magnitude, aided by specialized networking support at each node, such as the ability to bypass the kernel when accessing the network and to use a local addressing scheme to link nodes within the system. When comparing performance, all the nodes sitting in one big barn amount to a centralized system compared to having the nodes spread all around the world.
Re:
I'd say that it is feasible, or will be in a matter of a few years or even months. It may take a few seconds rather than the near-instant response of a standard search engine, but that's a price I wouldn't mind paying. As for fake/spammy nodes, there are tools to handle them already. On BitTorrent, for instance, bad nodes get isolated and eventually ignored in the swarm (clients were forced into such measures by the MAFIAA poisoning swarms), so the technology is there. There will be tradeoffs for sure, but it can be achieved.
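To make that concrete, here is a toy reputation scheme of the kind the comment alludes to (the thresholds and scoring are invented for illustration, not taken from any BitTorrent client): peers whose answers repeatedly fail verification have their score lowered until the client stops querying them.
```python
# Toy peer-reputation sketch: downgrade peers whose results fail
# verification, and stop asking them once they fall below a threshold.
# All numbers and names are illustrative assumptions.

class PeerReputation:
    def __init__(self, ban_threshold=-5):
        self.scores = {}               # peer_id -> integer score
        self.ban_threshold = ban_threshold

    def record(self, peer_id, result_verified):
        delta = 1 if result_verified else -2   # punish bad results harder
        self.scores[peer_id] = self.scores.get(peer_id, 0) + delta

    def usable_peers(self, peer_ids):
        return [p for p in peer_ids
                if self.scores.get(p, 0) > self.ban_threshold]

rep = PeerReputation()
for _ in range(4):
    rep.record("peer-spammy", result_verified=False)  # keeps sending junk
rep.record("peer-good", result_verified=True)

print(rep.usable_peers(["peer-good", "peer-spammy"]))  # ['peer-good']
```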
Isn't that what BigCouch was supposed to solve?
http://bigcouch.cloudant.com/
Also, Google's Omega cluster does much the same thing: it doesn't matter where in the world a server is; they all act like one big machine.
https://research.google.com/pubs/pub41684.html
Re:
Any system that can be manipulated will be manipulated until results are completely useless to users.
Google has the best search engine; Yahoo has the second best; all the others are worse.
Google search results are manipulated by filters.
Some of these filters remove what governments and pressure groups consider to be inappropriate material; others manipulate what is deemed appropriate by commercial interests; all filters except those initiated by the end users manipulate what the end user is allowed to see, in an endless process of censorship.
The problem is that Google does not produce the results the end user wants, while recording the end user's every action, which can then be exploited for the benefit of others.
Google and search engines like it need to be replaced by something that produces the results the end user wants without the constant surveillance.
Paying the peers
Of course, I have no idea how to do that.
We contribute 5% of our hardware to distributed programs.
YaCy Tips
- Increase the RAM setting. The default is 600MB. I have 4GB, so I give YaCy 1.2GB (1200MB). I would give more if this were a dedicated node, but since it's my laptop, 1.2GB seems to play nice with the other stuff that I'm running.
- Limit the crawl maximum. The default is 6000 PPM (pages per minute), and that is pretty large. I share my internet connection with other people and devices, so I limit it to 300 PPM so I don't hog all the bandwidth and piss anyone off.
- Increase language ranking to 15 (max). I tend to like reading stuff in English, but that's just me.
- Turn on the Heuristics settings so it automatically crawls one level deep on every page returned in the results. This way, if you do a search and the results kind of suck, wait ten minutes, do the search again, and the results are better because it has been "learning" about what you just searched for.
I also turn on the "site operator shallow crawl": when you enter a search query in the format "site:somewebsite.com", it automatically crawls that site one level deep (a rough sketch of what such a one-level crawl does follows below).
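For anyone curious what "one level deep" means in practice, here is a generic sketch of a shallow crawl (it is not YaCy's crawler; the stdlib-only approach and the limits are my own assumptions): fetch the page, collect the links on it, fetch those, and stop.
```python
# Rough sketch of a one-level-deep ("shallow") crawl using only the
# standard library. Not YaCy's crawler; behaviour is simplified for
# illustration.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url):
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def shallow_crawl(start_url, max_links=20):
    """Fetch start_url, then fetch the pages it links to, and stop there."""
    pages = {start_url: fetch(start_url)}
    collector = LinkCollector()
    collector.feed(pages[start_url])
    for href in collector.links[:max_links]:
        link = urljoin(start_url, href)
        if link not in pages:
            try:
                pages[link] = fetch(link)   # depth 1: do not recurse further
            except (OSError, ValueError):
                pass                        # skip unreachable or odd links
    return pages

# pages = shallow_crawl("https://example.org/")
```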
Re:
For YaCy it's a DHT (distributed hash table) and it's stored and shared in little bits and pieces from each user's hard drive.
Basically, it's "stored" the same way a torrent is "stored" in the swarm.
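As a loose illustration of how a DHT spreads an index across peers (this is a generic consistent-hashing sketch, not YaCy's actual scheme): hash each search word, and the peers whose identifiers are closest to that hash are responsible for storing the corresponding URL list.
```python
# Generic DHT-style sketch: each search word is hashed, and the peer(s)
# whose hashed IDs are numerically closest to that hash store the word's
# URL list. Illustrative only; YaCy's real scheme differs in detail.

import hashlib

def key(value):
    """Map a string (peer ID or search word) to a 160-bit integer."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

def responsible_peers(word, peer_ids, replicas=2):
    """Return the peers whose hashed IDs are closest to the word's hash."""
    word_key = key(word)
    return sorted(peer_ids, key=lambda p: abs(key(p) - word_key))[:replicas]

peers = ["peer-alice", "peer-bob", "peer-carol", "peer-dave"]
print(responsible_peers("snowden", peers))

# To store: send the (word -> URLs) fragment to those peers.
# To search: hash the word again and ask the same peers, so no central
# server ever needs to hold the whole index.
```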
Re: Re:
Let's look at a grain of sand on the beach of the Internet: a Google search for Barack Obama gives:
About 58,900,000 results (0.30 seconds)
That is almost 2GB of data just for the links, assuming 30 characters per link. If you want a descriptive paragraph, a la Google, that would be more like 40-50GB of data. Throw in the rest of the indexes needed to support more refined searches, and you are looking at several hundred GB just to do a decent index for one man. When distributed to user-level machines, that part of the index could be spread over several hundred machines. Start scaling up to the whole Internet, and tens of millions of machines are likely required, which makes finding which machines to query a major search problem in its own right.
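The arithmetic behind that estimate is easy to check (the 30-character figure is the commenter's assumption; the 700 bytes per entry for a descriptive snippet is my own rough assumption chosen to land in the stated 40-50GB range):
```python
# Back-of-envelope check of the index-size estimate above.
results = 58_900_000

url_only = results * 30        # ~30 characters (bytes) per bare link
with_snippet = results * 700   # assume ~700 bytes once a descriptive
                               # paragraph and metadata are included

print(f"links only:    {url_only / 1e9:.2f} GB")      # ~1.77 GB
print(f"with snippets: {with_snippet / 1e9:.2f} GB")  # ~41 GB
```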
Re: Re: Re:
When you search, the client receives a list of URLs that contain your search word, drawn from your own index and your peers'. It then verifies that the word is on each of the resulting pages and creates the snippets at that point. The snippets aren't saved anywhere in the index. Yes, this approach adds some time when waiting for results, but it ensures the resulting pages exist and removes bad links from the index.
YaCy seems to be scaling up just fine, with over 350 thousand words and almost 2 billion URLs currently.
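A stripped-down version of that verify-then-snippet step might look like the following (generic Python, standard library only; it is not YaCy's code): fetch each candidate URL, confirm the search word is really there, and build the snippet from the live page rather than from the stored index.
```python
# Sketch of client-side result verification: the index only yields candidate
# URLs; the client re-fetches each page, keeps it only if the query word is
# actually present, and builds the snippet from the live text. Simplified
# illustration (HTML is not stripped), not YaCy's implementation.

from urllib.request import urlopen

def fetch_text(url):
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def verify_and_snippet(candidate_urls, word, context=60):
    results = []
    for url in candidate_urls:
        try:
            text = fetch_text(url)
        except (OSError, ValueError):
            continue                      # dead link: silently dropped
        pos = text.lower().find(word.lower())
        if pos == -1:
            continue                      # stale index entry: dropped too
        snippet = text[max(0, pos - context):pos + context]
        results.append((url, snippet))
    return results

# results = verify_and_snippet(["https://example.org/"], "example")
```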