Introducing dSearch, dTree & Our Fight For A Better Search Engine

Peeps
12 min read · Feb 23, 2021

Over the past few months, we have been working on various ways to distribute data between peers. This led us to our work on a distributed version of a binary tree, in our attempt to create a distributed, multi-writer data structure that can be used by dWeb-based search engines like dSearch (https://peepsx.com/dsearch). The development of a decentralized search engine is a complex undertaking, since there can be no servers involved in crawling webpages on the decentralized web, or in storing their indexed data. Storing all of this data on a blockchain simply wasn't possible: a single blockchain could never handle the world's search data, and a search engine built solely on top of a blockchain would face serious scalability issues. To develop dSearch, we knew we would need to accomplish the following:

— 1.) As individual users browse the dWeb using a browser like dBrowser, the web pages they visit would need to be crawled via their machines, and the resulting data would have to be indexed and stored in a data structure that everyone on the dWeb could write to and read from (multi-writer);
— 2.) the data structure would need to be capable of holding billions of records; and
— 3.) the data structure must be balanced and represent hierarchical data, so that search, insert and delete operations take about “log(n)” steps, where “n” is the total number of elements that the tree holds.

This led us to build on top of the same sort of abstraction we used for dDrive, which allows us to store the contents and metadata of millions of files within a massive file-system-like hierarchy and distribute it between peers on the network effortlessly. Using this same model, each user of the dWeb will soon be able to spider the dWeb as they browse it and write the indexed data the spider gathers into a distributed binary tree known as a dTree, which can then be used to build a much larger tree of search data that dSearch can utilize for its overall engine. dSearch can traverse this gigantic tree of search data for specific keywords and render results to those who query its engine faster than Google can query its own servers. I know that sounds crazy, but as I'll explain below, the math backs it up. dSearch is not just an ambitious project; it truly showcases the power of the dWeb and why the dWeb is much more than a haven for freedom lovers: it is a place where we are building the apps of the future. As we roll out dTree in the next few days, I felt it was important to explain how we got to this point and what brought us to create dTree in the first place. Before diving into the specifics of dTrees and dSearch, we should briefly cover "Big O Notation" and "Binary Trees".

==========
Big O Notation
==========
The efficiency of an algorithm is judged by its computational complexity: in other words, how does the amount of work an algorithm performs grow as its input grows? The standard way to describe the complexity of an algorithm in computer science is "Big O Notation". To break down Big O Notation, consider the following:

— O(n) — Linear: the algorithm touches each input element a constant number of times.
— O(n²) — Quadratic: for each input element, the algorithm touches every element again (n × n operations).
— O(n³) — Cubic: the work grows with the cube of the input size.
— O(n!) — Factorial: impractical for all but tiny inputs (just 20 elements already means roughly 2.4 × 10¹⁸ steps).

In other words, O(n) is faster than O(n²), and so on. Keep this in mind as I break down the concept of a binary tree.
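To make those growth rates concrete, here is a small JavaScript illustration (the function names are mine, purely for demonstration): one pass over the data versus a nested pass for every element.

```javascript
// O(n): touches each input element exactly once.
function sum (arr) {
  let total = 0
  for (const x of arr) total += x
  return total
}

// O(n^2): for each element, scans the entire array again.
function hasDuplicate (arr) {
  for (let i = 0; i < arr.length; i++) {
    for (let j = 0; j < arr.length; j++) {
      if (i !== j && arr[i] === arr[j]) return true
    }
  }
  return false
}
```

Doubling the input roughly doubles the work for `sum`, but quadruples it for `hasDuplicate` — which is why higher-order complexities collapse so quickly as inputs grow.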

=========
Binary Trees
=========
A binary tree is a data structure in which each node has at most two child nodes. For example:

<code>
        J
       / \
      K   L
     / \ / \
    M  N O  P
</code>

This means that a single node (like K) can be connected to at most two child nodes. There are five important concepts surrounding binary trees, described below:

— 1.) Tree Root — The first (topmost) node of a tree.
— 2.) Tree Depth — The longest path from the tree root to a leaf, also referred to as the "height" of the tree.
— 3.) Leaf Node — A node with no children.
— 4.) Node Depth — The number of edges between a node and the tree root.
— 5.) Balanced Tree — A tree is considered balanced when the depths of any two leaves differ by at most one.
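A minimal binary search tree in plain JavaScript (illustrative only; this is not the dTree implementation) shows how these pieces fit together: the first key becomes the tree root, nodes with no children are leaves, and every lookup walks at most the tree's depth.

```javascript
// Minimal binary search tree: smaller keys go left, larger keys go right.
function insert (node, key) {
  if (node === null) return { key, left: null, right: null }
  if (key < node.key) node.left = insert(node.left, key)
  else if (key > node.key) node.right = insert(node.right, key)
  return node
}

// Searching walks a single branch, so it takes at most "depth" steps.
function search (node, key) {
  if (node === null) return false
  if (key === node.key) return true
  return key < node.key ? search(node.left, key) : search(node.right, key)
}

let root = null
for (const k of [50, 30, 70, 20, 40, 60, 80]) root = insert(root, k)
```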

Binary search trees are ordered by design, which means there is no need for a separate sorting step: data is put in its correct place the moment it's appended to the tree. If a binary tree is balanced, its search, insert and delete operations take approximately "log(n)" steps, because the tree's depth is approximately "log₂(n)". What this ultimately equates to is that a tree with 10,000 elements ends up with a depth (height) of about 14. That is incredibly small, but let's put it into perspective by describing larger trees:

— Tree of 100,000 elements = depth of 17 (estimated)
— Tree of 1,000,000 elements = depth of 20 (estimated)
— Tree of 1,000,000,000,000 elements = depth of 40 (estimated)

This means that if we add a huge number of elements to a binary tree, the speed of the tree's operations is hardly affected. Put a different way, you can locate (search for) any node in a binary tree with one million entries in no more than 20 steps. Amazingly, a tree that grows from 100,000 entries to 1,000,000 entries only grows in height by 3 levels. The same goes for a tree that grows from 10,000 entries to 100,000 entries, considering a tree of 10,000 entries has a height of about 14 and a tree of 100,000 entries has a height of about 17. In other words, if a tree with 1,000,000,000 (billion) entries grows to 1,000,000,000,000 (trillion) entries, its height only grows from about 30 to about 40.
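These depth figures come straight from rounding log₂(n) up to the nearest whole number; a one-line sketch makes them easy to verify:

```javascript
// Height of a balanced binary tree holding n elements: ceil(log2(n)).
function treeDepth (n) {
  return Math.ceil(Math.log2(n))
}

// e.g. treeDepth(1000000) is 20: any of a million entries
// is at most 20 steps from the root.
```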

===========
The dTree
===========
dTree is a distributed version of a binary tree built on top of dDatabase (https://developers.dwebx.com/protocols/ddatabase), which is a distributed append-only binary feed that can be distributed amongst peers on the dWeb. The only challenge with dTrees is that they're single-writer data structures, which means multiple users are unable to write to the same dTree. However, as you will see in a moment, that is not a limitation; in fact, it makes the system more secure and, quite frankly, increases its performance. Since a dTree is a dDatabase abstraction, it has a dWeb network address like any other dDatabase abstraction (such as a dDrive) and can be announced or discovered via dWeb's DHT. In other words, its contents can be replicated amongst the peers who are swarming it. So how does dSearch utilize dTrees, and how is data crawled on the dWeb?

===========
Crawling Data
===========
As users browse the dWeb via a browser like dBrowser, the web pages they visit are saved into their own individual dTrees. For example, if Bob and Alice are the only two users on the dWeb, dBrowser maintains an index within each of their dTrees of the web pages each of them visits — a list that continues to expand over time. Bob and Alice's dTrees remain on their devices, and only they can append data to their own individual dTrees (like any dDatabase-based abstraction). For each web page they visit, a "Distributed Object Model" (DOD) made up of the page's rendered HTML code is saved, along with its domain name, meta title and meta description. The DOD is then associated with an array of keywords, determined by the spider itself, like so:

<code>
{
  data: '<html></html>',
  title: 'Some Meta Title',
  description: 'Some Meta Description',
  domain: 'somedecentralizeddomain.net/page.html'
}, {
  keywords: ['keyword1', 'keyword2', …]
}
</code>

The DOD within any given dTree can be located quite efficiently by its attached keywords, especially when the dTree holds fewer than a trillion elements. Most (if not all) users will maintain dTrees with far fewer than a trillion DODs, considering it would be hard to browse a trillion web pages even if the dWeb grew to that size.

===============
Twigs & dFunnel
===============
Even though each user maintains their own "baby spider", as well as their own dTree full of DODs, this is just a small piece of a much larger tree of data that dSearch has to iterate over. It would be impractical for dSearch to work out which dDatabases on dWeb's DHT are dTrees that contain crawled data: it would have to download the header block of every single dDatabase on the DHT (which could eventually grow to billions or even trillions), digest each block to see if it belongs to the "dsearch" protocol, and then query every one of these distributed feeds to see what they contain. If we plan on returning search results faster than Google, that approach was never going to work. We had to find a way to constantly replicate the data from each individual dTree (known as a "Twig") full of DODs into a single dTree that dSearch users could live-sync amongst each other. This is where things get interesting. How could we do this without introducing centralization into the mix? No sweat!

When users first start using dBrowser, a Twig is created, announced on dWeb's DHT and then announced on ARISEN using the "dsearch" smart contract. In a way, ARISEN showcases dSearch's entire tree map, since the "dsearch" on-chain database contains every Twig's network address. That's great, but it gets even better. We then developed a backend data funnel, known as dFunnel, that sits on a server (I know, bad word — hold on though) in an undisclosed location. dFunnel constantly checks ARISEN for new Twigs, checks each Twig for updates, and replicates their new or updated DODs into a single dTree that is then announced on dWeb's DHT. This is OK, because this is how the dWeb works! Think about how you launch a dDrive on the dWeb: it has to start from somewhere, but once it's used by others, it becomes more and more distributed. On top of that, when the original creator updates their dDrive, all of those who are seeding it get those updates. dFunnel is simply distributing dSearch's global search data, just as we distribute dSearch's application code.

In essence, dFunnel live-syncs all of the Twigs broadcasted by dSearch's users into a single dTree and then distributes it to users on the dWeb, just as you would distribute a dDrive. As a Twig is updated, dFunnel updates the associated DOD within the single dTree it distributes. The crawled data that dSearch utilizes is therefore updated by the millisecond, by users from around the world, since dFunnel is live-replicating every single Twig broadcasted by dBrowser's users as they browse the dWeb. This is something Google could never claim — and did I mention that search results will be returned in a matter of a few milliseconds? Haha, yea.

dFunnel is able to filter through every single DOD to make sure that each webpage is only mentioned once and that the latest version of each particular DOD is appended to the root dTree that dFunnel manages. This ensures that the DODs dSearch is querying are always up-to-date with the websites they represent, and that there are no duplicates. For example, if I update the meta title of a page on my website, the moment someone visits it (like me, for instance), it would be saved in their Twig, dFunnel would replicate this change, and this DOD would end up as the latest version stored within the massive distributed dTree of DODs that dFunnel manages. If I were to change a meta title on my webpage on the centralized web, it could take Google days or even weeks to show the updated title in their search results. With dSearch, this happens within seconds of a user visiting (and therefore crawling) the newly updated page.
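A rough sketch of that dedupe-and-update step (the shape below is my own guess for illustration — field names like `version` are hypothetical, since dFunnel's internals haven't been published): keep one DOD per domain and let the newest version win.

```javascript
// Hypothetical funnel index: one DOD per domain, the newest version wins.
const rootIndex = new Map()

function replicate (dod) {
  const existing = rootIndex.get(dod.domain)
  // Only replace the stored DOD when the incoming one is newer.
  if (!existing || dod.version > existing.version) {
    rootIndex.set(dod.domain, dod)
  }
}

replicate({ domain: 'a.dcom/index.html', version: 1, title: 'Old Title' })
replicate({ domain: 'a.dcom/index.html', version: 2, title: 'New Title' })
```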

============
Part 2
============

============
dIndex
============
dSearch is able to easily retrieve results from its root dTree using a new library called "dIndex". dIndex exposes simple add, get and put methods, which allow it to save and retrieve data like so:

— Adding a record to the dTree:
<code>
const DIndex = require('dwebindex')
const ram = require('random-access-memory')

const index = new DIndex(ram, { valueEncoding: 'json' })

index.add({
  metaTitle: 'My Website About Freedom',
  metaDescription: 'This website is about freedom.',
  domainName: 'websiteforfreedom.dcom/index.html',
  html: '<source-from-page>'
}, {
  keywords: ['freedom', 'website-about-freedom', 'freedom-website']
})
</code>

— Looking up a record in a dTree, by keyword:
<code>
const find = index.lookup('freedom')

find.on('data', (document) => {
  console.log('DODs with keyword "freedom":', document)
})
</code>

This would return the DOD we just added, plus any others that use the keyword “freedom”.

The “dindex” library will debut with the “dtree” library in the next few days.

============
How We Rank Search Data
============
Ranking search data is as easy as pulling all of the DODs stored for a given keyword and applying an algorithm that displays the data in a certain order. While I'm not ready to speak on our search algorithm just yet, when it comes to dWeb-based webpages we're able to judge the true popularity of a webpage by the number of peers seeding the dDrive the webpage belongs to. That makes it incredibly hard for someone to manipulate the algorithm, since it will be heavily based on real popularity. While it is true that someone could create a number of fake peers, there are many network heuristics related to dDrives that allow us to distinguish actual peers from robotic peers — and that's just the tip of the iceberg.

While popularity is just one aspect, we're also developing algorithms that parse HTML code and grade the quality of the code itself, while also determining the frequency with which keywords are mentioned within the code, the tags they are mentioned within (whether in a page title or simply in a paragraph) and more. While these algorithms will certainly be in their infancy, they will allow us to improve dSearch's accuracy, especially as the dWeb grows in size. dSearch, though, won't simply be for the dWeb.
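As a purely hypothetical illustration of tag weighting (the weights and function below are invented for this post, not dSearch's actual algorithm), a keyword occurrence in the page title could simply count more than one in the body:

```javascript
// Toy scorer: every occurrence counts once; title occurrences earn a 10x bonus.
function keywordScore (html, keyword) {
  const occurrences = (text) => {
    const matches = text.match(new RegExp(keyword, 'gi'))
    return matches ? matches.length : 0
  }
  // Extract the <title> contents, if any, for the weighted bonus.
  const title = (html.match(/<title>([\s\S]*?)<\/title>/i) || ['', ''])[1]
  return occurrences(html) + occurrences(title) * 10
}
```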

============
Here We Come Google
============
Since dBrowser allows its users to browse both the centralized and decentralized webs, it can easily place a user's crawled activity from both into separate dTrees. This means we can keep two separate root dTrees, one for each web, full of crawled data. A recent study showed that Google is only capable of crawling a small portion of the centralized web, and distributed search engines created nearly ten years ago proved that user-powered crawlers are able to venture off into what we call the "Deep Web" much more efficiently. For this reason, we are creating dSearch in a way where users can toggle between searching the centralized web and the decentralized web, which will allow dSearch to become a true decentralized and distributed alternative to Google itself, while also allowing users like yourself to journey through the freedoms the dWeb provides by searching all the latest dWeb content.

While IPFS allows users to distribute files on a peer-to-peer basis, our focus has been on distributing data, which has allowed us to develop a platform for building applications that simply work better than the centralized alternatives they're going to have to compete with. This is allowing us to bring decentralized live-streaming to dApps like dVideo, decentralized web crawling to dSearch and decentralized voice calling to dMessenger/dSocial. As we begin the rollout of platforms like dSocial and dSearch in the coming weeks, I think it will become obvious why we spent so much time developing dWeb's platform. The difference in quality can only be explained by the analogy of driving a Toyota and then jumping into a BMW. The difference in performance, experience and usability will be clear to anyone using our applications, and nothing will drive developer adoption more than a powerful application. Everyone wants to build the best, and when they can see it, use it and feel it, they'll know where they'll be developing their next project.

We’re going to let the code talk. LET THE ROLL OUT BEGIN.

We will begin debuting dSocial’s code via the following links, in the coming days and weeks ahead:

https://github.com/getsocial/dsocial-web
https://github.com/getsocial/dsocial-desktop
https://github.com/getsocial/dsocial-mobile

Let the drum roll begin.

Thanks for being a part of our fight for online freedom.

Jared Rice Sr.
