
Technical Feasibility of Building Hitchhiker’s Guide to the Galaxy, i.e. Offline Web Search – Part I

Memkite (@memkite) is a startup building the equivalent of the Hitchhiker’s Guide to the Galaxy; see the iOS App Store for a very early (and small) version. In this blog post we’ll discuss the technical feasibility of building it.

The BBC recently released a 30th anniversary game for The Hitchhiker’s Guide to the Galaxy. Besides being an interesting work of fiction, the book presents many interesting technological innovations. One of them is the Hitchhiker’s Guide to the Galaxy itself (“the Guide”), described as “the standard repository for all knowledge and wisdom” (personal belief: it can be created).

1. What would a 2014 realization of “the Guide” look like?

Web search engines, such as the 5 major ones (Baidu, Bing, Google, Yahoo and Yandex), offer much of the functionality one would want in an implementation of the Guide, i.e. “the standard repository for all knowledge and wisdom”, perhaps in a “Spotify for Search” (but offline) or “Flipboard Go Extra Large” style. But the problem with web search engines is that they don’t work well in outer space:

  1. when you’re not connected to the Internet, e.g. when travelling far into space (as is the intention for “the Guide”),
  2. search latency in space tends to be VERY high. Even for nearby planets such as Mars (and the Mars One expedition), the latency would give a very poor user experience: even in 2018, when Mars is only 57.6 million kilometers away, a signal travelling at full lightspeed (i.e. 0.299792458 million kilometers/second) would need 3.2 minutes each way, so a ping takes 6.4 minutes at best, and a search result takes even longer due to communication protocol overhead.
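
The 6.4-minute figure is a round trip at the speed of light. A minimal sketch of that arithmetic, assuming the 57.6 million km distance quoted above and ignoring all protocol overhead (the function and constant names are illustrative only):

```python
# Back-of-the-envelope signal delay between Earth and Mars.
SPEED_OF_LIGHT_KM_S = 299_792.458   # exact value, km per second
MARS_DISTANCE_KM = 57.6e6           # closest approach in 2018, as quoted in the post


def one_way_delay_s(distance_km: float) -> float:
    """Seconds for a light-speed signal to cover distance_km."""
    return distance_km / SPEED_OF_LIGHT_KM_S


def round_trip_delay_s(distance_km: float) -> float:
    """Seconds for a request to arrive and the reply to come back (a "ping")."""
    return 2 * one_way_delay_s(distance_km)


one_way_min = one_way_delay_s(MARS_DISTANCE_KM) / 60
ping_min = round_trip_delay_s(MARS_DISTANCE_KM) / 60
print(f"one-way: {one_way_min:.1f} min, ping: {ping_min:.1f} min")
```

A real search request would need several such round trips (connection setup, query, response), so the user-visible latency would be a multiple of the ping time.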

2. Ok, so the Guide will resemble Web Search, but how to get that working in space?

What if you could put a web-search-like index and content on a device? Storage technologies can enable that. Fortunately, storage is growing tremendously fast in both capacity and speed, e.g. last year Kingston released a (tiny) 1 TeraByte SSD packaged in a USB stick. This USB stick has more storage capacity than Google and Altavista had combined on their search clusters back in the late 90s, and the latency of SSD storage is roughly 1/100th of that of individual disks back then. Lexar recently (January 2014) presented the 512 GB (0.5 TB) SD card Lexar Professional 800X.

Another example of recently launched high-capacity mobile storage is Sandisk’s 128 GB MicroSDXC card. It is so space-efficient, with dimensions of 15 mm x 11 mm x 1.0 mm (0.59 in x 0.43 in x 0.04 in), that you can fit approximately 0.775 TeraByte of storage per cubic cm(!). For comparison, Apple’s iPad Air is approximately 300 cubic cm (240 mm x 169.5 mm x 7.5 mm).

Figure: storage on MicroSDXC cards has increased by a factor of 1000(!) in just 9 years

Flash/SSD storage works nicely, but another very promising forthcoming storage technology is RRAM (developed by Crossbar Inc). It allows storing 1 TeraByte on a 2 square centimeter chip, and promises 20x lower power than Flash-based storage.

3. Cool enough.., but I don’t care much for late 90s Web Search, this is 2014 and I am going to space!

Good point, but let’s instead focus on the information you need from “the Guide” while space travelling (let’s for the sake of simplicity assume that you are a software engineer, but most of the sources below are of general interest), e.g.

  1. Knowledge sources, e.g. Wikipedia, Stackoverflow, Quora, Academic Papers, Non-fictional books, Reddit, news.ycombinator.com.
  2. Commercial/Shopping-related information, e.g. Ebay, Amazon.com, Alibaba, Etsy, Rakuten and Yelp Listings
  3. Large sources of content (broad wrt types of content), e.g. Facebook.com, Blogger.com, Tumblr.com, Yahoo.com, Twitter.com and WordPress.com
  4. App Stores (plenty of time to try apps and games between destinations in space..), e.g. iOS App Store, Google Play and Windows 8.
  5. Open Source repos, e.g. github.com, bitbucket.org etc.
  6. Entertainment (e.g. Netflix movies, TV shows, Youtube, Vine, Instagram, Flickr, fiction books, etc.)
  7. All social network updates relevant for you.
  8. News updates, e.g. pick most popular content URLs published on Twitter

How would this work for you?

4. Yes, this will probably solve almost all my information needs, but this is infeasible to provide in space?!

Figure: early Lego visualization of rough estimates of data source sizes on a 2 cm^2 1 TeraByte RRAM chip

I understand your scepticism, but allow me to explain, with estimates for some of the above data sources, how this can be feasible to provide in space, even on a single mobile device. You’ve probably used devices with FAT filesystems before, but they’re pretty thin compared to what I am going to show you.

| Data Source | Size in Gigabytes | Aggregate Size in Gigabytes | Aggregate cm^3 assuming stack of 128GB MicroSDXC | Aggregate cm^2 assuming 1 TeraByte RRAM | Comment |
| --- | --- | --- | --- | --- | --- |
| English Wikipedia | 10.67 | 10.67 | 0.013 | 0.021 | XML compressed with bzip2 |
| StackOverflow.com | 15 | 25.67 | 0.033 | 0.051 | XML compressed with 7z |
| Quora.com (estimate) | 30 – 60 | 55.67 – 85.67 | 0.071 – 0.110 | 0.111 – 0.171 | Estimated to be 2-4 times larger than StackOverflow, which is supported by comparing site:quora.com and site:stackoverflow.com search queries on Bing and Google |
| Last 10 years of ALL academic papers (estimate) | 3-10 GB per year (for 1.3-1.5M articles), 30-100 GB per decade | 85.67 – 185.67 | 0.110 – 0.239 | 0.171 – 0.371 | Rough estimate based on comparing with Wikipedia |
| 500000 non-fiction textbooks | 50-60 | 130.67 – 245.67 | 0.168 – 0.316 | 0.261 – 0.491 | Rough estimate assuming an average of 200-300 pages per book |
| App Store pages | 10-30 | 140.67 – 275.67 | 0.181 – 0.355 | 0.281 – 0.551 | 2-5 million app description pages in total |
| News and content updates per day | 200 | 340 – 475 | 0.438 – 0.612 | 0.680 – 0.950 | There are roughly 500M tweets published per day. In a sample of 10M tweets there were about 234K URIs with more than 1 occurrence (i.e. a rough popularity and quality measure). Scaling this up to all tweets gives 234K*(500M/10M) = 11.7M URIs. Assuming each URI points to a document with approximately 2400 words (the average for top-ranked documents on the Web), and you only keep the top 2M of those 11.7M documents, you get roughly 2 times the size of Wikipedia per day. Assume you keep the last 10 days, i.e. you only need to download 20 GB / day, corresponding to approximately 3 minutes on a 1 GBit/s connection |
| Wikipedia Images (baseline) | 20 | 360 – 495 | 0.464 – 0.638 | 0.720 – 0.990 | Wikipedia has images with good metadata, so they can be used to spice up results in all/most rows above. To get a full set of strongly compressed images (e.g. with HEVC) for all text data above, one can probably multiply the data sizes in each row by 2-3 (or pi) for a rough estimate. Assuming one sets aside 2 Terabyte for images in HEVC quality and size (ref. the 2KB representation of Mona Lisa), one could store almost 1 billion images. |
| Video | 5300 | 5660 – 5795 | 7.303 – 7.477 | 11.320 – 11.590 | Netflix has about 8900 movies. Assuming each movie is 2 hours (120 minutes) and is encoded with good quality, it would take roughly 8900*2*300 MB = 5.3 TeraByte (24 TB in HD). I have read that Netflix uses approximately 32.7% of the Internet’s bandwidth capacity, so preinstalling Netflix on mobile devices may have a noticeable effect on the Internet |
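
The footprint columns and the estimates in the comment column of the table above follow from a few unit conversions. The sketch below reproduces them from figures quoted in this post; the function and variable names are illustrative, 1 TeraByte is assumed to mean 1000 GB, and the 300 MB in the video estimate is read as megabytes per hour of video.

```python
# Reproduce the table's size-to-footprint conversions from the post's figures.
CARD_GB = 128                        # Sandisk MicroSDXC capacity
CARD_VOLUME_CM3 = 1.5 * 1.1 * 0.1    # 15 mm x 11 mm x 1.0 mm, expressed in cm
RRAM_CM2_PER_GB = 2 / 1000           # 1 TB on a 2 cm^2 chip (assumes 1 TB = 1000 GB)


def microsd_volume_cm3(size_gb: float) -> float:
    """Volume of a stack of 128 GB MicroSDXC cards holding size_gb."""
    return size_gb / CARD_GB * CARD_VOLUME_CM3


def rram_area_cm2(size_gb: float) -> float:
    """Chip area needed for size_gb at the quoted RRAM density."""
    return size_gb * RRAM_CM2_PER_GB


# Storage density of the MicroSDXC card in TB per cubic centimetre.
density_tb_per_cm3 = CARD_GB / CARD_VOLUME_CM3 / 1000

# Running totals over the first two table rows (sizes in GB).
rows = [("English Wikipedia", 10.67), ("StackOverflow.com", 15.0)]
aggregate_gb = 0.0
for name, size_gb in rows:
    aggregate_gb += size_gb
    print(f"{name}: {aggregate_gb:.2f} GB, "
          f"{microsd_volume_cm3(aggregate_gb):.3f} cm^3, "
          f"{rram_area_cm2(aggregate_gb):.3f} cm^2")

# News row: scale popular URIs in a 10M-tweet sample up to 500M tweets per day.
popular_uris_per_day = 234e3 * (500e6 / 10e6)

# News row: minutes to fetch 20 GB over a 1 Gbit/s link (8 bits per byte).
download_minutes = 20 * 8 / 1.0 / 60

# Video row: 8900 movies x 2 hours x 300 MB per hour, converted to TB.
video_tb = 8900 * 2 * 300 / 1e6

print(f"density: {density_tb_per_cm3:.3f} TB/cm^3")
print(f"popular URIs per day: {popular_uris_per_day / 1e6:.1f}M")
print(f"daily download: {download_minutes:.1f} min")
print(f"video: {video_tb:.2f} TB")
```

Printed values can differ from the table in the last digit, since the table's figures appear to be truncated rather than rounded.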

Figure: visualization of storing what is presented in the table above. Setting: (simulated) RRAM chips placed on an iPad Air

Conclusion

We have shown that building the Hitchhiker’s Guide is highly likely to be feasible data- and storage-wise. In the next posting we will talk more about algorithms, latency and other types of enabling hardware (e.g. CPUs, GPUs and batteries) needed to search this efficiently.

Figure: screenshot

This is what we’re working on; feel free to reach out (investors, mobile hardware vendors and content providers in particular). We still have lots of work to do. So far, our app looks something like this. It has instant search, which works really smoothly, even when fueled by 80GB of data.

Best regards,
Memkite Team: Thomas (thomas@memkite.com), Torbjørn (torb@memkite.com), Amund (amund@memkite.com)

Link to discussion about this blog post on Hacker News: https://news.ycombinator.com/item?id=7507566

This blog post is partially funded by the EU project FP7-610582 ENVISAGE: Engineering Virtualized Services (http://www.envisage-project.eu).



About Amund Tveit (@atveit - amund@memkite.com)

Amund Tveit works at Memkite on developing large-scale Deep Learning and Search (Convolutional Neural Network) with Swift and Metal for iOS (see deeplearning.education for a Memkite app video demo). He also maintains the deeplearning.university bibliography (github.com/memkite/DeepLearningBibliography).

Amund previously co-founded Atbrox, a cloud computing/big data service company (partner with Amazon Web Services), which also made some “sweat equity” startup investments in US and Nordic startups. His presentations about Hadoop/MapReduce algorithms and search were among the top 3% of all SlideShare presentations in 2013, and his blog posts have been frequently quoted by Big Data industry leaders and featured on the front pages of YCombinator News and Reddit Programming.

He previously worked for Google, where he was tech lead for Google News for iPhone (mentioned as “Google News Now Looks Beautiful On Your iPhone” on Mashable.com), led a team measuring and improving Google services in the Scandinavian countries (Maps and Search) and worked as a software engineer on infrastructure projects. Other work experience includes telecom (IBM Canada) and insurance/finance (Storebrand).

Amund has a PhD in Computer Science. His publications have been cited more than 500 times. He also holds 4 US patents in the areas of search and advertisement technology, and a pending US patent in the area of brain-controlled search with consumer-level EEG devices.

Amund enjoys coding, in particular Python, C++ and Swift (iOS)

9 Responses to "Technical Feasibility of Building Hitchhiker’s Guide to the Galaxy, i.e. Offline Web Search – Part I"

  • Anon
    April 1, 2014 - 2:28 pm

    You’re *drastically* underestimating scientific paper and book data sizes. By a factor of about 100, compressed, based on my collections.

    • memkite
      May 2, 2014 - 12:51 pm

      It was only a rough estimate. Here is an improved one:

      Academic papers typically contain between 3000 and 6000 words per paper, i.e. with 1.3-1.6 million papers per year that ranges between 3900 million (3.9 billion) and 9600 million (9.6 billion) words per year. English Wikipedia (which is approximately 10.6 GB compressed) has approximately 2.6 billion words, so scaling up gives: 3.9 billion words => 3.9/2.6 * 10.6 = 15.9 GB/year, and 9.6 billion words => 9.6/2.6 * 10.6 = 39.1 GB/year. If one keeps only papers that have been cited at least once, i.e. approximately 10% of academic papers according to “The Rise and Rise of Citation Analysis”, these yearly figures would represent the amount needed to store 10 years of all (at least once cited) papers.

  • Moschops
    April 1, 2014 - 3:47 pm

    What do you envisage the physical device looking like? I gather (and hope) from the above that this isn’t going to be “just” a software+data package to be run on a laptop or tablet of convenience.

  • Mobile Eats the Cloud | Memkite
    April 24, 2014 - 6:53 am

    […] Feel free to check out our previous blog post – Technical Feasibility of Building Hitchiker’s Guide to the Galaxy, i.e. Offline Web Search … […]

  • Nicholas Joll
    June 27, 2014 - 10:36 am

    Interesting. But: no updates to the content of the envisaged device via the ‘sub-etha net’, then? Ah – I’ve just looked at the prototype app – the idea is that the device *can* work offline, i.e. if it needs to.

    • memkite
      July 8, 2014 - 11:30 am

      Yes, the idea is that it can update when you are online, and when you are offline you have as fresh content as when you last updated.

      btw. interesting book you have, might check it out :)

      Best regards,
      Amund
