On-Device Search Software Technology for iOS
Memkite has built search software technology that exploits the near-supercomputer capabilities of mobile, tablet and wearable hardware. In other words, it enables building any custom large-scale search solution (or even an actual implementation of The Hitchhiker's Guide to the Galaxy) that runs entirely on the device itself.
Mobile phones (smartphones and "phablets"), tablets and wearable devices are still primarily treated as thin clients: they require a high-latency network call to cloud services to get things done, e.g. for search. Memkite believes that devices themselves will take over much of the functionality that has required cloud services in the past. The reason is that mobile devices are quickly becoming mobile supercomputers, both CPU-wise (64-bit CPUs such as Apple's A7 and Nvidia's 192-core Tegra K1) and in terms of massive amounts of low-latency storage (e.g. Kingston's 1 terabyte HyperX Predator USB stick, Lexar's 0.5 terabyte Professional 800x SD card, SanDisk's 128 GB MicroSD card shown in the figure below, and forthcoming RRAM from Crossbar that fits 1 terabyte on a 2 cm² chip). In addition, the latency, privacy and availability of search are greatly improved when it is served from the device itself.
Figure: Example of how Memkite Search can look on an iPad
1. Examples of Memkite Search Scale Experiments on iPad
- Inverted Index Search – merged 5 posting lists of 100 million URIs each in 25 milliseconds on an iPad Mini (note: with proper stop-word and bigram handling, this corresponds to a billion+ document index)
- Prefix Index (Instant Index) – 0.5 billion index entries (a 60 GB index) on an iPad Mini
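To illustrate what the inverted-index experiment above measures, here is a minimal sketch (function names and data are illustrative, not Memkite's actual implementation) of intersecting sorted posting lists of document IDs, the core operation when all query terms must match:

```python
def intersect(a, b):
    """Intersect two sorted posting lists (lists of document IDs)."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:        # document contains both terms
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_all(lists):
    """AND-query over any number of posting lists, smallest list first
    so intermediate results stay as short as possible."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
    return result


# Example: only document 4 contains all three terms.
print(intersect_all([[1, 2, 3, 4], [2, 4, 6], [4, 8]]))  # [4]
```

Merging 5 lists of 100 million entries each in 25 ms implies billions of comparisons per second, which is why a compact on-device index layout (rather than Python lists) matters in practice.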
Figure: MicroSD cards have increased in capacity by a factor of 1,000 in less than 10 years(!)
2. Memkite Client Software – Large-Scale Search Infrastructure for Apps
The primary purpose of the Memkite client software is to enable low-latency search in large (gigabyte- to terabyte-sized) indices on mobiles, tablets and wearable hardware (e.g. with MicroSD, SD card, SSD or RRAM storage technology).
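The "instant" prefix index mentioned above can be sketched as a binary search over a sorted term list; this is an illustrative assumption about the approach, not Memkite's actual code. A production on-device index would be memory-mapped from flash storage rather than held in an in-memory list:

```python
import bisect


def complete(terms, prefix, limit=10):
    """Return up to `limit` completions for a typed prefix.

    `terms` must be sorted; bisect_left finds the first term >= prefix,
    and matching terms then sit contiguously from that position.
    """
    i = bisect.bisect_left(terms, prefix)
    result = []
    while i < len(terms) and terms[i].startswith(prefix) and len(result) < limit:
        result.append(terms[i])
        i += 1
    return result


terms = ["app", "apple", "apply", "banana", "band"]
print(complete(terms, "app"))  # ['app', 'apple', 'apply']
```

Each keystroke costs only O(log n) to locate the prefix range, which is what makes instant (search-as-you-type) behavior feasible even for an index with 0.5 billion entries.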
3. Memkite Cloud Software (Automated and Scalable Indexing of Data)
The primary purpose of the Memkite Cloud is to fetch or import data, then process and index it, either for preinstallation on mobile or wearable devices or to provide updated indices through the API layer (e.g. via REST calls). Examples of data Memkite can efficiently index and make available on mobile devices include:
- Large sets of Web Pages, e.g. corporate web pages
- Media: Books, Newspapers
- Semi-Structured Content: e.g. XML
It uses a mixture of open source components (e.g. Hadoop, Nginx and Tornado) combined with our custom code (in particular for ETL/data retrieval, processing and indexing). There are few tie-ins to a specific cloud platform: with small changes in layer 1 (automated cloud provisioning) we can easily move from Amazon Web Services to other platforms (e.g. Google Compute Engine, Rackspace, Azure, SoftLayer, Hetzner) or to other data centers. The primary requirement is that the machines (or virtual machines) run Linux or OS X.
4. Examples of What Can Be Made Searchable on a Mobile Device (in the Near Future)
Assuming that mobile devices will soon get 1 terabyte of storage (the MS Surface tablet already has 0.5 terabyte, and the RRAM chip from Crossbar promises 1 terabyte on a 2 cm² chip; see the table below for what could be stored on it), one can store enormous amounts of data on-device, e.g.:
- Knowledge sources, e.g. Wikipedia, Stackoverflow, Quora, academic papers, non-fiction books, Reddit, news.ycombinator.com.
- Commercial/Shopping-related information, e.g. Ebay, Amazon.com, Alibaba, Etsy, Rakuten and Yelp Listings
- Large sources of content (broad wrt types of content), e.g. Facebook.com, Blogger.com, Tumblr.com, Yahoo.com, Twitter.com and WordPress.com
- App Stores (plenty of time to try apps and games between destinations), e.g. iOS App Store, Google Play and the Windows Store.
- Open Source repos, e.g. github.com, bitbucket.org etc.
- Entertainment, e.g. Netflix movies, TV shows, YouTube, Vine, Instagram, Flickr, fictional books, and more.
- All social network updates relevant to you.
- News updates, e.g. the most popular content URLs published on Twitter.
| Data Source | Size (GB) | Aggregate Size (GB) | Aggregate volume (cm³, stack of 128 GB MicroSDHC) | Aggregate area (cm², 1 TB RRAM) | Comment |
|---|---|---|---|---|---|
| English Wikipedia | 10.67 | 10.67 | 0.013 | 0.021 | bzip2-compressed XML dump |
| Stackoverflow.com | 15 | 25.67 | 0.033 | 0.051 | 7z-compressed XML dump |
| Quora.com (estimate) | 30 – 60 | 55.67 – 85.67 | 0.071 – 0.110 | 0.111 – 0.171 | Estimated at 2–4 times larger than Stackoverflow; result counts for site:quora.com vs. site:stackoverflow.com queries on Bing and Google support this |
| Last 10 years of ALL academic papers (estimate) | 3 – 10 per year (1.3 – 1.5M articles), 30 – 100 per decade | 85.67 – 185.67 | 0.110 – 0.239 | 0.171 – 0.371 | Rough estimate based on comparison with Wikipedia |
| 500,000 non-fiction textbooks | 50 – 60 | 130.67 – 245.67 | 0.168 – 0.316 | 0.261 – 0.491 | Rough estimate assuming an average of 200 – 300 pages per book |
| App Store pages | 10 – 30 | 140.67 – 275.67 | 0.181 – 0.355 | 0.281 – 0.551 | 2 – 5 million app description pages in total |
| News and content updates per day | 200 | 340 – 475 | 0.438 – 0.612 | 0.680 – 0.950 | Roughly 500M tweets are published per day. In a sample of 10M tweets, about 234K URIs occurred more than once (a rough popularity and quality measure); scaled to all tweets, this is 234K × (500M/10M) = 11.7M URIs. Assuming each URI points to a document of approximately 2,400 words (the average for top-ranked documents on the Web) and only the top 2M of the 11.7M documents are kept, this is roughly twice the size of Wikipedia per day. Keeping the last 10 days means downloading only 20 GB/day, corresponding to approximately 3 minutes on a 1 Gbit/s connection |
| Wikipedia images (baseline) | 20 | 360 – 495 | 0.464 – 0.638 | 0.720 – 0.990 | Wikipedia images have good metadata, so they can be used to spice up results in all/most rows above. For a full set of strongly compressed images (e.g. with HEVC) covering all the text data above, one can probably multiply the size in each row by 2–3 (or pi) for a rough estimate. Assuming one sets aside 2 terabytes for images in HEVC quality and size (cf. a 2 KB representation of the Mona Lisa), one could store almost 1 billion images |
| Video | 5300 | 5660 – 5795 | 7.303 – 7.477 | 11.320 – 11.590 | Netflix has about 8,900 movies; assuming each movie is 2 hours (120 minutes) and encoded at good quality, this takes roughly 8,900 × 2 × 300 MB = 5.3 terabytes (24 TB in HD). Netflix reportedly accounts for approximately 32.7% of the Internet's bandwidth capacity, so preinstalling Netflix on mobile devices may have a noticeable effect on the Internet |
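The back-of-the-envelope numbers in the last rows of the table can be reproduced with a short calculation (all figures below are the table's own estimates, not new measurements):

```python
# News updates: scale a 10M-tweet sample (with ~234K popular URIs)
# up to the ~500M tweets published per day.
uris_per_day = 234_000 * (500_000_000 // 10_000_000)  # 11,700,000 URIs

# Download budget: 20 GB/day over a 1 Gbit/s connection.
# 20 GB ~= 160 Gbit, so ~160 seconds, i.e. roughly 3 minutes.
download_seconds = 20 * 8 / 1.0

# Netflix catalog: 8,900 movies x 2 hours x ~300 MB per hour
# at good (non-HD) quality.
netflix_tb = 8_900 * 2 * 300 / 1_000_000  # ~5.34 TB

print(uris_per_day, download_seconds, netflix_tb)
```

The 11.7M-URI and ~5.3 TB figures match the table's comments; the ~160 s download time is what the "approximately 3 minutes on a 1 Gbit/s connection" estimate rounds from.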