Vectorisation: The next big disruptor in the database space
- Kinetica report forecasts 'the new era of Big Data Parallelism'
- Once in a decade breakthrough
- Big vectorisation systems already used by hyperscalers
- The technology could have big impact for smaller companies
Every so often a breakthrough comes along that reshapes the database industry. The latest is vectorisation which, in essence, means converting an algorithm from operating on a single value at a time to operating on a set of values at a time. In other words, it is the ability to perform a single mathematical operation on a list (or “vector”) of numbers in one step.
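As a rough illustration of that shift from one value at a time to a set of values at a time, here is a minimal Python sketch. It is a conceptual model only: plain Python does not issue real vector instructions, and the names `LANES`, `add_scalar` and `add_vectorised` are hypothetical, chosen for this example rather than taken from the Kinetica report.

```python
# Conceptual model only: plain Python does not issue real SIMD instructions.
LANES = 4  # hypothetical vector width: values handled per "instruction"


def add_scalar(a, b):
    """One value at a time: one addition per step."""
    out = []
    steps = 0
    for x, y in zip(a, b):
        out.append(x + y)
        steps += 1
    return out, steps


def add_vectorised(a, b, lanes=LANES):
    """A set of values at a time: each step handles up to `lanes` pairs."""
    out = []
    steps = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        # Stand-in for a single vector add over the whole chunk.
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        steps += 1
    return out, steps


a = list(range(16))
b = [10] * 16
scalar_result, scalar_steps = add_scalar(a, b)
vector_result, vector_steps = add_vectorised(a, b)
```

Both functions produce the same sums; what changes is the number of steps, which falls roughly in proportion to the assumed vector width.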
In fact, although vectorisation is attracting a lot of attention right now, it has a long history and was already established in supercomputing by the 1980s. Vectorisation is attractive because the operations can easily be performed in parallel by supercomputers and by multiprocessor systems, resulting in big gains in performance and workflow at lower cost.
Indeed, these days virtually all desktop Central Processing Units (CPUs) provide some form of support for vector operations, in which a Single Instruction is applied to Multiple Data (SIMD). The term SIMD dates back to 1966, when it was introduced in Flynn's classification of computer architectures. Vectorisation lets database scientists and administrators do more with less: the extra processing power permits simpler data structures, which in turn reduce data storage costs, increase flexibility and cut the time spent engineering the data.
A new publication from Kinetica, “Vectorization: The New Era of Big Data Parallelism”, provides a detailed exposition of where vectorisation came from and where it is going. Kinetica (“The Database for Time & Space”) is the Arlington, Virginia, US-based specialist in the real-time analysis of massive, fast-moving data sets. Vector processing is gaining popularity as new software is introduced that efficiently exploits modern hardware. The report explains that “a vector points to a memory position in a large array of rows and columns. The columns in the memory could be a simple stream of variables (e.g. sensor readings, GPS coordinates). Or the array can be a relational database table (e.g. customer transaction records).” You can see the potential.
A little further into the report Kinetica waxes lyrical, offering the analogy that “Vector processing is like an orchestra. The control unit is the conductor, the instructions are a musical score. The processors are the violins and cellos. Each vector has only one control unit plus dozens of small processors. Each small processor receives the same instruction from the control unit. Each processor operates on a different section of memory. Hence, every processor has its own vector pointer. Here’s the magic: that one instruction is applied to every element in the array. Only then does the processor move on to the next instruction.” That’s why SIMD is often called “data parallel” processing. The report continues: “What vectors do best is stepping through a large block of memory with a serial list of instructions. Like the conductor in the orchestra, the entire score is completed before any other music is considered.”
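The orchestra analogy can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Kinetica's implementation: `run_score` and its `processors` parameter are hypothetical names, and the sections are visited one after another here, whereas real SIMD hardware would work on them simultaneously.

```python
def run_score(memory, score, processors=4):
    """Apply each instruction in `score` to every element of `memory`.

    `memory` is split into one contiguous section per processor. The outer
    loop plays the control unit, issuing one instruction at a time.
    """
    size = -(-len(memory) // processors)  # ceiling division
    sections = [
        range(p * size, min((p + 1) * size, len(memory)))
        for p in range(processors)
    ]
    for instruction in score:        # conductor: next instruction in the score
        for section in sections:     # every processor gets the same instruction
            for i in section:        # applied to its own section of memory
                memory[i] = instruction(memory[i])
    return memory


readings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# A two-instruction "score": double every value, then add one.
score = [lambda v: v * 2, lambda v: v + 1]
result = run_score(list(readings), score)
```

The point of the model is the loop order: the first instruction reaches every element in the array before the second instruction starts, which is the “data parallel” behaviour the report describes.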
Vector processing is particularly well suited to machine learning, data compression and decompression (of images, for example), cryptography, multimedia including audio and video, speech and handwriting analysis, and core database operations such as sorting, calculations and aggregations.
Consumers interact with vectorisation without being aware of it
Hyperscalers were the first companies to capitalise on vectorisation, using it to power search across their consumer services and near real-time product recommendations. Then, as now, the webscale platforms were in intense competition and kept their databases and database tools closely guarded. These systems were expensive, sprawling, bespoke solutions optimised for specific purposes such as content and physical products. What they couldn’t do was scale, integrate with other platforms or connect with the cloud.
That’s why the search is on for a solution that will allow smaller companies and organisations, with their traditional data warehouses, unvectorised collections of data and on-premises systems, to subscribe to managed machine learning systems that can solve problems at great speed without an expensive, time-consuming and processing-intensive conversion of data. Doing that is very difficult. Handling massive data for really big applications isn’t only about converting the data into a vectorised format; it is also about making it actionable in a real-time engine.
Kinetica ends its report by burnishing its own credentials, stressing that its strong point is the relational foundation that supports vector processing in both SMP and MPP system designs and server configurations. SMP (Symmetric Multi-Processing) is a multiprocessor design in which the processors share the same resources, including the operating system (OS), memory and input/output devices, and are connected by a common bus. MPP (Massively Parallel Processing) is the coordinated processing of a single task by multiple processors, each of which has its own dedicated resources, including its own OS and memory. The processors communicate with one another via a messaging interface. In a “shared nothing” architecture there is no single point of contention across the system, and nodes share neither memory nor disk storage. Data is horizontally partitioned across nodes so that each node holds a subset of the rows from each table in the database. Each node then processes only the rows on its own disks. Such an architecture can reach and maintain massive scale because there is no shared bottleneck to slow down the system.
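The horizontal partitioning described above can be modelled roughly as follows. This is a sketch under assumptions the report does not state: rows are assigned to nodes by taking an integer key modulo the node count, and the table, the column names and the functions `partition_rows`, `local_totals` and `combine` are all hypothetical.

```python
from collections import defaultdict


def partition_rows(rows, node_count, key):
    """Horizontally partition rows so each node holds a disjoint subset."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        # Assumed placement rule: integer key modulo the number of nodes.
        nodes[row[key] % node_count].append(row)
    return nodes


def local_totals(node_rows):
    """Work done by one node, touching only the rows it holds."""
    totals = defaultdict(float)
    for row in node_rows:
        totals[row["customer_id"]] += row["amount"]
    return totals


def combine(partials):
    """Coordinator step: merge the per-node partial results."""
    merged = defaultdict(float)
    for partial in partials:
        for customer, amount in partial.items():
            merged[customer] += amount
    return dict(merged)


rows = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": 5.0},
    {"customer_id": 1, "amount": 7.5},
    {"customer_id": 3, "amount": 12.0},
    {"customer_id": 2, "amount": 1.5},
]
nodes = partition_rows(rows, node_count=3, key="customer_id")
totals = combine(local_totals(node_rows) for node_rows in nodes)
```

Because every row for a given customer lands on the same node, each node can aggregate its own rows without consulting any other, and only the small partial results need to travel over the messaging interface.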
Vector databases are in increasingly common use, but it is consumers who interact with them most often, and completely without knowing it, whenever they use online hyperscale platforms and services and are served advertising or recommended products. Now, as smaller companies become aware of what vectorisation can do, they are starting to demand access to smaller-scale but equally powerful systems and, in the vast majority of instances, will gain it via a managed service provided by a specialist data company. The expectation is that 2022 will be the year when that starts to happen. What it might mean for hyperscalers down the line we’ll have to wait and see, but change is coming.