It might not look like it, but Oracle is still in the high-end server business, at least when it comes to big machines running its relational database of the same name. In fact, the company has launched a new generation of Exadata database servers, and the architecture of these machines shows what is important, and what is not, for running a clustered database well. At least one based on the Oracle software stack.
The Exadata V1 database appliances officially launched in September 2008, but Oracle had been shipping them to select customers for a year by then, at a time when the Great Recession was hitting and large companies were spending fortunes on big NUMA servers and storage area networks (SANs) with many Fibre Channel switches to connect compute to storage. Looking for a less expensive way to do this, Oracle worked with Hewlett Packard’s ProLiant server division to create a cluster of off-the-shelf X86 servers and flash-accelerated storage engines, linked by an InfiniBand interconnect that used low-latency Remote Direct Memory Access (RDMA) to tightly couple the nodes running the database and the storage. Links to applications and users ran over Ethernet network interfaces. In a way, Oracle was using InfiniBand as a backplane, which is why it took a stake in Mellanox Technologies back then.
After this experience, and after learning that IBM was about to acquire Unix system maker Sun Microsystems for $6.85 billion, Oracle co-founder Larry Ellison made his own bid for Sun, and in January 2010 the deal was closed. By September 2009, Ellison was so certain the deal would get regulatory approval that the HP iron was pulled from the Exadata line and replaced by Sun X86 machines running Oracle’s Linux variant, not Sun Sparc machines running Solaris Unix. These were the second generation Exadata V2 machines, followed by the Exadata X2 and so on. By 2016, when The Next Platform had been publishing for its first year, Oracle was already on the seventh generation, the Exadata X6, having turned the cranks on compute, storage, and networking each time.
As you can see from the table above, the disk and flash storage capacity, the number of CPU cores, the memory capacity of the database nodes, and the Ethernet bandwidth in the Exadata clusters grew steadily over the first decade of products. By the time the Exadata X7-2 and X7-8 systems were introduced in October 2017, Oracle had thousands of customers in all sorts of industries who had pulled out their big NUMA machines running the Oracle database (the dominant driver of Unix machines three decades ago, two decades ago, a decade ago, and now) and replaced them with Exadata iron.
In each Exadata generation, the models with the “2” designation have relatively thin main memory and no local flash on the database servers, and the models with the “8” designation have eight times the main memory (terabytes instead of hundreds of gigabytes) per node and eight Xeon processor sockets instead of two. Starting with the Exadata X8-2 and X8-8 generation in June 2019, Oracle switched from InfiniBand to 100 Gb/sec Ethernet with RoCE extensions for RDMA to couple the nodes in the cluster to one another, with four 10 Gb/sec or two 25 Gb/sec Ethernet ports per database node to communicate with the outside world.
With the X8 generation, two versions of the Exadata storage server came to market: a High Capacity (HC) version, which mixes flash cards and disk drives, and an Extreme Flash (EF) version, which has twice as many PCI-Express flash cards but no disk drives (offering maximum throughput but much lower capacity). Oracle also began using machine learning to automatically tune the clustered database, which is exactly the kind of job AI is good at and humans are less good at.
This bit of history brings us to Oracle’s tenth generation of Exadata: the X9M-2 and X9M-8 systems announced last week, which offer unprecedented scale for running clustered relational databases.
The X9M-2 database server has a pair of 32-core “Ice Lake” Xeon SP processors (64 cores in total) that run at 2.6 GHz, and it comes with a base 512 GB of main memory, which can be upgraded to 2 TB in 512 GB increments. The X9M-2 database server has a pair of 3.84 TB NVM-Express flash drives, and another pair can be added. As before, the two-socket database node can have four 10 Gb/sec ports or two 25 Gb/sec plain vanilla Ethernet ports for links to applications and users, and it has a pair of 100 Gb/sec RoCE ports for linking to the database and storage server fabric.
The X9M-8 database node is intended for larger databases that need more cores and more main memory in order to burn through more transactions or to burn through transactions faster. It has two four-socket motherboards connected by UltraPath Interconnect NUMA fabrics to create an eight-socket shared memory system. (This is all based on Intel chipsets and has nothing to do with Sun technology.) The X9M-8 database server has eight 24-core “Cascade Lake” Xeon SP 8268 processors running at 2.9 GHz, which works out to 192 cores and about 3.4 times the throughput of the 64-core X9M-2 database node. The main memory in the fat Exadata X9M-8 database node starts at 3 TB and scales up to 6 TB. This database server has a pair of 6.4 TB NVM-Express cards that plug into PCI-Express 4.0 slots, giving it plenty of bandwidth, and it has the same networking options as the skinny Exadata X9M-2 database server.
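As a rough sanity check on that comparison, here is a minimal sketch that multiplies out sockets, cores, and clock speed for the two database nodes. It assumes throughput scales with aggregate core-GHz, which is our simplification and ignores NUMA overhead, cache sizes, and per-core architectural differences; the function name `core_ghz` is made up for illustration.

```python
# Back-of-envelope comparison of the two database nodes.
# Assumption (ours): throughput scales with cores x clock speed.

def core_ghz(sockets: int, cores_per_socket: int, clock_ghz: float) -> float:
    """Aggregate core-GHz for one server."""
    return sockets * cores_per_socket * clock_ghz

x9m_2 = core_ghz(sockets=2, cores_per_socket=32, clock_ghz=2.6)
x9m_8 = core_ghz(sockets=8, cores_per_socket=24, clock_ghz=2.9)

core_ratio = (8 * 24) / (2 * 32)   # ratio of raw core counts
core_ghz_ratio = x9m_8 / x9m_2     # ratio of aggregate core-GHz

print(f"core ratio:     {core_ratio:.2f}")
print(f"core-GHz ratio: {core_ghz_ratio:.2f}")
```

On cores alone the fat node is 3X the skinny one, and factoring in clock speed lifts that to roughly 3.35X, which is in the neighborhood of the figure quoted above. How that figure was actually derived is not stated, so treat this as one plausible reading.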
The HC hybrid disk/flash and EF all-flash storage servers are based on a two-socket server node that uses a pair of Ice Lake Xeon SP 8352Y processors with 16 cores each running at 2.2 GHz. The HC node has 256 GB of DDR4 DRAM, which is extended with 1.5 TB of Optane 200 series persistent memory configured as a read and write cache for the main memory. The HC chassis holds a dozen 18 TB, 7,200 RPM disk drives and four of the 6.4 TB NVM-Express flash cards. The EF chassis has the same DDR4 and PMEM memory configuration, but it has no disk drives at all and eight of the 6.4 TB NVM-Express flash cards. Both storage server types have a pair of 100 Gb/sec ports to connect to each other and to the database servers in the fabric.
The first thing to note is that although 200 Gb/sec and 400 Gb/sec Ethernet (even with RoCE support) are available in the market and certainly affordable (well, compared to Oracle software prices, certainly), Oracle is sticking with 100 Gb/sec switching for the Exadata backplane. We would not be surprised if the company used cable splitters on 200 Gb/sec switches to take a tier out of the switch fabric, and if we were building a large Exadata cluster ourselves, we would consider using a higher radix switch and buying a whole lot fewer switches to connect the database and storage servers together. A jump to 400 Gb/sec switching would provide even more radix, fewer hops between devices, and fewer devices in the fabric.
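To make the radix argument concrete, here is a minimal sketch that counts switches in a non-blocking two-tier leaf/spine fabric. Everything in it is an assumption for illustration: the base 32-port switch, the idea that a 200 Gb/sec or 400 Gb/sec port splits cleanly into two or four 100 Gb/sec ports, and the endpoint count (a dozen racks, each with eight database servers and 14 storage servers, two fabric ports apiece). It is not a description of Oracle's actual switch topology.

```python
import math

def max_endpoints_two_tier(radix: int) -> int:
    """Most endpoint ports a non-blocking two-tier leaf/spine fabric can
    carry: up to `radix` leaves, each with half its ports facing down."""
    return radix * (radix // 2)

def two_tier_switch_count(endpoints: int, radix: int) -> int:
    """Leaf plus spine switches needed for `endpoints` ports."""
    if endpoints <= radix:
        return 1  # everything fits on a single switch
    if endpoints > max_endpoints_two_tier(radix):
        raise ValueError("needs a third switching tier")
    down = radix // 2                           # half the leaf ports face servers
    leaves = math.ceil(endpoints / down)
    spines = math.ceil(leaves * down / radix)   # spine ports absorb the uplinks
    return leaves + spines

# Hypothetical fabric: 12 racks x (8 database + 14 storage servers) x 2 ports.
endpoint_ports = 12 * (8 + 14) * 2

# Effective 100 Gb/sec radix of a hypothetical 32-port switch:
# native 100G, 200G split two ways, 400G split four ways.
for label, radix in [("100G native", 32), ("200G split x2", 64), ("400G split x4", 128)]:
    try:
        print(label, two_tier_switch_count(endpoint_ports, radix), "switches")
    except ValueError as err:
        print(label, "->", err)
```

Under these assumptions the 528 endpoint ports do not fit in two tiers of radix-32 switches at all, while the effective radix of 64 needs 26 switches and the radix of 128 needs 14. Real designs would also account for redundancy and oversubscription, which this sketch leaves out.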
Let’s talk about scaling for a second. Oracle RAC is based on technology that Compaq licensed to Oracle, but which was developed for Digital’s VAX hardware and its VMS operating system. The VAXcluster and TruCluster clustering software was very good at clustering databases and HPC applications, and Digital’s Rdb database had database clustering working well long before Oracle did. It is debatable whether you can call Oracle Parallel Server, which preceded RAC, a good implementation of a clustered database. It may have worked, but it was a pain in the neck to deal with.
The Exadata machine offers both vertical scaling (handling ever larger databases) and horizontal scaling (handling a growing number of users or transactions). The eight-socket server provides the vertical scaling and RAC provides the horizontal scaling. As far as we know, RAC ran out of gas at eight nodes when trying to implement a shared database, but the modern versions of RAC, including RAC 19c, which launched in January 2020, use a shared-nothing approach across the database nodes and use shared memory to parallelize processing across datasets. (There is a very good white paper on RAC that you can read here.) The point is, Oracle has worked very hard to combine function shipping (sending SQL statements to remote storage servers) to speed up analytics with a mix of data shipping and distributed data caching to speed up transaction processing and batch jobs (which the company has mastered), all in the same database management system.
An Exadata rack has 14 of the storage servers, with a usable capacity of 3 PB of disk, 358 TB of flash, and 21 TB of Optane PMEM for HC storage, or 717 TB of flash and 21 TB of Optane PMEM for EF storage. The rack can have two of the eight-socket database servers (384 cores) or eight of the two-socket servers (512 cores) for database compute. Of course, if you take out some of the storage servers, you can add more compute to any Exadata rack. Using Oracle’s existing switches, up to a dozen racks can be linked into the RoCE Ethernet fabric, and even larger configurations can be built with additional switching tiers.
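The rack-level figures can be reproduced from the per-server configurations given earlier. The sketch below is a simple multiplication, assuming the rack totals are straight products of the per-server device counts and capacities, before any redundancy or formatting overhead; the dictionary keys and the `rack_totals` helper are made up for illustration.

```python
# Per-server capacities taken from the storage server descriptions above.
STORAGE_SERVERS_PER_RACK = 14

hc_server = {"disk_tb": 12 * 18, "flash_tb": 4 * 6.4, "pmem_tb": 1.5}
ef_server = {"disk_tb": 0, "flash_tb": 8 * 6.4, "pmem_tb": 1.5}

def rack_totals(per_server: dict, servers: int = STORAGE_SERVERS_PER_RACK) -> dict:
    """Multiply each per-server capacity by the number of servers in the rack."""
    return {name: round(tb * servers, 1) for name, tb in per_server.items()}

hc_rack = rack_totals(hc_server)
ef_rack = rack_totals(ef_server)

# Core counts for the two database server options in a rack.
cores_two_x9m_8 = 2 * 8 * 24    # two eight-socket servers
cores_eight_x9m_2 = 8 * 2 * 32  # eight two-socket servers

print("HC rack:", hc_rack)
print("EF rack:", ef_rack)
print("cores:", cores_two_x9m_8, "or", cores_eight_x9m_2)
```

The products land on 3,024 TB of disk, 358.4 TB of flash, and 21 TB of PMEM for an HC rack, and 716.8 TB of flash for an EF rack, which line up with the rounded figures in the text, as do the 384 and 512 core counts.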
In terms of performance, a single rack with Exadata X9M-8 database nodes and hybrid disk/flash HC storage can handle 15 million random 8K read and 6.75 million random flash write I/O operations per second (IOPS). Switching to EF storage, which is well suited to data analytics, a single rack can scan 75 GB/sec per server, for a total of 1 TB/sec across a single rack with three of the eight-socket database servers and eleven of the EF storage servers.
Finally, Oracle is still the only high-end server maker that publishes a price list for its systems, and it does so for each generation of Exadata machines. A half rack of the Exadata X9M-2 (with four of the two-socket database servers and seven storage servers) using the HC hybrid disk/flash storage costs $935,000, and a half rack using the EF all-flash storage costs the same. That works out to $1.87 million per rack.