Troubled by slow SQL on Spark? You may not be using the right programming model. There are two mainstream approaches, API programming and pure SQL, and the performance gap between them can reach 10x. In this article, we use measured data to show you how to choose, and how the engine behind it achieves second-level response times.
The secret behind the three-layer architecture
This analysis engine, developed by Xinghuan Technology, adopts a three-layer architecture. The bottom layer is storage: a distributed in-memory column store that can be built on RAM or SSD, which makes data reads far faster than traditional disks. In actual tests, pairing PCI-E SSDs with column storage improved query performance by an order of magnitude.
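Why does a column store read so much faster? A minimal plain-Python sketch (our own toy illustration, not the engine's code) shows the intuition: when each column is stored contiguously, a query that touches one column reads only that column's values, while a row store must materialize every field of every record.

```python
# Toy illustration of row vs. column layout (not the engine's code).
rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
    {"id": 3, "name": "c", "amount": 30.0},
]

# Row store: every record keeps all of its fields together.
row_store = rows

# Column store: one contiguous array per column.
col_store = {k: [r[k] for r in rows] for k in rows[0]}

# Summing "amount" from the column store touches 3 values;
# the row store would have to walk through all 9 fields.
total = sum(col_store["amount"])
```

The gap widens with table width: for a 300-column table, a 3-column query reads 1% of the data in columnar form but 100% in row form.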
The compute engine sits in the middle layer, with the SQL interface layer on top. The engine can analyze data stored in HDFS, HBase, or a distributed cache, handling datasets from gigabytes to tens of terabytes. Even when the data far exceeds memory capacity, the system does not crash; an intelligent caching mechanism keeps it running efficiently. This has been verified in a real banking deployment.
Choosing between two programming models
The system provides two programming models, one based on SQL and one on the Spark API. The SQL model is fully compatible with the SQL-99 and SQL-2003 standards and also supports PL/SQL extensions, so traditional database workloads can be migrated at almost zero cost. A telecom operator ran its original Oracle stored procedures directly on this platform, and concurrency performance increased 30-fold.
The API model suits complex data-mining scenarios. The system ships with built-in parallel machine-learning and statistical algorithm libraries. Leveraging Spark's strength in iterative computing, it can handle analyses that are hard to express in traditional SQL. An e-commerce company used this model to predict user behavior, cutting model-training time from 10 hours to 40 minutes.
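What makes iterative workloads special is that the same dataset is scanned on every pass, which is exactly what an in-memory engine caches well. Below is a deliberately tiny plain-Python sketch of such a workload, gradient descent on a one-parameter model; it stands in for the platform's ML library, which is not shown here.

```python
# Toy iterative model fitting (illustrative only, not the platform's
# built-in library): gradient descent on y = w * x, true w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs

w = 0.0
lr = 0.05
for _ in range(200):
    # Each iteration re-reads the full dataset. On disk-based engines
    # that means 200 scans; an in-memory engine pays for one load.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad
# w converges toward 2.0
```

The 10-hours-to-40-minutes improvement cited above is consistent with this pattern: the win comes from caching the training set across iterations, not from a faster single pass.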
Compiler optimization with unique tricks
The R&D team built a complete SQL compiler, including a standard-SQL parser and a PL/SQL parser. The compiler parses SQL into an intermediate representation, then produces the final execution plan through a logical optimizer and a physical optimizer. What is special is that an entire stored procedure is compiled into one large DAG, enabling concurrent execution across stages.
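To see why compiling a whole procedure into one DAG helps, consider this conceptual sketch (all names are ours, not the engine's): statements that do not depend on each other, such as two table scans feeding a join, can be scheduled in parallel instead of statement by statement.

```python
# Conceptual DAG scheduler sketch (not the engine's implementation).
from concurrent.futures import ThreadPoolExecutor
import time

# stage -> set of stages it depends on
dag = {"scan_a": set(), "scan_b": set(),
       "join": {"scan_a", "scan_b"}, "agg": {"join"}}

def run_stage(name):
    time.sleep(0.01)  # stand-in for real stage work
    return name

def run_dag(dag):
    done, order = set(), []
    with ThreadPoolExecutor() as pool:
        while len(done) < len(dag):
            # every stage whose dependencies are all finished is "ready"
            ready = [s for s in dag if s not in done and dag[s] <= done]
            # ready stages (here: both scans) execute concurrently
            for finished in pool.map(run_stage, ready):
                done.add(finished)
                order.append(finished)
    return order

order = run_dag(dag)  # scans run in parallel, then join, then agg
```

A statement-at-a-time interpreter would run the two scans serially; the DAG view exposes the parallelism for free.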
This design avoids the per-statement startup overhead of executing SQL. In TPC-DS testing, a complex query that took 15 minutes on Hive finished in 45 seconds on this engine. The compiler automatically selects a parsing path based on the SQL type: PL/SQL stored procedures go through a dedicated channel, while standard queries produce a generic execution plan.
Multi-pronged performance optimization
When reading files, the system filters data as early as possible to reduce the amount of computation. For wide tables, it reads only the columns the SQL actually references, significantly cutting IO. An insurance company with a table of more than 300 columns saw queries speed up 8-fold after adopting this technique.
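The two ideas, column pruning and early filtering, compose naturally. Here is a minimal plain-Python sketch (for intuition only, with made-up data) of how a query engine would evaluate an aggregation over a wide columnar table:

```python
# Sketch of column pruning + early filtering (not the engine's code).
columns = {  # wide table stored column-by-column; 300+ cols in practice
    "region": ["east", "west", "east"],
    "amount": [100, 200, 300],
    # ... hundreds of other columns this query never touches ...
}

# SELECT SUM(amount) FROM t WHERE region = 'east'
# 1. Prune: only "region" and "amount" are ever read from storage.
# 2. Filter early: build the row mask before any aggregation work.
mask = [r == "east" for r in columns["region"]]
result = sum(a for a, keep in zip(columns["amount"], mask) if keep)
```

With 300 columns and a 2-column query, pruning alone eliminates over 99% of the bytes scanned, which is where the reported 8x speedup plausibly comes from.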
Concurrency control is quite intelligent: the system analyzes RDD properties and dynamically adjusts the degree of parallelism, avoiding the scheduling overhead caused by too many threads. In actual tests, this optimization brought a performance improvement of 2 to 3 times. During the Double Eleven shopping festival, it helped an e-commerce platform absorb tens of thousands of real-time query requests per second.
Storage-compute engine support
Xinghuan developed a storage-compute engine based on memory or SSD with the goal of further reducing latency. By building tables directly in memory, it achieves fully in-memory computing. In a comparison with DB2, two complex queries that took DB2 more than an hour finished in minutes, or even seconds, on this system.
The storage format also matters a great deal. Testing shows that on PCI-E SSD, column storage runs 1.5 times faster than text format. In TPC-DS's IO-intensive queries, whether the data resides in memory or on SSD, performance is an order of magnitude faster than traditional disk-based tables. This is especially significant for real-time reporting scenarios.
Stability proven in real deployments
The system includes multiple protection mechanisms against error conditions, and a cost-based optimizer selects the most appropriate execution plan. Memory management is also efficient: when it encounters a memory overflow, it falls back to disk instead of failing. These mechanisms have been validated in business cases and have run stably on a provincial government platform for more than two years.
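The fall-back-to-disk behavior is usually implemented as a spillable buffer: accumulate in memory until a budget is exceeded, then flush the batch to a temporary file and merge everything back at read time. The class below is our own toy sketch of that pattern, not the engine's implementation.

```python
# Toy spill-to-disk buffer (illustrative, not the engine's code).
import pickle
import tempfile

class SpillableBuffer:
    def __init__(self, budget_items=1000):
        self.budget = budget_items
        self.mem = []      # in-memory batch
        self.spills = []   # temp files holding flushed batches

    def add(self, item):
        self.mem.append(item)
        if len(self.mem) >= self.budget:      # memory pressure reached
            f = tempfile.TemporaryFile()
            pickle.dump(self.mem, f)          # spill the batch to disk
            self.spills.append(f)
            self.mem = []                     # free the memory

    def __iter__(self):                       # read back: disk, then RAM
        for f in self.spills:
            f.seek(0)
            yield from pickle.load(f)
        yield from self.mem

buf = SpillableBuffer(budget_items=3)
for i in range(7):
    buf.add(i)
# two batches spilled to disk, one item still held in memory
```

The trade is graceful degradation: a query over-budget on memory gets slower instead of dying with an out-of-memory error, which is what "will not crash" means in practice.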
Across the 19 TPC-DS SQL tests, the engine ran correctly and, on the same hardware configuration, was 10 to 100 times faster than open-source Hive. After a securities company used it to replace its original data warehouse, queries not only got faster but operations and maintenance costs also fell by 70%.
Which SQL-on-Hadoop engine are you using today? What performance bottlenecks have you hit along the way? You are welcome to share your own hands-on experience in the comments. If you found this useful, give it a like so more people can see it!
