Practical Optimization of a Bank's Cash Flow System: Accelerating Queries over 100 Million Rows

Imagine opening Alipay to check your bills from two years ago and waiting nearly two minutes for 56 records to appear. Would you suspect your phone was broken? This is no joke: it is the "waiting for a miracle" I genuinely experienced against 100 million rows of data. In this article, using a cash flow table as the test subject, I will show you how MySQL actually behaves at the 100-million-row scale, and share some hard-won lessons from the pitfalls along the way.

How easy is it to create 100 million rows of data?

I chose the cash flow table commonly used in financial systems for the experiment. Its eight columns cover the key information: user ID, amount, time, and so on. Creating the data was simple and direct: first generate 10,000 rows of template data in Excel, then paste them into an INSERT statement and execute it in a loop. Just like that, 100 million rows are within easy reach.
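The generation loop above can be sketched in code. This is a minimal, hypothetical version using Python's built-in sqlite3 as a stand-in for MySQL; the eight-column schema is an assumption modeled on the fields mentioned (user ID, amount, time), and the row count is scaled down so it runs quickly.

```python
import datetime
import random
import sqlite3

# In-memory SQLite stands in for MySQL; the column names here are
# illustrative assumptions, not the article's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cash_flow (
        id INTEGER PRIMARY KEY,
        user_id INTEGER,
        account_no TEXT,
        trade_type TEXT,
        amount REAL,
        balance REAL,
        trade_time TEXT,
        remark TEXT
    )
""")

def make_rows(n, start=datetime.datetime(2015, 1, 1)):
    """Yield n synthetic cash-flow rows spread over time."""
    for i in range(n):
        ts = start + datetime.timedelta(minutes=i)
        yield (random.randint(1, 1000), f"ACC{i % 500:05d}",
               random.choice(["IN", "OUT"]),
               round(random.uniform(1, 5000), 2),
               round(random.uniform(0, 100_000), 2),
               ts.isoformat(), "")

# executemany batches the inserts -- the same idea as looping a pasted
# multi-row INSERT, just scripted. Scale n up toward 100 million.
conn.executemany(
    "INSERT INTO cash_flow (user_id, account_no, trade_type, amount,"
    " balance, trade_time, remark) VALUES (?, ?, ?, ?, ?, ?, ?)",
    make_rows(10_000),
)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM cash_flow").fetchone()[0])  # 10000
```

Scaling the same loop to 100 million rows is only a matter of batch size and patience; the mechanics do not change.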

Curious how much space 100 million rows of 8 columns occupy? Running show variables like 'datadir' to locate the data directory, I found the table file was 7.8 GB. And when you run a bare select count(*), MySQL gamely churns through all 7.8 GB, and my 16 GB of RAM started to smoke.

The query nightmare without indexes

Time-range queries are especially common for cash flow data. I tried querying all of one user's transactions from 2015 to 2024: 56 results took 103.489 seconds. Picture yourself at the bank counter while the system grinds for almost two minutes before anything comes back, the teller watching you in awkward silence.

Querying a single user without a time range is hardly better: it also takes over a minute. Even more outrageous are combined-condition queries, such as "User A's transactions over 100 yuan in 2018", which stay stubbornly in the 1-to-2-minute range. Conclusion 1: without indexes, a 100-million-row table is a ticking time bomb.
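Why every such query takes minutes is visible in the query plan: with no index, the engine has no choice but a full table scan. A small sketch, again using SQLite as a stand-in (table and column names are assumed), shows the planner admitting this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No index at all -- just the bare table, as in the experiment above.
conn.execute(
    "CREATE TABLE cash_flow (user_id INTEGER, amount REAL, trade_time TEXT)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM cash_flow "
    "WHERE user_id = 42 AND trade_time BETWEEN '2015-01-01' AND '2024-12-31'"
).fetchall()

# The plan's detail column reports a SCAN: every row must be read,
# regardless of how few rows actually match.
print(plan[0][3])
```

MySQL's own EXPLAIN would report the equivalent (`type: ALL`, a full scan); the principle is identical.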

The miracle of indexing

Create a composite index on user ID, time, and amount, and the miracle really does happen. The same kind of query, "User A's transactions over a three-year span", now returns instantly. But don't celebrate too soon: only 56 rows came back this time. The query is fast because the result set is tiny.
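The effect of the composite index also shows up directly in the query plan. A minimal sketch (SQLite stand-in; the index name is my own invention) demonstrates the planner switching from a scan to an index search once the leading columns of the index match the WHERE clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cash_flow (user_id INTEGER, trade_time TEXT, amount REAL)")
# Composite index ordered user_id -> trade_time -> amount, mirroring the
# "index on user ID, time and amount" described above.
conn.execute(
    "CREATE INDEX idx_user_time_amount "
    "ON cash_flow (user_id, trade_time, amount)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM cash_flow "
    "WHERE user_id = 42 AND trade_time >= '2021-01-01'"
).fetchall()

# Now the plan reports a SEARCH using the index instead of a full SCAN.
print(plan[0][3])
```

The column order matters: equality on user_id first, then the range on trade_time, lets the B+ tree narrow the candidates before any table rows are touched.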

Now try a harder operation: compute each user's total transaction volume for a given day, pushing the result set from the original 56 rows to 12 million. The query promptly slowed to 18 seconds. Conclusion 4: indexes are not a cure-all. Once the result set exceeds 10 million rows, queries are still slow; to stay reliably under 1 second, the rows returned must be kept strictly under 1 million.

The deeper the paging, the slower it gets

Real business obviously needs paging. A statement like LIMIT 10,10 to fetch the second page is quite fast. But once you page to the 100,000th page, i.e. LIMIT 1000000,10, MySQL must scan past a million rows just to hand you the last 10, and the query slows sharply to over 10 seconds.

Isn't this the classic pain point of large tables? As time passes, users page ever deeper and the experience keeps degrading. At ten items per page, everything beyond page 100,000 is effectively unusable; who has the patience to wait more than ten seconds for a page turn?
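The standard remedy for deep OFFSET, not covered in the experiment above but worth noting, is keyset (seek) pagination: instead of telling the engine to skip N rows, remember the last id of the previous page and seek past it via the primary key. A runnable sketch comparing both (SQLite stand-in, small data set):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cash_flow (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO cash_flow (amount) VALUES (?)",
                 [(float(i),) for i in range(100_000)])

# Deep OFFSET: the engine still produces and discards the first 50,000 rows.
page_offset = conn.execute(
    "SELECT id FROM cash_flow ORDER BY id LIMIT 10 OFFSET 50000").fetchall()

# Keyset pagination: seek directly past the last id already shown --
# nothing is produced and thrown away, so page depth no longer matters.
last_seen_id = 50000
page_keyset = conn.execute(
    "SELECT id FROM cash_flow WHERE id > ? ORDER BY id LIMIT 10",
    (last_seen_id,)).fetchall()

print(page_offset == page_keyset)  # True -- same page, different cost
```

The trade-off is that keyset paging only supports "next page" navigation along an indexed sort key, which is usually acceptable for flow-style listings.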

Inserting data is also a trap

With the indexes built, surely inserts are problem-free? A batch-insert test on 10,000 rows showed the per-row insert time ballooning from the original 0.01 seconds to 0.5 seconds. And don't underestimate that number: at 1,000 concurrent insert requests, the database simply seized up!

Indexed tables insert slowly; that is conclusion six. Index maintenance is expensive because every insert must update each index's B+ tree, page splits included. This is exactly why the design has to separate an operation table from a history table: real-time transactions go into the unindexed operation table, and batches are poured into the indexed history table in the middle of the night.
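The operation-table/history-table split can be sketched as follows. This is a hypothetical miniature (SQLite stand-in; table names, batch size, and the migration function are my own assumptions): writes hit an unindexed table, and a batch job moves rows into the indexed table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Operation table: no secondary index, so real-time inserts stay cheap.
conn.execute("CREATE TABLE cash_flow_op (id INTEGER PRIMARY KEY,"
             " user_id INTEGER, amount REAL, trade_time TEXT)")
# History table: indexed for queries; written only by the nightly batch job.
conn.execute("CREATE TABLE cash_flow_hist (id INTEGER PRIMARY KEY,"
             " user_id INTEGER, amount REAL, trade_time TEXT)")
conn.execute("CREATE INDEX idx_hist_user_time"
             " ON cash_flow_hist (user_id, trade_time)")

# Simulated daytime traffic landing in the operation table.
conn.executemany(
    "INSERT INTO cash_flow_op (user_id, amount, trade_time) VALUES (?, ?, ?)",
    [(i % 10, 100.0, "2024-06-01") for i in range(1000)])

def migrate(batch_size=200):
    """The hypothetical 2 a.m. job: move rows in batches, then clear them."""
    while True:
        moved = conn.execute(
            "INSERT INTO cash_flow_hist (user_id, amount, trade_time) "
            "SELECT user_id, amount, trade_time FROM cash_flow_op "
            "ORDER BY id LIMIT ?", (batch_size,)).rowcount
        conn.execute(
            "DELETE FROM cash_flow_op WHERE id IN "
            "(SELECT id FROM cash_flow_op ORDER BY id LIMIT ?)", (batch_size,))
        conn.commit()
        if moved < batch_size:
            break

migrate()
```

Batching keeps each transaction, and therefore each burst of index maintenance, bounded, which is what makes the "finish the rebuild within 30 minutes" target in the final solution plausible.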

Sharding into sub-databases and sub-tables: a real-world faceplant

In theory this is perfect: split 100 million rows across 100 tables of 1 million rows each, and every query can be kept within 1 second. I started 100 threads and ran count aggregations against all 100 tables simultaneously. The result: the database connection pool jammed and timeout errors cascaded.

Pitfall 2: a single machine simply cannot sustain 100 concurrent connections. In a real environment the tables must be spread across many machines, or concurrency must be strictly capped through a connection pool. I eventually switched to serial queries, counting the 100 tables one by one, which took a full 100 seconds, worse than a full table scan. Sharding is no panacea; without a proper distributed architecture behind it, it is simply asking for trouble.
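The serial fallback described above is trivially simple, which is its whole point: one connection, one shard at a time, no pool exhaustion. A miniature sketch (SQLite stand-in; shard naming scheme and row counts are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
N_SHARDS = 100  # the 100 sub-tables from the experiment above

for i in range(N_SHARDS):
    conn.execute(
        f"CREATE TABLE cash_flow_{i:02d} (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        f"INSERT INTO cash_flow_{i:02d} (amount) VALUES (?)", [(1.0,)] * 50)

# Serial counting: a single connection suffices and the pool is never
# stressed -- the trade-off is that total latency is the sum of all shards.
total = sum(
    conn.execute(f"SELECT COUNT(*) FROM cash_flow_{i:02d}").fetchone()[0]
    for i in range(N_SHARDS))
print(total)  # 5000
```

A middle ground between 100 threads and pure serial is a small bounded worker pool (say, 4 to 8 connections), but as the experiment shows, the right ceiling depends entirely on what the machine behind the pool can actually absorb.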

Putting the optimization ideas into practice

The solution that finally shipped combined three strategies. First, physically separate the operation table from the history table: real-time transactions land in the operation table, which carries no indexes so writes stay fast, and at 2 a.m. a scheduled task migrates the data into the history table in batches, with index rebuilding kept under 30 minutes. Second, split tables by time and add an index mapping: for example, the 2024 data gets its own table, and a time-index table records each table's start and end dates. Third, put hard limits on queries: when a user queries their flow, only the last three months are shown by default, and anything earlier requires a second confirmation.
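The second strategy, time-based tables plus an index mapping, can be sketched as a routing layer: the mapping table tells you which per-year tables overlap the requested range, and only those are queried. This is a hypothetical miniature (SQLite stand-in; all names are my own):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Per-year shard tables plus a mapping table recording each table's range.
conn.execute(
    "CREATE TABLE shard_index (table_name TEXT, start_date TEXT, end_date TEXT)")
for year in (2023, 2024):
    conn.execute(f"CREATE TABLE cash_flow_{year}"
                 " (user_id INTEGER, amount REAL, trade_time TEXT)")
    conn.execute("INSERT INTO shard_index VALUES (?, ?, ?)",
                 (f"cash_flow_{year}", f"{year}-01-01", f"{year}-12-31"))

conn.execute("INSERT INTO cash_flow_2024 VALUES (1, 88.0, '2024-03-15')")

def query_range(user_id, start, end):
    """Route a time-range query only to the shards whose range overlaps it."""
    shards = conn.execute(
        "SELECT table_name FROM shard_index"
        " WHERE start_date <= ? AND end_date >= ?", (end, start)).fetchall()
    rows = []
    for (name,) in shards:
        rows += conn.execute(
            f"SELECT * FROM {name}"
            " WHERE user_id = ? AND trade_time BETWEEN ? AND ?",
            (user_id, start, end)).fetchall()
    return rows

# Only cash_flow_2024 is touched; cash_flow_2023 never enters the plan.
print(query_range(1, "2024-01-01", "2024-06-30"))
```

Pruning whole tables by date this way is what keeps each individual query inside the "under 1 million rows returned" budget established earlier.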

In the project you are working on right now, if a user insisted on querying one million transaction details from five years ago, how would you explain to the product manager why that "cannot be served directly"?