2021年3月25日凌晨0:00,我参加了PrestoCon2021全球技术峰会,在会上直播分享了阿里云SLS使用Presto构建交互式分析产品的挑战和优化。本文是演讲稿全文。PrestoCon2021日程:https://prestoconday2021.sched.com/?iframe=no,直播视频已经上传到B站https://www.bilibili.com/video/BV1hK4y1m7qe/,B站还包括其他演讲视频。
hello everyone my name is yunlei. I work at Alibaba Cloud log service team. Today I will share petabytes scale log data analysis at Alibaba group, I will share our experience with presto, the challenges we faced and the optimizations we did for presto.
So today, I will first give an introduction about our team and talk about the motivation why we provide analysis service for logging data. And then I will talk about the challenges we face. And after that I will share our architecture to achieve scalability. And also share some ways to achieve low latency and high qps or high concurrency. And after that I will talk about some additional work based on the scalable framework of presto. And finally, I will talk about the future roadmap.
First let me introduce my team, this may help you have a better understanding of the background, and the motivation of my we provide analysis service for logging data. And we are log service team, we are providing the infrastructure of all kinds of logging data inside Alibaba group., for example linux server, web page, mobile phone, or kubernetes.. and all these data from different kinds of platforms are collected to our clusters in real time. And then, consumers can consume streaming data in real time, without worrying about different platforms.
And after that, we wanted to provide service just like what engineers can do on their linux servers. For example, we found that engineers often use the command grep to search some key word from their log file. So, in order to provide the real time interactive search, we build inverted index for logging data. By using inverted index, people can search large amount data with very low latency.
And after that we notices that in this era of data explosion, even the searching result may contain thousands to millions, even to billions of rows. That is a very large number, even thousands of data is large amount of data for human being’s eye, it is not easy to fully understand what is really appending inside the large amount data. But If, we can summarize the data, do aggregation on data, take web access log as an example, if we can calculate the average response time or max response time, we can better understand the whole data. That is the motivation my we provide analysis service for logging data. And we also have a visualization service, to visualize the interactive query result, which is helpful for understanding the data trend.
So, in summary, our team provide multi-tenant, interactive analysis for petabytes scale real time logging data.
What are the challenges we faced? The first challenge is the high QPS and high concurrency. Every day, we process about 400 millions queries, at the peak hour there are about 10000 queries running concurrently. Second, every day presto process over 1 quadrillion rows. Which is huge amount of data. But we can deliver the interactive query with very low latency. The average latency is only about 500 ms. So, how do we deal with those challenges, and how to achieve that low latency? In the latter slides, I will talk about some techniques we used.
In order to provide the interactive log analysis. The first thing we did is introducing presto into our system. Many people ask me the question that, why did we choose presto in the first day. Well, I think the first and most import reason is that, presto is really really fast, very flexible for interactive query. It suits our use case. Second reason, presto has a scalable framework. We can develop our own connector/function/optimization rule which makes it very easy to integrate presto with our own storage. Last, the architecture and coding style Is very elegant. The architecture is simple, it doesn’t rely on any external components for deployment. The coding style is also very simple, it is easy for us to learn its internal implementation, also easy for us to do some modification for optimization. That is the reason why we choose presto in the first day.
In order to deal with the large amount of data ingested every day. We design our architecture like this. This is simplified architecture, only analysis related components are included. Data from different kinds of platforms are collected to our cluster, in the backend, the index worker keeps building inverted index and column format data for logging data. And then all the data are stored in the distributed file system, which is called pangu. This is the flow of data ingestion. When submitting query, users can send their query by sdk or jdbc client through our front api server, to one coordinator. We support distributed coordinators, and I will talk more detail about the design. In the backend, presto server read inverted index for filtering and column data for computing. This is the flow of querying. And presto play a key role in this architecture.
How to support querying large amount of data? Every day we process about 1quadrillinon rows. Fortunately, presto is a pure computing engine, so it is easy to decouple computing and storage separately. So that we can scale presto horizontally. Also, log data has a feature that it is immutable after ingested. And in the backend, we keep compacting small file into larger file, when the file is large enough, the file will become immutable. One immutable file contains tens of millions rows.so, each time we schedule a file, we schedule each immutable file to one presto server. And by this way, we can use a lot of presto server for one query. And thus, we can query large amount of data in one query. And the performance is also significant. If we run a simple group by query on 200 billions rows, it only takes about 20 seconds. That is a large amount of data and the latency is acceptable.
So, how to achieve low latency? There are a few techniques we used. The first technique we used is column format storage. In the backend, we keep building column format data. We all know that column storage is helpful for reducing data size to read from disk, and also helpful for vectorized execution, thus it will makes query faster.
Another technique we used is data locality. We all know that, network speed inside cluster is much faster than network between clusters. So, we deploy storage and computing engine in same shared cluster. When scheduling task, we choose a free node by the order of machine, rack, cluster.
There are some other techniques we used, for example cache, inverted index. I will talk more detail about that.
We all know that cache can make query faster, it can also help us removing some duplicated computation. And in order to take advantage of cache, the scheduling algorithm has to remember scheduling history in memory. Every time we schedule a file, we schedule each immutable file to the same node it has been scheduled before, unless the workload of specified node is much too high. In that case, we will choose another free node.
So there are three layers of cache, from the bottom up, there are raw data cache, intermediate result cache, and final result cache. The raw data cache and final result cache are common solutions. But the intermediate cache is a very rare solution. So, I’d like to talk more detail about the intermediate cache. As I mentioned earlier, file is immutable. So, every time we schedule a file, we schedule each mapper with exactly one immutable file. Then after finishing the partial aggregation operator, we store the result of partial aggregation operator in memory. And next time when we run exactly the same partial aggregation operator on exactly the same immutable file, we can just read the result from the memory, and send the intermediate result to final aggregation operator directly, without reading data from disk and recomputing the partial aggregation operator. We all know that partial aggregation operator deals with large amount of data, and only generate small amount of data. So, most of the computation happens in the partial aggregation operator. By using intermediate cache, we can achieve a faster query and also save a lot of cpu and IO resource.
What about the performance of cache? Every day, there are more than about 100 millions query hitting final result cache. The average latency when hitting final result cache is only about 6 milliseconds. So, it is really fast. What’s more, final result cache can help us prevent a lot of duplicated computations and save us a lot of resource. What about the performance of intermediate cache? For 1 billion rows. It only takes about 1.3 second when hitting intermediate cache, but it will take 6 seconds when not hitting intermediate cache. So, the comparation of two latency is impressive.
Another technique we used is inverted index. In the first day, we use inverted index for searching data. After introducing presto into our system. I push down predicate from presto to storage system, and then use the inverted index to first calculate the matched row id, and use the matched row id to read matched rows, and then send only matched data to presto. This strategy is very effective if we can use inverted index to skip some file or only read small part of file. If you run a query select count(*), that will be extremely fast by using only index!.
Every day we are processing more than 400 million queries. And the main challenge is on coordinator. This picture was taken about 2 years ago. And at that time there was only one single coordinator. This picture shows the top 4 nodes with highest CPU usage of presto, the fist picture is coordinator. You may notice that the CPU usage of coordinator is much higher than normal workers. In our cluster, one single coordinator can process 1000 queries concurrently at most. So, the single coordinator is the bottleneck of the cluster. It has the problem of both scalability and availability. In order to improve the performance of coordinator, I did a few optimizations. The first optimization is supporting distributed scalable coordinators. The second optimization is transferring data from output stage to client directly, without worrying coordinator.
Let us look at the detailed design of distributed coordinators. I design the distributed coordinators for scalability. And there are two kinds of roles in this design: the global coordinator, and the distributed coordinators. The global coordinator has only one instance. It is responsible for cluster management. For example, detect node state, replicate all the node state to other coordinators, track memory usage, assign largest query to reserved pool. The distributed coordinators are responsible for query management. For example, parse query, create logical plan, optimize logical plan, schedule task, track task state. It is also responsible for a user queue. Every time submitting a query, user send his query to one coordinator based on the hashing of user name, in this way, all the queries belonging to one user can always be found in one coordinator. This design is not 100 percent perfect, because the global queue is not supported. But it is enough for our scenario. It helps us to scale coordinators horizontally for much higher concurrency.
Another thing we did is optimizing the data transfer flow. In the original design, the coordinator transfers query result from output stage to client. So, the coordinator has to serialize query result into json format. We all know that json serialization is very slow. What if coordinator transfers large amount of data? For example, 1 gigabytes data. That will take a lot of CPU usage of coordinator. In order to optimize the performance of coordinator, here is my solution. The coordinator will tell the client all the addresses of the output stage. Because there may more than one node in the output stage. And then client will use the addresses to talk to output stage to directly fetch data with protobuf format. By this way we can make query faster and save a lot of cpu usage of coordinator.
Here is the final performance. We can process 400 million queries every day by supporting distributed coordinator and optimizing data transfer. And by decoupling computing and storage, we can support processing over 1 quadrillion rows every day. By taking advantage of column storage, data locality, cache, inverted index, we can provide interactive query with very low latency, the average latency is only about 500 milliseconds.
Besides those optimizations, we also did a few additional works based on the scalable framework of presto. The first thing we did is machine learning library for time series data. We also noticed that only logging data is not enough for analysis. Sometimes, we need to join logging data with some external data. So, I developed a connector to read object file on OSS storage. We also use presto as a federated SQL engine.
In the future, there are still a lot of things to do. The biggest challenge is still on coordinator. We will keep optimizing coordinator. Another thing we want to do is increase the availability of discovery, by now, discovery has only one instance. We also want to improve the performance of data exchange by using rpc protocol, which will make it faster to shuffling large amount of data.
If you are interested in data analysis engine or OLAP engine, don’t hesitate connect me, our team are hiring OLAP engineers, let us work together to make data easy to understand!