Wednesday, April 2, 2014

Query plan risk estimation -- 'Query Fudge Factor'

Hi all,

Recent discussion on PG-Hackers has led me to write this blog post. Query plan risk estimation treads down a familiar path: improving the estimation done by the query planner.

The idea started with a discussion on planner hints and why PostgreSQL doesn't have them, or shouldn't have them. The crux is that the user can't be allowed to hand the query planner a plan and force its execution; yet the user knows more about the data than we do, so there may be patterns in the data, leading to good or bad plans, that we aren't aware of.

Let's take an example. One of the most common problems we see is correlation between multiple attributes in a table. For many use cases the data is real-world data (insurance data, for example), where columns are rarely independent. The PostgreSQL query planner does not take cross-column statistics into consideration, and hence there may be patterns we are missing.
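To make the cross-column problem concrete, here is a small Python sketch (the column names and data are entirely hypothetical) showing how an independence-based estimate goes wrong when two columns are perfectly correlated:

```python
import random

random.seed(42)

# Hypothetical, fully correlated columns: every "Pune" row is also "MH".
rows = [("Pune", "MH") if random.random() < 0.1 else ("Delhi", "DL")
        for _ in range(10_000)]

sel_city  = sum(1 for c, s in rows if c == "Pune") / len(rows)
sel_state = sum(1 for c, s in rows if s == "MH") / len(rows)
sel_both  = sum(1 for c, s in rows if c == "Pune" and s == "MH") / len(rows)

# A planner that assumes independence multiplies the two selectivities.
independent_estimate = sel_city * sel_state

print(f"actual:    {sel_both:.4f}")              # roughly 0.1
print(f"estimated: {independent_estimate:.4f}")  # roughly 0.01, far too low
```

Because the two predicates select exactly the same rows, the true selectivity equals either marginal alone, while the multiplied estimate is about ten times smaller.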

Is allowing the user to dictate a plan, or even part of a plan, and having the planner follow only that plan really a good idea? Maybe not, but we could accept some hinting from the user about the selectivity it expects for joining attributes. We need not have the hints that Oracle has; we can have something much cleaner and easier to use, though. That will need more thinking and planning than what we currently know.

There is one thing that can be done which could prove helpful, though. There is a source of planner problems that arises for a reason I shall explain below:

Suppose a table is created and some data is inserted. The query planner maintains statistics about the distribution of values in the table's attributes. The statistics come in two forms: MCVs (Most Common Values), i.e. the most frequent values in the attribute, whose exact frequencies are stored, and a histogram of the remaining values.
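As a toy illustration of how these stats get used (a simplified sketch, not PostgreSQL's actual code; all the numbers here are made up), an equality-selectivity lookup might look like this:

```python
# Hypothetical stats for one attribute, as ANALYZE might record them.
mcv_values = ["a", "b", "c"]        # most common values
mcv_freqs  = [0.30, 0.20, 0.10]     # their exact observed frequencies
n_distinct = 53                     # distinct values in the column

def eq_selectivity(value):
    # MCVs carry exact frequencies, so use them directly when possible.
    if value in mcv_values:
        return mcv_freqs[mcv_values.index(value)]
    # Otherwise, spread the leftover probability mass evenly over the
    # remaining (non-MCV) distinct values.
    rest = 1.0 - sum(mcv_freqs)
    return rest / (n_distinct - len(mcv_values))

print(eq_selectivity("a"))    # exact MCV frequency
print(eq_selectivity("zz"))   # estimated from the leftover mass
```

The point of the MCV list is precisely that the most frequent values get exact answers, while everything else falls back to an averaged estimate.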

These planner statistics are updated by running the ANALYZE command (autovacuum runs it periodically as well).

Now suppose a lot of data is suddenly inserted into, or deleted from, the table. If the autovacuum run is still pending, the distribution of the actual data has changed a lot while the stats known to the query planner are outdated, so selectivity estimates based on those stats will be wrong. Very wrong.

This is not the only case. For example, as mentioned by Tom Lane, LIMIT assumes the matching rows are spread evenly through the table, which can be quite incorrect.

A recent discussion proposed introducing query plan risk estimation: adding a new factor to query planning that represents the amount of confidence we have in the plan or selectivity estimate the planner has just produced.

One way is to use a statistical method to predict how far the current actual histogram of an attribute has drifted from the stats known to the query planner. The greater the distance, the lower the confidence in the plan. The Central Limit Theorem can be used to bound the error of such a sampled estimate.
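As a rough sketch of the drift idea (using a simple total variation distance between frequency distributions rather than a CLT-based bound, and entirely synthetic data):

```python
import random
from collections import Counter

random.seed(0)

def distribution(values):
    # Normalized value frequencies, akin to a one-column histogram.
    total = len(values)
    return {v: c / total for v, c in Counter(values).items()}

def total_variation(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Stats as of the last ANALYZE: roughly uniform over 1..10.
stored = distribution([random.randint(1, 10) for _ in range(5000)])
# A fresh sample taken after a burst of inserts skewed the column toward 1.
fresh = distribution([1] * 3000 + [random.randint(1, 10) for _ in range(2000)])

risk = total_variation(stored, fresh)   # 0 = identical, 1 = disjoint
confidence = 1.0 - risk
print(f"risk {risk:.2f}, confidence {confidence:.2f}")
```

A planner could sample a small number of pages at plan time, compute such a distance against the stored stats, and discount plans whose cost rests on low-confidence estimates.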

A method I proposed is:

I haven't researched it in depth yet, but plan to soon. I would love to get some feedback on this, though.

I hope we have something along the lines of a query risk factor soon, so that our query plans perform better.



Monday, March 31, 2014

Life, the universe and everything

Hi all,

I have been meaning to write this blog post for quite a long time now. I was recently asked to write up something on placement techniques and how to prepare for placements. I decided to write this NOT as a placement-help post, but as the post I have been wanting to write for some time.

Please note that this post isn't directly the placement help you may expect, but it will help you get a job if you make an effort and follow your heart. Also, the post requires patience to read and follow. Please bear with me.

The title of the post refers to a famous programming problem of the same name. It also captures the spirit of this post, and a major part of what my world is made of.

I shall start with a small story. Circa 2010, a nerdy boy was sweating it out because he was bad at programming. He couldn't do a problem to print a star pattern, or anything simple. Hell, he couldn't even figure out simple problems that involved brute force. His parents are quite accomplished programmers (two startups under their belts, founded 14 and 17 years earlier). As a first-year student of B.Tech Information Technology, he had no idea where he was heading.

His mom sat down with him and taught him some basics. She showed him how to think and approach problems. The boy started with the star pattern, then developed an interest. He tried solving more problems, gained more interest, solved more problems, learnt more, tried more and went along.

Fast forward to today: the boy is writing to you through this post. I am no big star, but there is one accomplishment I am proud of. I am an obsessed learner now, and my hunger grows every day.

What drives me to work in office, come back and work in PostgreSQL? What drives me to take books and read them whenever I can? What drives me to discover, explore, invent, have fun?

The feeling of love. The feeling of joy. The feeling of ownership. The sense of accomplishment when building something. The feeling of constant learning.

This is what I want to share today. I know it's a pretty clichéd thing to say.

There is learning where there is love. Do not aspire for a job where you are not in the domain which you love.

If you have read till here, I think you love what you do and believe this is where you should be.

The next thing is what to do when you know you love this. Technical advice:

1) Read as much as you can. Read Cormen, The Algorithm Design Manual, Tardos, Sedgewick, Sahni, Trembley and Sorenson. Pick out specific topics from each and read.

2) Practise. Solve any problem that comes your way. Think of different approaches for each problem rather than settling on one solution, and try to optimize what you have. To optimize, identify the areas where your solution is slow: break the problem and solution down into tangible components, try to optimize each, and look at how the components interact with each other.

3) Think about the problem. Do NOT start solving at the first step. Think, and re-read the problem. List the input and output, and then see if there is anything you can spot: a pattern, or anything else.

In the end, it's all about what you feel is the right way for your brain to engulf the problem and produce a solution that is workable and good. The level of goodness depends on your gut.

Above all, it all boils down to one thing. It is to have fun. You have to have fun in order to do anything sane and last long enough to make an impact in any field.

If you feel that the work you are doing completes you, if you feel what you are striving for is worth it, and if you feel you can stick to it, you shall find the way which is correct for yourself.

Be open, have an open mind, debate, and take a stand for what you think is correct. Listen to your instinct; it knows the solution to the problem you are trying to solve.

For more information on something relevant read the following:

This is the best post about our trade I have ever read. I hope you find it useful.



Friday, November 15, 2013

Multi variate queries and you

Hi all,

A long time since my last post. I apologize; I have no explanation, I was busy with other things.

Today we will be discussing queries that involve multiple columns and their performance. Consider queries like:

SELECT * FROM table1 WHERE table1.a = 1 AND table1.b = 1;

Let's get into a bit of detail about what happens inside the database when a query like this is executed.

In traditional relational databases, the query planner has no knowledge of any correlation between two columns appearing together in a query. If the columns are not correlated, query performance can be good. However, if the columns are correlated, the performance of the query can be bad.

In some cases the planner falls back to an arbitrary default value as the selectivity estimate. For those cases, performance can be really bad.

The main problem is that the database planner does not take multi-variable correlation into consideration. It essentially estimates each column's predicate independently and then combines the results.
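In planner terms, the combination step boils down to multiplying per-clause selectivities; a tiny sketch with made-up numbers:

```python
# Under the independence assumption, the planner estimates
# P(a = 1 AND b = 1) as P(a = 1) * P(b = 1).
rows_in_table = 1_000_000
sel_a = 0.05      # hypothetical selectivity of a = 1
sel_b = 0.05      # hypothetical selectivity of b = 1

estimated_rows = rows_in_table * sel_a * sel_b
print(estimated_rows)   # if a and b are correlated, the true row count
                        # can be orders of magnitude higher
```

When the two predicates actually select the same rows, the true count is closer to `rows_in_table * sel_a`, so the multiplied estimate can be off by a factor of twenty here.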

If we instead treat each query as a single search that keeps all the queried column values in mind, we get one search throughout the search space (which can be larger than in the traditional case, though).

This idea has to be explored, and I think I have some leads here. Sounds interesting? Think, think. Wait for the next blog post for the results :)

Until next time,



Tuesday, July 9, 2013

Make me a data store please!

Today, I am going to talk about something that requires a bit of creativity, intuition and experience, as well as a *lot* of research (doesn't everything require that?). I will start with two *real* real-world cases (yup, they are absolutely real and happened in real life) and then discuss the point I am trying to highlight.

Case 1:

We got a case on the PostgreSQL IRC channel where someone was trying to store a massive amount of data in a Postgres database and wanted to optimize his queries' execution times, so he wanted to know the best way to design his schemas and tables. We asked him about his data: he wanted to store it in the hstore data type, i.e. as key-value pairs. When a lot of pairs are stored in a single hstore value, this can get inefficient, especially with a lot of writes, since a change to any key-value pair leads to the entire value being written back to disk. So, if we have 100 key-value pairs in one hstore value, a change to any one of them leads to a disk write of all 100 pairs.

I recommended that he use a graph database instead, as his data had a lot of connectedness in it. That seemed to be the right idea to him, and he went off with it.

Case 2:

A friend of mine, who works for a scientific development organization, recently had some questions regarding the design of his data storage. He had a lot of key-value-style data and wanted to store it in a form that makes his queries efficient (same thing as everybody, eh?). The catch was that he wanted Postgres to store his main data, for its stability and maturity ( :) ), but wanted his queries to get better.

I told him to use a NoSQL store as a cache (Memcached would be good here, I believe) and keep the hottest key-value pairs there. This would require some work for cache invalidation and updating, probably manual or through some application, but would work for him.

Now, what exactly am I trying to highlight here?

The main point I am trying to highlight is that *data storage requires thought*. You cannot copy your next-door neighbor's design and expect it to work for you.

Another thing it requires is flexibility. If your data design so far used relational databases, your data may or may not fit nicely into the relational model anymore. And vice versa.

One more point I would like to discuss is scalability. Of course, when you start your video sharing website, do not plan on the scale of YouTube (at least, not yet). But, do not design and plan strictly according to your current workload. Think for the future, and design your data stores with the future in mind.

Another thing I commonly observe is how people try to use relational databases (I have seen this for Postgres; I cannot speak for other relational DBMSs) as NoSQL or graph databases. This hurts me quite a bit. The fundamental thing people have to understand is that relational databases are built on the principles of relational algebra and the relational model. If we violate the principles of the relational model, we are violating the fundamental principles that run our database. If we persist in doing that, how can we expect the performance that relational databases normally deliver?

Think of it like a sandwich. Take the bread, which has been developed, refined, and tested with time, and holds the experience of a lot of bakers, and add your own toppings till the sandwich is good enough to make you very, very happy. But do not try to make whole wheat bread taste like honey bread, and vice versa.

Next time, when you design your data stores, think, think, and think.

Till next time,



Friday, July 5, 2013

Scale Me Up -- Musings of a scalability researcher

Scalability. What does it imply?

More connections? Yes
More traffic? Yes
More concurrency issues? Yes
More performance issues? Yes

My work at EnterpriseDB leads me down a path less trodden, the intricate and fine details of scalability.
My musings lie here, in this blog post, which you are reading currently.

Having recently completed the connection throttler project in PEM, I moved on to researching optimal architectures and protocols for establishing a connection between a client and host databases running on different servers, with the PEM server as the intermediary. The architecture had to be scalable. This was a new experience for me, as I had to research network protocols that might work well for our use cases. I was a bit intimidated at first, but then remembered a quote I had read somewhere:

"If you are getting comfortable, move out."

This is what led me to get down and start working on it. A lot of interesting points came up, and I learnt some invaluable things.

I was aware of the standard protocols, SSH, TLS for security etc. Dave advised me to go and research IPSec tunnel as well, which I did.

Some points here:

1) When looking at protocols for scalability, look at the performance. See how much time the protocol's computations would take, and how much CPU they may use. To take an example from my research: TLS authentication through PKI is time-consuming and CPU-intensive compared to IPSec with a pre-shared key, which is much faster.

2) Look at the protocols your protocol depends upon. For example, IPSec (with NAT traversal) runs over UDP, which has no built-in segmentation the way TCP does, so large data packets run into fragmentation problems. This is a disadvantage compared to TLS over TCP.

So, again, you have a plus on each side, and a corresponding minus on each side as well.

The above points are pretty specific to a part of scalability components, but demonstrate an important point, which I am trying to highlight here:

There is no clear answer.

Wow, where did that come from?

Well, I feel that this is the answer to most of the things we do, decision we make. Every component, every option has its own pros and cons, and we need to evaluate everything on the same level, with an open mind.

All solutions are applicable, we just need to find the one that best suits our use cases and work loads.

I will keep you all updated.



Thursday, June 20, 2013

Highly Concurrent Systems -- What they are and how they work -- Part 1

I am currently working as an intern at EnterpriseDB. Here, I am involved in solving scalability issues for PEM, EDB's major product, a monitoring tool for your Postgres databases.

This is the first time I have worked on such a highly concurrent system. It is a different experience altogether, and the issues faced are very different from what you normally get in a single-threaded system.

First up is sheer number. Highly concurrent systems, as the name suggests, usually have a high number of concurrently executing threads, which leads to further complications, if not outright issues.

Then, control. Highly concurrent systems can get out of control quickly, and if not planned and designed in a well-thought-out manner, can exhibit unpredictable behavior. Ensuring serial behavior where it matters is one such potential challenge.

Then, performance. Contrary to the common belief that threading *always* leads to better performance, in some cases thread overhead and maintenance can lead to performance lags, which can hit your business.

Then, finally, serialization, concurrent access control, and lock management. This is an issue faced by every concurrent system, not only highly concurrent ones. For highly concurrent systems it becomes all the more challenging: ensuring that locks are acquired and released at the appropriate points, and that they are actually released, becomes another potential challenge.

My current project limits the number of active connections. This means designing a system that efficiently queues the waiting threads, i.e. those threads which cannot be granted a connection immediately.

Now, when you start modifying existing behavior, you need to ensure that you don't break existing functionality. Beyond not breaking it, you shouldn't noticeably affect the currently running cases either. One way of doing this (as I have done in this project) is to use the existing code and build over it. Some modifications may be required, but reusing existing code that has been tested over time is almost always a good idea. It also saves you from replicating functionality, and is good practice in general.

Another point to look at is testing. For such a highly concurrent system, you need to ensure that your functionality runs for large test data, and that it works *over time*. What I mean is that highly concurrent systems may execute the same functionality in a different order on different runs, so your code may have to deal with varying orders of events. You need to test over time.

Also, you need to test other important things, like ordering. One important specification in my current project is ensuring that the order in which threads come in is the order in which they are later granted permission to establish a connection. This needs some thinking, and some good test cases.
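A minimal sketch of such a throttler (a toy illustration, not PEM's actual code): a counter guarded by a lock, plus a FIFO queue of events, so that a freed slot is always handed to the oldest waiter:

```python
import threading
from collections import deque

class ConnectionThrottler:
    """Toy sketch: at most max_active connections are live, and a freed
    slot always goes to the oldest waiting thread (strict FIFO)."""

    def __init__(self, max_active):
        self.lock = threading.Lock()
        self.max_active = max_active
        self.active = 0
        self.waiters = deque()          # FIFO queue of waiting threads

    def acquire(self):
        with self.lock:
            if self.active < self.max_active and not self.waiters:
                self.active += 1        # free slot and nobody ahead of us
                return
            event = threading.Event()
            self.waiters.append(event)
        event.wait()                    # block until a release hands us a slot

    def release(self):
        with self.lock:
            if self.waiters:
                # Pass the slot directly to the oldest waiter; the active
                # count stays the same because the slot changes hands.
                self.waiters.popleft().set()
            else:
                self.active -= 1
```

Handing the slot directly to the head of the queue (rather than waking everyone and letting them race) is what preserves arrival order under contention.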

These are the basics. Let's think them over, and continue our discussion in the next blog post.



Saturday, May 25, 2013

Inverse distribution functions in PostgreSQL

The idea is to implement inverse distribution functions: calculating percentiles over continuous and discrete values, and median calculation. I have started a bit of research, and hope to get more done as I progress.
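As a sketch of the semantics involved (a plain Python illustration of what `percentile_disc` and `percentile_cont` compute, not the SQL implementation):

```python
import math

def percentile_disc(fraction, values):
    # Discrete percentile: the first value whose cumulative fraction
    # reaches the requested fraction; always an actual element.
    vals = sorted(values)
    idx = math.ceil(fraction * len(vals)) - 1
    return vals[max(idx, 0)]

def percentile_cont(fraction, values):
    # Continuous percentile: linear interpolation between the two
    # nearest ordered values.
    vals = sorted(values)
    pos = fraction * (len(vals) - 1)    # fractional 0-based row number
    lo = int(pos)
    hi = min(lo + 1, len(vals) - 1)
    return vals[lo] + (pos - lo) * (vals[hi] - vals[lo])

data = [1, 2, 3, 4]
print(percentile_disc(0.5, data))   # 2
print(percentile_cont(0.5, data))   # 2.5 (interpolated median)
```

The median is simply the 0.5 case of the continuous variant, which is why the two features naturally fall out of the same implementation.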

I will keep you posted.