Atri Sharma's Blog

Thursday, December 27, 2018

Introducing GraphHeuristicPlanner -- Prototype Planner for Generic Graph Languages

For the last week, I have been prototyping a graph planner and executor for graph languages. Specifically, I have been motivated to pursue this line of thought for Gremlin, but the engine being designed is generic to support any finite grammar. This has been demonstrated by supporting a dummy grammar in the initial prototype. The prototype has support for Gremlin.

Dummy Grammar

The grammar supported as the prototype is known as "DummyGrammar". The structure of the graph language supported is as below:

NodeName.WhereClause.MatchesClause

Parser

The parser parses the raw query and populates an AST for further processing.

Heuristic Planner

The heuristic planner works in two stages:

1) First Phase Optimizer: First phase optimizer applies heuristics on the original AST and allows the AST to become more structurally optimized. These optimizations can be language agnostic.

2) Second Phase Optimizer: Second Phase Optimizer is responsible for the actual generation of the plan for execution. Second phase optimizer takes dynamic factors into account, such as the current memory usage to take decisions about the best execution strategy for the search.

Executor

The executor works by creating candidates for the next partial solution using the chosen execution strategy as defined by the heuristic planner. However, the executor has a dynamic feedback mechanism, where the executor has the flexibility to generate candidates using different heuristics suited for different conditions within the same algorithm. For eg, within Hill Climbing algorithm, a choice of simple heuristic vs steep climbing heuristic may be done by the executor during execution by generating candidates using both heuristics and then assigning them a cost. This allows for any number of heuristic functions to be dynamically added and used within a search algorithm, thanks to the service based design of the cost factory referred to by the executor.

ToDo

The project is in no way complete and needs extensive tests, planner heuristics and better execution in terms of multiple heuristics present.

The objective of the prototype is to demonstrate how a cost-based heuristical model can be applied to a graph traversal system, thus allowing a cost-based optimizer and declarative queries on top of graph-structured storage systems. This will open new avenues in the space of graph query languages, especially around ease of usage, performance and analytics.

Code

Github Link

Wednesday, November 23, 2016

The Power of Intimacy

A question I get asked a lot is the magic formula to do great work. There is an inherent charm in believing that an exotic potion is all one needs to surpass himself, make his mark in the world and live a fulfilling and complete life.

However, the wiser of us know that there is no secret elixir hidden somewhere in the woods. We all know the standard recipe for success, which includes a heavy dose of hard work, a large dollop of perseverance and a dash of madness.

Let us now focus on the components discussed above. Madness needs a daredevil attitude, a love for the unexpected and a high degree of confidence in self. However, the other two attributes can be a bit elusive especially when demanded over a long period of time, consistently.

The next big question is, then, how does one grab the bull by the horns? I commiserate with the reader if he has felt the pangs of frustration when trying to be persistent and keeping the work levels at the maximum. Maintaining hard work and perseverance can be frustrating, especially if the results are not immediate and tend to be over the horizon.

What is the key then? The key is to love your task so much that you forget all but your craft. The key is to feel so passionate about the task at hand that any distraction, any shiny short term glory seems devoid of sheen when put in contrast with the magnificent glory of your innate fire which burns only for the task.

The road to Rome is not so simple, though. How does one love a task so much? How does one ignite the fire of passion which will keep burning without being quenched by greed and glory? How does one truly feel the power of desire?

The key to that is a key to not only greatness in work, but greatness in all aspects of the life. That key will allow you to feel the love for many critical parts of your life, and when you put in hard work, perseverance and a ting of madness in those parts of your life, a legendary and happy life will greet you with open arms.

The way to love something or someone is to appreciate the beauty that lies inside the object, person or task. The way is to see what only you can see, to gaze in wonder at the magical components of the subject and to truly appreciate the chance to witness such a wonder.

Which brings me to the most important point of all. To witness such a beauty, to enter the inner sanctum, you need to know the subject to a deep level, and to know something or somebody to the core, you need to intimately engage with them.

Take a look at a lego box. The small insignificant pieces that are scattered around, when combined in just the right way, build an astonishing piece of marvel. It is the same with anything else. The small components, gestures and emotions which might be touted as insignificant and not deserving of your priceless attention might be the cogs in the machinery of magic that you wish to build.

When interacting with your life, be passionate and inherently curious. Know the inner details of your task. Know exactly how the radio that you are trying to fix works. Know the exact emotions that someone important in your life will be feeling when faced with a situation. Be intimate to a level where the object or person has no qualms in spilling their inner workings and feelings to you, so that you become a piece in the jigsaw puzzle and help in making the object or the life of the person important to you, better.

There is an enigmatic charm and joy in knowing that you connect with a piece of nature to the core and can contribute to the improvement of the same. When you focus all your attention and curiosity at something and someone, the outer layers just keep falling away and what is left is what matters.

Peace

Monday, June 27, 2016

Resource Queues 2.0 -- The Next Version of Resource Management in Greenplum

Hi All,

This post covers my work around the next version of Resource Queues in Greenplum. Resource Queues are resource management for queries within Greenplum. This basically means that any query (except superuser queries) need to be run within limits of a given Resource Queue. This ensures that resource utilization is limited and can be monitored and modified. This also allows per role based resource usage prioritization.

Resource Queues allow the defining of three resource limitations. Number of queries that can be run by a role at a given point of time, the maximum amount of memory that can be consumed by all the queries running concurrently under a user at a given point of time, and the maximum cost that can be allowed for a plan selected in per role basis. So you can effectively limit the three parameters per role and that can be modified as you go ahead.

The first implementation of Resource Queues had some major issues around deadlocks. I will explain the problem further down.

Resource Queues use database locks inside Greenplum. What that implies is that the locks taken by Resource Queues is the same as being done by database shared objects, such as tables. Now, Greenplum acquires relation locks at various points in the query lifestage. For eg, in query execution. Resource Queue limits are evaluated after the planner stage, so that Resource Queue planning cost limits can be evaluated. If the resource limits are being exceeded by the current query, the current query will be made to sleep until another query releases the resources being used and then the current sleeping query will be woken up and the resource requirements will be checked.

Now, there are some cases where this steps on each other's toes. For example, consider the following case:

Query 1 starts
Query 2 starts

Query 1 is made to sleep due to excessive resource usage
Query 2 gets the Resource Queue slot lock

Query 1 is waiting for the Resource Queue lock.
Query 2 is waiting for a relation lock held be Query 1.

This leads to a deadlock.

This was a long standing problem.

The solution implemented in my new approach is as below:

Get the Resource Queue Slot lock prior to any other potential lock. This can be at the time of query admission control. This has the problem that the cost based checking cannot be done since that is a post planner stage task. For this case, the solution entails holding the lock till that stage but using a waiter queue to manage the resource queue slot locks. This basically means that if cost based checking exceeds the limit for a specific query, the query releases all the locks it holds and enters a waiter queue. Now, any other query waiting for the lock shall be ahead of the current query in the waiter queue, so will get the lock first. In the above case, Query 2 will hold Resource Queue Slot lock and relation lock so it can proceed.

Query 2 will then release its locks and Query 1 will get both Resource Queue lock and the relation lock so it can proceed.

This removes the non deterministic locking that happens around Resource Queue Slot locks and shared database object locks. The core reason is that relation locks are acquired sometimes during the parser stage as well because we do not want underlying relation to change as we proceed. This leads to a circular deadlock issue.

The pull request for the implementation is: https://github.com/greenplum-db/gpdb/pull/739

This serves as a good PoC to demonstrate resource management without deadlocks. Ideally, Resource Queues should not be using the same shared database object locks since they are essentially not the same but that is a different problem space.

Tuesday, September 23, 2014

Nash Equilibrium for Inter Dependent Repeating Decision Making Events

Hi All,

The concept of Nash Equilibrium is well known. The idea is that the outcome of multiple decision making activities is dependent on the best decision taken by each decision making party and considering the decision by every other decision making party.

Quoting Wikipedia, if Bill and Amy are playing a game, if Bill makes his best possible decision taking into account Amy's decision and Amy makes her best possible decision taking into account Bill's decision, Bill and Amy are in Nash Equilibrium.

Now, consider that Bill and Amy are meeting for the second time. In the first meeting, Bill and Amy did not have a great experience. Say Bill offered Amy a gift and Amy did not like the gift. The second time Bill and Amy meet, there may or may not be a partial overlap of the original input set (Bill may or may not offer Amy a gift). However, Amy's interpretation of the second event will be influenced by the outcome of the previous meeting.

Consider an advertisement rebidding auction. An intelligent system shall take into consideration the results of previous bidding outcomes. A consistent winner shall have a bigger constant of victory probability.

Based on mixed strategies Nash Equilibrium, where players choose a probability distribution, simply put, given a set of repeating interactions between a set of decision making parties with events which may or may not have overlapping input sets, each event shall lead to a certain amount of biasing of the probability distribution towards the winner.

This can help in modelling real world scenarios where improving the probability of output in a mixed strategies Nash Equilibrium state for repeated interactions between same decision making parties.
A bias towards successful interactions can lead to better and higher winning probabilities in the equilibrium state.

Nash Equilibrium assumes a same state of initial influence of each decision making party. The above model can help if the initial states are itself biased i.e. if one decision making party has a higher influence, the bias can be more towards him. However, if the party does not have an optimal equilibrium, the biasing can shift away hence normalizing the distribution.

The system is essentially introducing a 'feedback' mechanism to the decision making process for further repetitions. The probability distribution functions shall be arranged in a manner that any further interactions shall have a higher bias towards the upper half of the probability distribution spectrum.

Essentially, the idea in the theory that I am talking about in this post is that Nash Equilibrium does not have Markov property hence can be fruitfully biased based on previous events to have a better probability distribution.

I ran some tests for the theory and the results seem to be inline with what I researched above. The biasing constant can be a constant for each decision making party involved or can be experimentally determined for each decision making party involved or can be started off with a basic value and changed (learned) over time.

Of course, it is a pretty basic form and I shall have to put in more research to develop this further and submit it for peer review journals, but the initial outlook seems good!

Feedback would be great.

Peace,

Atri

Tuesday, July 15, 2014

Time to retire B Tree from database storage? Rise of cross correlation and the need for change in design

B Tree has been the standard data structure to store data in a database. Pages for a database are maintained in a BTree with various semantics. BTree has been the stable choice for such applications for a long time. BTree is perfect for many reasons. It has more width, allowing for lesser disk access even for a large number of pages, allowing for better storage structures, faster access of pages.

But, BTree has a drawback, which is actually a drawback for all one dimensional data structures. BTree gives no way to track correlations between multiple attributes or capture the "one-ness" of the various attributes present in the same tuple.

Building a BTree index on top of the data doesnt help as well. All attributes are treated independently and the data is arranged per page or per attribute.

The problem here is that the current design can lead to query planner not knowing anything about correlated data or the possibility of correlation between the actual data present in two attributes of a relation. This can lead to query planner not using the correlation to generate more optimal plans.

Let us discuss a case where this kind of design can lead to sub optimal query plans. Consider the following data:

Car	Age (months)	Minimum Stopping at 40 kph (metres)
A	9	28.4
B	15	29.3
C	24	37.6
D	30	36.2
E	38	36.5
F	46	35.3
G	53	36.2
H	60	44.1
I	64	44.8
J	76	47.2

A strong correlation is present between car age and minimum stopping at 40 kph. Looking for a query which could be a part of an analytical application. A predictive application trying to look at predicting the next tuple should get plans based on the already existing correlation between car age and minimum stopping at 40 kph.

If we use a data structure like Doubly Chained Tree (http://en.wikipedia.org/wiki/Doubly_chained_tree), we can effectively capture the data that is "multi dimensional". The capturing can be as simple maintaining pointers to data items for the same attribute to having a complex design capturing the actual correlation in terms of median and other statistical terms.

Using a multi dimensional data structure also allows for faster lookups from disk for queries that involve fetching large data for a single attribute and then joining it to another attribute of the same table. Using DCT, we can simply go and fetch the values for both the attributes simultaneously, using multiple pointers present.

This shall require a lot of research. Lets hope we get positive results for this one.

For more details, please look at: https://github.com/atris/Concerted

Wednesday, May 21, 2014

The Unicorn Chase

Unicorns.

The mythical, mystical, beautiful inhabitants of our fairy tale world who represent love, beauty, peace, passion, glory, happiness and the general attributes of everything good.

Unicorns are supposed to be pure, untouched, innocent. They represent everything that the human spirit can think of. Deep down within, we all wish to be in a state as the unicorns. Happy, contented, carefree, and masters of our selves.

This leads us to the unicorn chase. Unicorn chase represents the run after happiness that we are all engaged in. We try to get to a point where we can be happy in totality. Hence, we start running after 'Unicorns' or goals, then bigger goals, then even bigger goals.

While that is never a bad idea, one thing that we forget is that the path of life does not mandate us to seek happiness in all aspects of life. Happiness does not mean having the best in *all* parts of our life. We try to achieve a state where everything we think and do should be completed, achieved, and found in glory.

Sit back and think when was the last time you tried appreciating the beauty of the voice of your friend, notice the small sounds that occur outside your window, or just breath freely, openly, and in a carefree state?

Happiness is different for each one of us. We try to achieve happiness by the de facto standards, which isnt really a good idea. We must all find our own unicorns, then achieve them.

Our chases are different, and they arent really a chase. Its a path, where you tread, move, and reach the destination. You shall find your unicorn standing there.

Sometimes, we try to drive life beyond a point and forget that there are somethings that happen totally for our good, even if we dont realize that at this point of time.

For e.g., when I did not get into a top grade engineering college, I was a bit disappointed, like anyone else in that situation would be (although since my expectations were low, I was not that sad). I thought that the way I could be happiest or even happy is if I get to a top grade institution.

When I joined JIIT, I was happy but that pang of sadness hadnt really gone away. With time, what I did realize was that JIIT was probably the best thing that happened to me. I could stay with my parents at home, and since my home is close by, I could have the fun of hostlers as well. This gave me the time to focus on my studies and work and achieve things that I felt were important. Also, JIIT gave me the freedom to work on things I liked, and passing wasnt such a hard task there, so I had ample time to go and chase my unicorn.

I am not saying that I wouldnt have been happy at some more prestigious institution. I would probably have more avenues and opportunities to do things there, but would I be upto the challenge? Would I really be at home in a bigger institution? Would my priorities, like freedom, autonomy, anytime access to anybody in the institution be present? Would my unicorn be waiting for me there?

We normally do not get all that we seek in a single opportunity. If we keep looking for the opportunity that gives us everything we want at once, we would probably get nowhere. What we need to do is identify what matters to us, who matters to us and what we are willing to sacrifice. Then, take the path to our destiny and achieve our unicorn.

Never be ashamed to be scared. Never be ashamed to be emotional. Never be scared to dream bigger. Never be scared to love passionately. Never be ashamed to recognize and identify your weakness.

Above all, never be scared to admit that your unicorn is imperfect. It may be imperfect, but it is yours.

Happy traveling down your path to achieve your unicorn.

Peace,

Atri

Wednesday, April 2, 2014

Query plan risk estimation -- 'Query Fudge Factor'

Hi all,

Recent discussion on PG-Hackers has led me to this blog post. The concept of query risk estimation is a concept which shall again trod down the same path i.e. the path of improving the estimation done by the query planner.

The idea started with a discussion on planner hints and why PostgreSQL doesn't have them or shouldn't have them. The main part is that the user cant be allowed to give a plan to the query planner and force its execution, but the user knows more about the data than we do, hence there may be patterns in the data that can lead to good/bad plans and that we aren't aware of.

Lets take an example. One of the most common examples we see is the lack of correlation between multiple attributes in a table. The data for many use cases is some real world data (insurance data for e.g.). Now, PostgreSQL query planner does not take cross column statistics into consideration and hence there may be patterns that we are missing.

Is allowing the user to tell a plan or even a part of plan and the planner following only that plan really a good idea? Maybe not, but we could go and actually get some hinting from the user in place for the selectivity it expects for joining attributes. We shall not have the hints that Oracle has. We can have something which is much cleaner and easier to use,though. That will need more thinking and planning than what we currently know.

There is one thing that can be done which can prove helpful though. There is a source of problems that arise in the query planner due to a reason I shall explain below:

Suppose a table is made with insertion of some data. The query planner maintains data about the distribution of values in the attributes in the table. The data is maintained in two forms MCVs and histogram of the remaining values. MCVs are Most Common Values are values which are most frequent values in the attribute and the exact frequency of MCVs are maintained.

The statistical details maintained in query planner are updated by executing ANALYZE command or are updated by VACUUM.

Suppose the table is inserted with a lot of data/ deletion of a lot data that happens suddenly. The autovaccum run is still pending and hence the distribution in the actual data has changed a lot.n The known stats in the query planner are outdated now and selectivity estimates on those query planner stats will be wrong. Very wrong.

This is not the only case. For e.g., as mentioned by Tom lane, LIMIT assumes a normal distribution of the data and then plans, which can be quite incorrect.

A recent discussion about that was introduction of a query plan risk estimation. The point is to add a new factor in query planning which is the amount of confidence we have in the query plan or selectivity estimate the query planner has just given.

One way is to use a statistical method to predict the amount of deviation of the current actual histogram of an attribute of a table from the known stats in the query planner. The farther the distance, the lesser the confidence in the plan. The same can be achieved using Central Limit Theorem.

A method I proposed is:

http://www.postgresql.org/message-id/CAOeZVieVa+V-6NYzLLofG6uXfs9Y7N4cUzeoYrYKaQg23nohuw@mail.gmail.com

I havent researched it into deep yet but plan to soon enough. I would love to get some feedback on this though.

I hope that we have something on the lines of query risk factor in soon so that our query plans are better performing.

Peace,

Atri