
I have a query where one table has ~10 million rows and the other two have fewer than 20 rows each.

SELECT a.name, b.name, c.total
FROM smallTable1 a, smallTable2 b, largeTable c
WHERE c.id1 = a.id AND c.id2 = b.id;

largeTable has columns (id, id1, id2, total) and ~10 million rows

smallTable1 has columns (id, name)

smallTable2 has columns (id, name)

Right now it takes 5 seconds to run.
Is it possible to make it much faster?

  • Have you created appropriate indexes? Commented Jun 5, 2014 at 11:54
  • What are the indexes on a, b and c? Commented Jun 5, 2014 at 11:54
  • I have not created any indexes for these tables. I tried to understand how to use them but whenever I create one and test queries again, it does not successfully improve the query because I am probably using it incorrectly. Before, I tried putting the indexes on largeTable(id1) and largeTable(id2) but it did not improve performance Commented Jun 5, 2014 at 11:56
  • 5 seconds to get 10 million rows does not sound like slow. What are you doing with the results? Commented Jun 5, 2014 at 12:03
  • I'm querying against the 10 million row table with two smaller tables, but the output is only 200 rows max. I am displaying the data when a user submits and want it to show within 1 second. I precomputed the 10 million row table from a 200+ million row table to make it faster for the user. The 10 million row table contains all the possible data that the user might want to see, but I feel like I could make it even smaller and precompute another table. Commented Jun 5, 2014 at 12:07

3 Answers


Create indexes - they are the reason why querying can be fast. Without indexes, the server has to scan the whole table for every lookup.

So:

  1. Create index for SmallTable1(id)
  2. Create index for SmallTable2(id)
  3. Create index for LargeTable(id1) and LargeTable(id2)

Important: you can create an index on more than one column at the same time, like this: LargeTable(id1, id2). DO NOT do that here, because it does not make sense in your case.
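
For reference, the DDL for the three single-column indexes listed above could look roughly like this (the index names are just placeholders; if id is already the primary key of the small tables, those columns are indexed implicitly and the first two statements can be skipped):

CREATE INDEX idx_smallTable1_id ON smallTable1 (id);
CREATE INDEX idx_smallTable2_id ON smallTable2 (id);
CREATE INDEX idx_largeTable_id1 ON largeTable (id1);
CREATE INDEX idx_largeTable_id2 ON largeTable (id2);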

Next, your query is not wrong out of the box, but it does not follow best-practice query style. Relational databases are based on set theory, so you have to think in terms of "bags of marbles" instead of "cells in a table". Roughly, your original query translates to:

  1. Get EVERYTHING from LargeTable c, SmallTable1 a and SmallTable2 b
  2. Now that you have all this information, find the rows where c.id1 = a.id AND c.id2 = b.id (there go your 5+ seconds, because this step is fairly resource-intensive)

Ambrish has suggested the correct query; use that, although it will not be faster.

Why? Because in the end, you still pull all the data from the table out of the database.

As for the data itself: a table with 10 million records is not ridiculously large, but it is not small either. In data warehousing, the star schema is the standard, and you basically have a star schema here. The problem you are actually facing is that the result has to be calculated on the fly, and that takes time.

The reason I'm telling you this is that engineers in corporate environments face this problem on a daily basis, and the solution is OLAP: basically pre-calculated, pre-aggregated, pre-summarized, pre-everything data. The end users then query only this precalculated data, so the query seems very fast, but it is never 100% up to date, because there is a delay between OLTP (on-line transaction processing = the day-to-day database) and OLAP (on-line analytical processing = the reporting database).

The indexes will help with queries such as WHERE id = 3 etc. But when you are joining and basically pulling everything from the database, they probably wouldn't play a significant role in your case.
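
To give a rough, purely illustrative idea of what such pre-calculation could look like for this exact query (the table name and refresh strategy are assumptions, and CREATE TABLE ... AS SELECT syntax varies by product; SQL Server uses SELECT ... INTO instead):

-- hypothetical pre-joined reporting table, rebuilt periodically (e.g. nightly)
CREATE TABLE preJoinedReport AS
SELECT a.name AS name_a, b.name AS name_b, c.total
FROM smallTable1 a
JOIN largeTable c ON c.id1 = a.id
JOIN smallTable2 b ON c.id2 = b.id;

End users would then query preJoinedReport directly instead of paying for the join at request time.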

So, to make a long story short: if your only options are queries, it will be hard to make an improvement.


6 Comments

I'm confused how this is marked as the correct answer, although the comments below seem to indicate the indexing did not help. From my point of view, if it only takes 5 seconds without indexes, it should fly with the proper indexes. Saying that "if your only options are queries, it will be hard to make an improvement" sounds a lot like throwing in the towel a bit prematurely, IMHO.
I also wrote that indexing probably won't make a difference, because there is no data reduction when joining. When you are pulling 10 million records and joining them with two other tables, you pay the price. The database schema looks very similar to a star schema, which is one of the basics of OLAP: a large fact table with small, not-often-updated referenced tables (dimension tables) around it (hence the name "star" schema). OLAP cubes then aggregate the data, so when you query an OLAP cube, the query is fast. I also suggest reading up on indexing in SQL servers.
Thanks, I might. In the meantime I'm assuming that what really happens here is that the OP is querying/filtering on some other field(s) than the id fields mentioned. That way he gets about 200 rows back out of the 10M. It seems fair to assume that this can be easily made very fast using the proper indexes. OLAP is great, especially if you're aggregating but I don't see any grouping going on in the original query.
Hm, I don't think so. Although indexing usually helps, it doesn't in this case because there is no WHERE clause. If you look up "relational algebra" or look at the execution plan of the query, you will see that the server always optimizes on the fly; meaning, it always tries to reduce the data before doing the joins. In this case, there is nothing to reduce, so it does a full-blown join of everything, and there is the performance hit. At least that is how I see it.
Oh, and yes, there are no aggregations in the initial query. That is true. But there are joins. And cubes don't only aggregate; they also transform/adapt, join, etc. That is why cubes are usually not 100% up to date: the cube processing (the process of filling the cube with data) usually takes place at night, or at least outside business hours. This process can take up a lot of resources and, more importantly, time. That is why we use them - to speed everything up for the end users, at the price of waiting for all possible aggregations, joins, etc.

There is one circumstance under which separately indexing id1 and id2 in the large table will make less of a difference. If there are 9,000,000 rows whose id1 matches a smallTable1.id and 200 rows whose id2 matches a smallTable2.id, with those 200 being the only rows where both match at the same time, you will still be doing almost a complete table/index scan. If that is the case, creating a single index on both id1 and id2 should speed things up, as the server can then locate those 200 rows with index seeks.

If that works, you may want to include Total in that index to make it a covering index for that table.
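
As a sketch (the index name is a placeholder and exact syntax varies by product; on SQL Server you could instead add total as a non-key INCLUDE column):

CREATE INDEX idx_largeTable_id1_id2_total ON largeTable (id1, id2, total);

-- SQL Server variant with total as an included (covering) column:
-- CREATE INDEX idx_largeTable_id1_id2 ON largeTable (id1, id2) INCLUDE (total);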

This solution (assuming it is one) would be extremely data-dependent, so its effectiveness would change if the data changes significantly.

Whatever you decide to do, I would suggest you make one change (create an index or whatever) then check the execution plan. Make another change and check the execution plan. Make another change and check the execution plan. Repeat or rewind as needed.
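
How you view the execution plan depends on the database; for example, with the query from the question:

-- MySQL / PostgreSQL
EXPLAIN
SELECT a.name, b.name, c.total
FROM smallTable1 a
JOIN largeTable c ON c.id1 = a.id
JOIN smallTable2 b ON c.id2 = b.id;

-- SQL Server: show the actual execution plan in Management Studio,
-- or run SET STATISTICS PROFILE ON; before executing the query.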



Use an explicit JOIN instead of the WHERE clause:

SELECT a.name, b.name, c.total
FROM smallTable1 a
JOIN largeTable c ON c.id1 = a.id
JOIN smallTable2 b ON c.id2 = b.id;

And create indexes on largeTable(id1) and largeTable(id2).

4 Comments

Although an excellent idea, this should make no difference to performance.
How about with indexes?
I will try this and see if it changes. I always thought that joins were no different from using WHERE but I guess I was wrong. Thanks. EDIT: I created the indices and changed the query to that format but the performance is the same
@Ambrish . . . SQL isn't executed from the text statements that are written. It is compiled and optimized to an internal form. In most cases, the use of explicit or implicit joins are treated the same way by the optimizer (there may be some extreme cases in Oracle or with lots of joins in SQL Server where this isn't 100% true), but it is close enough to being true that you don't need to worry about it.
