
I have a query where one table has ~10 million rows and the other two have fewer than 20 rows each.

SELECT a.name, b.name, c.total
FROM smallTable1 a, smallTable2 b, largeTable c
WHERE c.id1 = a.id AND c.id2 = b.id;

largeTable has columns (id, id1, id2, total) and ~10 million rows

smallTable1 has columns (id, name)

smallTable2 has columns (id, name)

Right now it takes 5 seconds to run.
Is it possible to make it much faster?

  • Have you created appropriate indexes? Commented Jun 5, 2014 at 11:54
  • What are the indexes on a, b and c? Commented Jun 5, 2014 at 11:54
  • I have not created any indexes for these tables. I tried to understand how to use them but whenever I create one and test queries again, it does not successfully improve the query because I am probably using it incorrectly. Before, I tried putting the indexes on largeTable(id1) and largeTable(id2) but it did not improve performance Commented Jun 5, 2014 at 11:56
  • 5 seconds to get 10 million rows does not sound like slow. What are you doing with the results? Commented Jun 5, 2014 at 12:03
  • I'm querying against the 10 million row table with two smaller tables, but the output is only 200 rows max. I am displaying the data when a user submits and want it to show within 1 second. I precomputed the 10 million row table from a 200+ million row table to make it faster for the user. The 10 million row table contains all the possible data that the user might want to see, but I feel like I could make it even smaller and precompute another table. Commented Jun 5, 2014 at 12:07

3 Answers


Create indexes - they are the reason why querying can be fast. Without indexes, the server has to scan the whole table for every lookup.

So:

  1. Create index for SmallTable1(id)
  2. Create index for SmallTable2(id)
  3. Create index for LargeTable(id1) and LargeTable(id2)

Important: you can create an index on more than one column at the same time, like this: LargeTable(id1, id2). DO NOT do that here, because it does not make sense in your case.
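
For reference, the DDL for the three single-column indexes listed above could look roughly like this (the index names are just placeholders; if id is already the primary key of the small tables, those columns are indexed implicitly and the first two statements can be skipped):

CREATE INDEX idx_smallTable1_id ON smallTable1 (id);
CREATE INDEX idx_smallTable2_id ON smallTable2 (id);
CREATE INDEX idx_largeTable_id1 ON largeTable (id1);
CREATE INDEX idx_largeTable_id2 ON largeTable (id2);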

Next, your query is not wrong out of the box, but it does not follow best-practice query style. Relational databases are based on set theory, so you have to think in terms of "bags of marbles" instead of "cells in a table". Roughly, your original query translates to:

  1. Get EVERYTHING from LargeTable c, SmallTable1 a and SmallTable2 b
  2. Now that you have all this information, find the rows where c.id1 = a.id AND c.id2 = b.id (there go your 5+ seconds, because this step is fairly resource-intensive)

Ambrish has suggested the correct query; use that, although it will not be faster.

Why? Because in the end, you still pull all the data from the table out of the database.

As for the data itself: a table with 10 million records is not ridiculously large, but it is not small either. In data warehousing, the star schema is the standard, and you basically have a star schema here. The problem you are actually facing is that the result has to be calculated on the fly, and that takes time.

The reason I'm telling you this is that engineers in corporate environments face this problem on a daily basis, and the solution is OLAP: basically pre-calculated, pre-aggregated, pre-summarized, pre-everything data. The end users then query only this precalculated data, so the query seems very fast, but it is never 100% up to date, because there is a delay between OLTP (on-line transaction processing = the day-to-day database) and OLAP (on-line analytical processing = the reporting database).

The indexes will help with queries such as WHERE id = 3 etc. But when you are joining and basically pulling everything from the database, they probably wouldn't play a significant role in your case.
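
To give a rough, purely illustrative idea of what such pre-calculation could look like for this exact query (the table name and refresh strategy are assumptions, and CREATE TABLE ... AS SELECT syntax varies by product; SQL Server uses SELECT ... INTO instead):

-- hypothetical pre-joined reporting table, rebuilt periodically (e.g. nightly)
CREATE TABLE preJoinedReport AS
SELECT a.name AS name_a, b.name AS name_b, c.total
FROM smallTable1 a
JOIN largeTable c ON c.id1 = a.id
JOIN smallTable2 b ON c.id2 = b.id;

End users would then query preJoinedReport directly instead of paying for the join at request time.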

So, to make a long story short: if your only options are queries, it will be hard to make an improvement.


6 Comments

I'm confused how this is marked as the correct answer, although the comments below seem to indicate the indexing did not help. From my point of view, if it only takes 5 seconds without indexes, it should fly with the proper indexes. Saying that "if your only options are queries, it will be hard to make an improvement" sounds a lot like throwing in the towel a bit prematurely, IMHO.
I also wrote that indexing probably won't make a difference, because there is no data reduction when joining. When you are pulling 10 million records and joining them with two other tables, you pay the price. The database schema looks very similar to a star schema, which is one of the basics of OLAP: a large fact table with small, not-often-updated referenced tables (dimension tables) around it (hence the name "star" schema). OLAP cubes then aggregate the data, so when you query an OLAP cube, the query is fast. I also suggest reading up on indexing in SQL servers.
Thanks, I might. In the meantime I'm assuming that what really happens here is that the OP is querying/filtering on some other field(s) than the id fields mentioned. That way he gets about 200 rows back out of the 10M. It seems fair to assume that this can be easily made very fast using the proper indexes. OLAP is great, especially if you're aggregating but I don't see any grouping going on in the original query.
Hm, I don't think so. Although indexing usually helps, it doesn't in this case because there is no WHERE clause. If you look up "relational algebra" or look at the execution plan of the query, you will see that the server always optimizes on the fly; meaning, it always tries to reduce the data before doing the joins. In this case, there is nothing to reduce, so it does a full-blown join of everything, and there is the performance hit. At least that is how I see it.
Oh, and yes, there are no aggregations in the initial query. That is true. But there are joins. And cubes don't only aggregate; they also transform/adapt, join, etc. That is why cubes are usually not 100% up to date: the cube processing (the process of filling the cube with data) usually takes place at night, or at least outside business hours. This process can take up a lot of resources and, more importantly, time. That is why we use them - to speed everything up for the end users, at the price of waiting for all possible aggregations, joins, etc.

There is one circumstance under which separately indexing id1 and id2 in the large table will make less of a difference. If there are 9,000,000 rows whose id1 matches a smallTable1.id and 200 rows whose id2 matches a smallTable2.id, with those 200 being the only rows where both match at the same time, you will still be doing almost a complete table/index scan. If that is the case, creating a single index on both id1 and id2 should speed things up, as the server can then locate those 200 rows with index seeks.

If that works, you may want to include Total in that index to make it a covering index for that table.
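
As a sketch (the index name is a placeholder and exact syntax varies by product; on SQL Server you could instead add total as a non-key INCLUDE column):

CREATE INDEX idx_largeTable_id1_id2_total ON largeTable (id1, id2, total);

-- SQL Server variant with total as an included (covering) column:
-- CREATE INDEX idx_largeTable_id1_id2 ON largeTable (id1, id2) INCLUDE (total);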

This solution (assuming it is one) would be extremely data-dependent, so its effectiveness would change if the data changes significantly.

Whatever you decide to do, I would suggest you make one change (create an index or whatever) then check the execution plan. Make another change and check the execution plan. Make another change and check the execution plan. Repeat or rewind as needed.
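
How you view the execution plan depends on the database; for example, with the query from the question:

-- MySQL / PostgreSQL
EXPLAIN
SELECT a.name, b.name, c.total
FROM smallTable1 a
JOIN largeTable c ON c.id1 = a.id
JOIN smallTable2 b ON c.id2 = b.id;

-- SQL Server: show the actual execution plan in Management Studio,
-- or run SET STATISTICS PROFILE ON; before executing the query.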



Use an explicit JOIN instead of the WHERE clause:

SELECT a.name, b.name, c.total
FROM smallTable1 a
JOIN largeTable c ON c.id1 = a.id
JOIN smallTable2 b ON c.id2 = b.id;

And create indexes on largeTable(id1) and largeTable(id2).

4 Comments

Although an excellent idea, this should make no difference to performance.
How about with indexes?
I will try this and see if it changes. I always thought that joins were no different from using WHERE but I guess I was wrong. Thanks. EDIT: I created the indices and changed the query to that format but the performance is the same
@Ambrish . . . SQL isn't executed from the text statements that are written. It is compiled and optimized to an internal form. In most cases, the use of explicit or implicit joins are treated the same way by the optimizer (there may be some extreme cases in Oracle or with lots of joins in SQL Server where this isn't 100% true), but it is close enough to being true that you don't need to worry about it.
