Home / SQL query to work out top selling categories

SQL query to work out top selling categories

I sell Linux on CD on my Linux CD Mall website, and it has a page showing the most popular Linux and BSD distributions that it sells. For some reason, I’d never set this up to query the information from the database, instead manually updating it periodically. Today I decided to actually get around to making it read the information from the database and have decided to share the SQL queries etc used to get the information for top sellers by category.

A summary of the relevant fields and tables are as follows (I’ve modified the tablenames and fieldnames from the actual ones to make them more generic to a regular ecommerce solution):

orders_header
order_id - the primary key
status - the order's status
order_date - the date the order was placed

orders_detail
order_id - value to join this table to orders_header
product_id - value to join this to products_to_categories

products_to_categories
product_id - the product's id, used to join to orders_detail
category_id the category's id, used to join to categories

categories
category_id - the primary key
name - the name of the category
url - the url for the category page

The end result required is to have a list of category names sorted in order from the most popular to the least popular. In my case, I want to show just the top 10 but the SQL query can be easily adjusted to show more. The actual database structure for my Linux CD Mall website allows a product (in this case a CD set or DVD) to belong to one category (a Linux or BSD distribution) so a category can only be counted once per orders_detail line. If a product can belong to more than one category then a category may be counted multiple times per orders_detail line, but this may be acceptable depending on the circumstances.

The query used to get the data for the previous month is as follows (assuming the current date is 11 December 2007, making last month November 2007):

SELECT c.name, c.url, COUNT(*)
FROM orders_header h
INNER JOIN orders_detail d ON h.order_id = d.order_id
INNER JOIN products_to_categories p2c ON d.product_id = p2c.product_id
INNER JOIN product p ON p2c.product_id = p.product_id
WHERE h.status >= 30
AND h.order_date >= '2007-11-01'
AND h.order_date <= '2007-11-30'
GROUP BY c.name, c.url
ORDER BY COUNT(*) DESC
LIMIT 10

In my orders_header table, a status of 30 means it has been paid and a status of 40 means it has been sent; we only want to include orders that have been paid or sent, hence the "WHERE h.status >= 30" condition.

When doing a query like this, you need to make sure all the columns in the joins and where conditions are indexed, otherwise the query may take a considerable amount of time. Adding indexes should make the queries run a lot faster. The actual query I use on my CD site is slightly more complex than the above, and was taking around 4 seconds to run, despite a lot of optimization. In the end I added a query cache to MySQL which fixed the issue for subsequent queries. You can read about this in the next post.