Advanced MySQL Question - Retrieving Similar Items Based on Tags
#1

[eluser]danoph[/eluser]
I have products on my website and I show similar items based on the tags for those products. The problem is that the larger the database gets, the more of a bottleneck this query becomes. Right now it sorts the results so that items with the most matching tags rank as most relevant. Does anyone know how to fix this query so it is more scalable? Is there an easy fix that I'm not seeing? The query and tag table structures are below:

Code:
SELECT distinct item_id, count(*) as matches
FROM (item_tags)
WHERE tag_id IN ('11196', '18447', '10479', '10003', '10010', '11197', '10080', '13401', '10059', '18790', '11123', '17837', '18435', '18428', '16590', '11875', '16566', '12514', '11553', '11210', '12751', '13225', '10019', '22981', '11388', '14425', '10270', '14278', '18455', '13895', '10860', '10844', '13183', '18022', '10023', '11206', '19271', '18444', '12362', '10071', '10069')
AND `item_id` != "125400"
GROUP BY item_id
ORDER BY matches desc, item_id desc
LIMIT 10

Code:
mysql> describe tags;
+-------+--------------+------+-----+---------+----------------+
| Field | Type         | Null | Key | Default | Extra          |
+-------+--------------+------+-----+---------+----------------+
| id    | int(11)      | NO   | PRI | NULL    | auto_increment |
| tag   | varchar(255) | YES  | UNI | NULL    |                |
| count | int(11)      | YES  |     | 1       |                |
+-------+--------------+------+-----+---------+----------------+

and...
Code:
mysql> describe item_tags;
+---------+---------+------+-----+---------+----------------+
| Field   | Type    | Null | Key | Default | Extra          |
+---------+---------+------+-----+---------+----------------+
| id      | int(11) | NO   | PRI | NULL    | auto_increment |
| item_id | int(11) | YES  | MUL | NULL    |                |
| tag_id  | int(11) | YES  | MUL | NULL    |                |
+---------+---------+------+-----+---------+----------------+

item_tags indexes:
Code:
mysql> show indexes from item_tags;
+-----------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table     | Non_unique | Key_name         | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-----------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| item_tags |          0 | PRIMARY          |            1 | id          | A         |     3803470 |     NULL | NULL   |      | BTREE      |         |
| item_tags |          1 | index_items      |            1 | item_id     | A         |      118858 |     NULL | NULL   | YES  | BTREE      |         |
| item_tags |          1 | index_items_tags |            1 | tag_id      | A         |       52102 |     NULL | NULL   | YES  | BTREE      |         |
| item_tags |          1 | index_items_tags |            2 | item_id     | A         |     3803470 |     NULL | NULL   | YES  | BTREE      |         |
| item_tags |          1 | index_tags       |            1 | tag_id      | A         |       52102 |     NULL | NULL   | YES  | BTREE      |         |
+-----------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

The query takes from 1-3 seconds consistently in the codeigniter profiler and ties up system resources while processing. It can really bring our server to a crawl especially with high traffic! Can anyone help? Thanks!
#2

[eluser]frist44[/eluser]
Have you tried removing the second layer of grouping and doing the counting in PHP with a foreach or something? I can't imagine the first part (the DISTINCT ... WHERE IN) should take that long. I do a similar thing for searching keywords with 10-12 other WHERE clauses and it's around 0.01 sec for the query.
#3

[eluser]bretticus[/eluser]
Have you tried converting your tables to use InnoDB?

Quote:InnoDB has been designed for maximum performance when processing large data volumes. Its CPU efficiency is probably not matched by any other disk-based relational database engine.

The InnoDB storage engine maintains its own buffer pool for caching data and indexes in main memory. InnoDB stores its tables and indexes in a tablespace, which may consist of several files (or raw disk partitions). This is different from, for example, MyISAM tables where each table is stored using separate files. InnoDB tables can be very large even on operating systems where file size is limited to 2GB.

It is significantly slower for writes, but you have a shopping cart, not Twitter. :) Anyway, food for thought.
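For reference, the conversion itself is a one-liner (table name taken from the thread above; try it on a copy of the table first):

```sql
-- Convert the join table from MyISAM to InnoDB. On a ~3.8M-row table
-- this rebuilds the table and its indexes, so expect it to take a while.
ALTER TABLE item_tags ENGINE=InnoDB;

-- Verify which engine the table is using afterwards:
SHOW TABLE STATUS LIKE 'item_tags';
```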
#4

[eluser]danoph[/eluser]
Thanks for the suggestion bretticus. After converting the item_tags table to InnoDB, the query went from 1-3 seconds to over 15. If InnoDB is faster for large tables, I am guessing I will need to do some tweaking to the database or something, because it is ridiculously slow now!

There aren't a lot of good resources on the internet that talk about this stuff once sites and databases start getting a lot larger. The best places are sites like these forums where you guys can help. Thanks for that.
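One thing worth checking before giving up on InnoDB: a common culprit for exactly this kind of post-conversion slowdown is the buffer pool size, which historically defaults to a tiny value. A quick sanity check (the 2G figure in the comment is purely illustrative, not a sizing recommendation for your hardware):

```sql
-- InnoDB caches data AND indexes in its buffer pool; the historical
-- default is only 8MB, which is hopeless for a multi-million-row table.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- If the value is small, raise it in my.cnf under [mysqld], e.g.
--   innodb_buffer_pool_size = 2G
-- (sized to your own RAM), then restart mysqld and re-test the query.
```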
#5

[eluser]danoph[/eluser]
@frist44,

There are 3.8 million records in the item_tags table. I wouldn't think counting and sorting them in PHP would scale any better than having MySQL do it. PHP would have to hold anywhere from a few thousand to a million records in memory before sorting them if the tags are really popular...
#6

[eluser]fesweb[/eluser]
Just found this line on a MySQL optimizing page:
"Avoid using IN(...) when selecting on indexed fields, It will kill the performance of SELECT query."

Have you tested it by just doing lots of ORs? (tag_id = 12345 OR tag_id = 23456 OR ...)

I have no idea if it's faster, but you can find out easily enough.

So that's one idea. Another is that you should consider some type of page caching if you are querying 3+ million records for 50 related tags on every product page. That's a lot of relationships no matter how you write the query, so it's no surprise that it's slow.

Which brings me to suggestion 3: Is there a way to determine the most important/relevant tags BEFORE you do the query? Of those 50 tags applied to a product, some of them are bound to be more important than others...
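To make the caching idea concrete, here's one way it could look (table and column names are hypothetical, and the refresh would run from a cron job or on product update rather than on page load):

```sql
-- Precompute the top related items per item, so the product page
-- never touches the 3.8M-row join table directly.
CREATE TABLE related_items (
  item_id    INT NOT NULL,
  related_id INT NOT NULL,
  matches    INT NOT NULL,
  PRIMARY KEY (item_id, related_id)
);

-- Refresh the cache for one item (125400 is the example item from
-- the original query). REPLACE needs the primary key above.
REPLACE INTO related_items (item_id, related_id, matches)
SELECT 125400, it.item_id, COUNT(*) AS matches
FROM item_tags it
WHERE it.tag_id IN (SELECT tag_id FROM item_tags WHERE item_id = 125400)
  AND it.item_id != 125400
GROUP BY it.item_id
ORDER BY matches DESC
LIMIT 10;

-- The product page then becomes a trivial indexed lookup:
SELECT related_id
FROM related_items
WHERE item_id = 125400
ORDER BY matches DESC;
```

The cache is only as fresh as the last refresh, but for "similar items" that is usually an acceptable trade against running the expensive aggregation on every page view.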
#7

[eluser]danoph[/eluser]
Hi @fesweb,

Thanks for the suggestions. What page did you see that suggestion about using IN? I would think the indexes do the total opposite and speed up SELECT queries, since it is selecting IDs that are indexed using a B-tree. There isn't any manipulation of data going on in the first part of the query, which tells me it's probably not the slowest part. After examining the query in MySQL, the slowest part by far is the filesort, which I believe kicks in when the initial results are grouped and then sorted. I don't really know how to avoid that part.

I also don't want to use caching if it's going to serve stale data. I thought about some kind of caching, but so many products are added every day that the cache would become outdated really quickly, and re-querying the database to keep it fresh brings up this same problem. And I wish I could determine the relevance of item_tags before querying them, but the only way I can think of is by adding some kind of weight to the items, which I wouldn't want to do either because it wouldn't provide the most accurate results.
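In case it helps anyone following along, EXPLAIN shows whether the existing index_items_tags (tag_id, item_id) composite index is covering the query (tag list shortened here purely for illustration):

```sql
-- In the EXPLAIN output, "Using index" in the Extra column means the
-- WHERE and GROUP BY are satisfied from the (tag_id, item_id) index
-- alone, with no table lookups; "Using temporary; Using filesort"
-- confirms the grouping/sorting cost discussed above.
EXPLAIN
SELECT item_id, COUNT(*) AS matches
FROM item_tags
WHERE tag_id IN (11196, 18447, 10479)  -- shortened tag list
  AND item_id != 125400
GROUP BY item_id
ORDER BY matches DESC, item_id DESC
LIMIT 10;
```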

I've been doing a few days worth of digging because there isn't much documentation on this subject, but there are tons of websites that have working implementations like amazon, ebay, itunes, etc...I don't get it. Any ecommerce store usually has "related" or similar items. You'd think someone would have already figured this out.
#8

[eluser]danoph[/eluser]
The slow part is actually the copying to the tmp table... I would assume that's because the result set is so large?

Code:
mysql> show profile for query 3;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000118 |
| Opening tables       | 0.000012 |
| System lock          | 0.000007 |
| Table lock           | 0.000008 |
| init                 | 0.000039 |
| optimizing           | 0.000015 |
| statistics           | 0.000013 |
| preparing            | 0.000014 |
| Creating tmp table   | 0.000030 |
| executing            | 0.000004 |
| Copying to tmp table | 1.057663 |
| Sorting result       | 0.023885 |
| Sending data         | 0.000123 |
| end                  | 0.000005 |
| removing tmp table   | 0.000616 |
| end                  | 0.000005 |
| end                  | 0.000005 |
| query end            | 0.000004 |
| freeing items        | 0.000021 |
| closing tables       | 0.000008 |
| logging slow query   | 0.000004 |
| cleaning up          | 0.000005 |
+----------------------+----------+
22 rows in set (0.00 sec)
#9

[eluser]bretticus[/eluser]
Yeah, I think Amazon throws more machinery at it. They are probably not using PHP or MySQL either, neither of which seems to scale all that well. 3.8 million records is A LOT to query against. I know there are some other database engines that automatically optimize large tables (like Mongo), but I have nothing noteworthy to share about your problem tonight. I have *heard* Postgres handles large tables better than MySQL. Poor suggestion, though.

Good luck!
#10

[eluser]fesweb[/eluser]
danoph,
I found that "don't use IN" bit here. I should have also mentioned this line: "Don't use DISTINCT when you have or could use GROUP BY".

Sorry I can't be more help. In the long run, I think you'll have to end up doing some sort of caching.
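For what it's worth, applying that second rule to the original query gives something like this (logically equivalent, since GROUP BY item_id already yields one row per item, so the DISTINCT was redundant):

```sql
-- DISTINCT removed: GROUP BY item_id already returns each item once.
-- Numeric literals avoid comparing an indexed int column to strings.
SELECT item_id, COUNT(*) AS matches
FROM item_tags
WHERE tag_id IN (11196, 18447, 10479 /* , ... rest of the tag list */)
  AND item_id != 125400
GROUP BY item_id
ORDER BY matches DESC, item_id DESC
LIMIT 10;
```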




Theme © iAndrew 2016 - Forum software by © MyBB