Advanced MySQL Question - Retrieving Similar Items Based on Tags
#11

[eluser]danmontgomery[/eluser]
InnoDB uses row-level locking, as opposed to the table-level locking that MyISAM uses. The only time you'll see a speed improvement by switching to InnoDB is when the number of UPDATE/INSERT/DELETE operations far outnumbers the number of SELECT operations. I have to disagree that 3.8 million records is a lot of data... It's more than a little data, and probably a lot for a vanilla MySQL configuration, but a standard MyISAM table can hold over 4 billion rows; the system was designed to handle far more than 3 million.

In my experience, storing item/tag relationships in a pivot table is the slowest way of retrieving that data. As long as you're using MyISAM, you might as well take advantage of FULLTEXT indexes with MATCH ... AGAINST.
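Something along these lines, assuming a denormalized `tags` column holding the tag words on the items table (table and column names are made up for illustration):

```sql
-- MyISAM only (in MySQL 5.0/5.1); note the default ft_min_word_len of 4
-- and the stopword list will silently drop short/common tags.
ALTER TABLE items ADD FULLTEXT INDEX ft_tags (tags);

-- Items whose tag list overlaps a given item's tags, best matches first:
SELECT id, title,
       MATCH (tags) AGAINST ('guitar acoustic vintage') AS relevance
FROM items
WHERE MATCH (tags) AGAINST ('guitar acoustic vintage')
ORDER BY relevance DESC
LIMIT 20;
```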

Like fesweb said, you really should be doing some caching as well. If you're looking for a faster database, you might consider Cassandra... And I think this sums things up nicely.
#12

[eluser]danoph[/eluser]
As you guys said, MyISAM and MySQL are built to handle a lot more than 3 million records. And every PHP/MySQL/Ruby on Rails tutorial, book, blog article, and class teaches using three tables for polymorphic associations (items, tags, and taggings, aka item_tags here). I've never seen it done another way, and you always end up with redundant data if you don't use three tables. Plus, I would think using integers would be the fastest way to approach the problem rather than using full text, since the integers can be indexed, for one, and the data is a lot smaller. I'm no expert, though, and I can't find any books or resources on the internet about this. Really frustrating!
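For clarity, the three-table layout I'm talking about looks roughly like this (names are just for illustration), with the "similar items" question answered by counting shared tags through the pivot table:

```sql
CREATE TABLE items    (id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
                       title VARCHAR(255) NOT NULL) ENGINE=MyISAM;
CREATE TABLE tags     (id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
                       name VARCHAR(64) NOT NULL,
                       UNIQUE KEY uniq_name (name)) ENGINE=MyISAM;
CREATE TABLE taggings (item_id INT UNSIGNED NOT NULL,
                       tag_id  INT UNSIGNED NOT NULL,
                       PRIMARY KEY (item_id, tag_id),
                       KEY idx_tag_item (tag_id, item_id)) ENGINE=MyISAM;

-- Items sharing the most tags with item 42:
SELECT t2.item_id, COUNT(*) AS shared_tags
FROM taggings t1
JOIN taggings t2 ON t2.tag_id = t1.tag_id
                AND t2.item_id <> t1.item_id
WHERE t1.item_id = 42
GROUP BY t2.item_id
ORDER BY shared_tags DESC
LIMIT 20;
```

The covering index (tag_id, item_id) matters here: the self-join can then be resolved from the index alone, but the GROUP BY ... ORDER BY on a computed count is what forces the temporary table.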

While experimenting with the table indexes and MySQL server settings, I was able to get the query I was using from 1.3 seconds down to around 0.4, but I think there is still room for improvement. As the site keeps growing, that number will just keep getting bigger. I am trying to brainstorm a way to use PHP to help out, or my Sphinx search engine instead, since its queries are super fast... This is a huge challenge, and I am still baffled there aren't sites on the internet that have solved this problem already!

Hmm... Facebook and Digg use Cassandra? I may have to look into that... I know Facebook uses polymorphic associations for "recommending" friends to you. Maybe that would be a good route to go. Mongo looks interesting also. I'll have to spend a few hours looking at these now :) Thanks!
#13

[eluser]danoph[/eluser]
Oh, and from my previous post: there's only ONE slow part of the query, where it copies to the temporary table, so that's why it seems so easily fixable to me... If I could figure out how to make that part faster, the whole query would be really fast!
#14

[eluser]danmontgomery[/eluser]
[quote author="danoph" date="1269393448"]and you always end up with redundant data if you don't use three tables[/quote]
Not true...

[quote author="danoph" date="1269393448"]Plus, I would think using integers would be the fastest route to approach the problem rather than using full text, since the integers can be indexed for one, and the data is a lot smaller?[/quote]

http://dev.mysql.com/doc/refman/5.0/en/f...earch.html
#15

[eluser]danoph[/eluser]
@noctrum,

Without going off-topic too much, how would you use 2 tables for items and tags without duplicating the tag string every time?

Also, I am looking more into Cassandra; that may be a route I want to take since Facebook, Digg, and Twitter are using it. Thanks for sending that.
#16

[eluser]danmontgomery[/eluser]
You would be duplicating the tag string for every record with that tag... That's not really the same as "duplicate data". Duplicate data would be storing the same tag twice for one item, or storing the same item twice identically (or, even worse, not identically).
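A two-table layout along those lines might look like this (hypothetical names): the tag string is repeated once per item/tag pair, but no row is actually duplicated, and the compound primary key enforces that.

```sql
CREATE TABLE items (
    id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL
) ENGINE=MyISAM;

-- Tag strings live directly in the pivot row; no separate tags table,
-- so joins resolve on the string instead of an integer surrogate key.
CREATE TABLE item_tags (
    item_id INT UNSIGNED NOT NULL,
    tag     VARCHAR(64)  NOT NULL,
    PRIMARY KEY (item_id, tag),   -- forbids true duplicates
    KEY idx_tag (tag)             -- lookups by tag
) ENGINE=MyISAM;
```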

And just a note, the data is most definitely smaller, but that doesn't necessarily make for faster queries.
#17

[eluser]danoph[/eluser]
OK, I see what you mean... I was thinking that tags could get really long, and it would be bad to have long strings duplicated, since that is literally more data than an integer...

Back to the topic though, is there any way of speeding up the "copying to temporary table" part of the query?
#18

[eluser]danmontgomery[/eluser]
Not that I'm aware of
#19

[eluser]Chad Fulton[/eluser]
Here are some thoughts:

1. Caching. You should do it. What you're talking about is a data-mining task and probably shouldn't be performed on-the-fly. I highly recommend thinking about this before trying anything else.

2. Copying to temporary table takes a while because with large temporary tables, the default settings cause writes to the hard disk. You can make sure that the temporary table stays in memory, however (if you have enough memory, that is):
a. Increase the "key_buffer_size" server variable, which should help. Take a look here for more information on this variable as well as how you can test your system to tune the value to something appropriate for your situation.
b. Increase the values of the "tmp_table_size" and "max_heap_table_size" (you need to increase them both to see an effect), e.g. as described here. Be sure here, also, to read the information about how to test your system to tune the value to an appropriate one.

3. If #2 doesn't work because of memory constraints:
a. Buy more memory.
b. Buy a faster hard drive (solid state?)
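Along the lines of point 2, the variables can be tried at runtime before committing anything to my.cnf (the sizes here are purely illustrative; tune them to your own machine and workload):

```sql
-- GLOBAL-only: MyISAM index key buffer.
SET GLOBAL key_buffer_size = 256 * 1024 * 1024;

-- Both must be raised together: MySQL uses the SMALLER of the two
-- as the ceiling for in-memory temporary tables.
SET SESSION tmp_table_size      = 128 * 1024 * 1024;
SET SESSION max_heap_table_size = 128 * 1024 * 1024;

-- Check whether temp tables are still spilling to disk: compare
-- Created_tmp_disk_tables against Created_tmp_tables before and after.
SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';
```

Note that a temporary table containing TEXT/BLOB columns goes to disk regardless of these settings, so keeping long columns out of the SELECT list can matter as much as the buffer sizes.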
#20

[eluser]danoph[/eluser]
Hi Chad,

Thanks for the suggestions! To your points:

1. I agree that caching would be a good route, but only if I could update the cached data every day, since we add thousands of new "products" each day. Data older than a day would probably be inaccurate, so it would have to refresh every 24 hours. I just did a quick calculation: if we have a database of 130,000 products that is growing by a couple thousand every day, and each query takes about a second to perform, that's 130,000 seconds of query time; 130,000 / 60 ≈ 2,167 minutes, and 2,167 / 60 ≈ 36 hours. So the queries would have to be constantly running 24/7 to have somewhat accurate data. It seems like I'm back at square one after that: how do I make the query faster?

2. I think this will be a good route for me to take. I've read in various places that it should help, but I've also read that it can have the opposite effect... I will let you know after I try tweaking the MySQL settings.

3. We are on a maxed-out mediatemple (dv) server, and the next step up is a $750/mo Nitro server. I don't know if I'm ready for one of those yet!



