Welcome Guest, Not a member yet? Register   Sign In
Handeling huge join queries
#1

[eluser]Lockzi[/eluser]
Hi,

I'm trying to do a huuuge relation query.

What I'm trying to achieve is something like this:

Code:
$query = $this->db->select('*')
            ->from('applications')
            ->join('applications_categories',    'applications.application_id=applications_categories.application_id', 'left')
            ->join('categories',            'applications_categories.category_id=categories.category_id', 'left')
            ->join('applications_keys',        'applications.application_id=applications_keys.application_id', 'left')
            ->join('keys',                'applications_keys.key_id=keys.key_id', 'left')
            ->join('applications_installations',    'applications.application_id=applications_installations.application_id', 'left')
            ->join('installations',            'installations.installation_id=applications_installations.installation_id', 'left')
            ->join('installations_options',        'installations.installation_id=installations_options.installation_id', 'left')
            ->join('options',            'installations_options.option_id=options.option_id', 'left')
            ->join('applications_downloadsites',    'applications.application_id=applications_downloadsites.application_id', 'left')
            ->join('downloadsites',            'applications_downloadsites.downloadsite_id=downloadsites.downloadsite_id', 'left')
            ->join('downloadsites_scraplocators',    'downloadsites.downloadsite_id=downloadsites_scraplocators.downloadsite_id', 'left')
            ->join('scraplocators',            'downloadsites_scraplocators.scraplocator_id=scraplocators.scraplocator_id', 'left')
            ->get();

The problem with the query is it is, is that I get multiple rows whenever there's a one-to-many relation in the JOIN statement.

To illustrate:
An application can have many keys.

To store that, the relation is handled by application_keys table.
application_keys_id application_id key_id
------------------- -------------- ------
1 1 1
2 1 2

This table indicates that application_id 1, has two different keys.
Classic one to many relation so to speak.

Onto the problem.


The current query would now show the first application two times, it doesn't "group" these together.

I would like the output to be:
Code:
Application id: 1
Key_id: 1
Key_id: 2

but it currently generates:

Code:
Application id: 1
Key_id: 1


Application id: 1
Key_id: 2

The problem is greatly simplified since the massive query has the same problem all over, on some instances I get 18 records of the same application because of different tables relation.


One way to solve this could be something like the following:

Code:
$query = $this->db->get('applications');

foreach($query->result_array() as $application)
{            
            $application['applications_keys'] = $this->db->get_where('applications_keys', array('application_id' => $application['application_id']))->result_array();
            foreach($application['applications_keys'] as $app_key)
            {
                        $application['keys'] += $this->db->get_where('keys', array('key_id' => $app_key['key_id']))->result_array();
            }

What would be the best way for me to solve this?

I want one record of application to include all the data, not multiple rows.
You could use different join type like inner outer left right up down forward backward, but some applications might be missing different data and therefore not show up at all.

Please let me know if you don't understand, I basically want each result to become an array or object array if it matches more results.

Thanks,

Lockzi
#2

[eluser]Lockzi[/eluser]
Finally found a post indicating my exact problem.
http://ellislab.com/forums/viewthread/95402/


There's no definitive solution to it though...
Anyone else who can pitch in?
#3

[eluser]Lockzi[/eluser]
All right...

I've ended up going about it like this.

Clearly inefficient code, but what can you do?

Code: Pastebin (lol... even to big to post!)

It outputs the following for one row:

Code:
array(1) {
  [1]=>
  array(5) {
    ["applications"]=>
    array(5) {
      ["application_id"]=>
      string(1) "1"
      ["application_description"]=>
      string(359) "Mozilla Firefox is a free and open source web browser descended from the Mozilla Application Suite. As of June 2010, Firefox has 23.81% of the recorded usage share of web browsers, making it the 2nd most popular browser in terms of current use worldwide and the most popular browser independent of any one operating system. Thousands of add-ons are available!"
      ["application_title"]=>
      string(15) "Mozilla Firefox"
      ["application_searchableName"]=>
      string(15) "mozilla firefox"
      ["application_versionfinder"]=>
      string(0) ""
    }
    ["categories"]=>
    array(2) {
      [1]=>
      array(4) {
        ["applications_categories_id"]=>
        string(1) "3"
        ["application_id"]=>
        string(1) "1"
        ["category_id"]=>
        string(1) "1"
        ["category_title"]=>
        string(13) "WWW, Internet"
      }
      [3]=>
      array(4) {
        ["applications_categories_id"]=>
        string(1) "5"
        ["application_id"]=>
        string(1) "1"
        ["category_id"]=>
        string(1) "3"
        ["category_title"]=>
        string(8) "Browsers"
      }
    }
    ["keys"]=>
    array(1) {
      [7]=>
      array(4) {
        ["applications_keys_id"]=>
        string(1) "3"
        ["application_id"]=>
        string(1) "1"
        ["key_id"]=>
        string(1) "7"
        ["key_UUID"]=>
        string(15) "MOZILLA FIREFOX"
      }
    }
    ["installations"]=>
    array(1) {
      [1]=>
      array(16) {
        ["applications_installations_id"]=>
        string(1) "3"
        ["application_id"]=>
        string(1) "1"
        ["installation_id"]=>
        string(1) "1"
        ["installation_package"]=>
        string(0) ""
        ["installation_switches"]=>
        string(0) ""
        ["installation_isbeta"]=>
        string(1) "0"
        ["installation_blockingApps"]=>
        string(0) ""
        ["installation_killBeforeLaunch"]=>
        string(0) ""
        ["installation_runBeforeInstaller"]=>
        string(0) ""
        ["installation_betaNewsRssRegExp"]=>
        string(8) "Firefox*"
        ["installation_requiresReboot"]=>
        string(1) "0"
        ["installation_url"]=>
        string(0) ""
        ["installation_is64bit"]=>
        string(1) "0"
        ["installation_version"]=>
        string(6) "3.6.10"
        ["installation_filename"]=>
        string(0) ""
        ["options"]=>
        array(3) {
          [1]=>
          array(5) {
            ["installations_options_id"]=>
            string(2) "38"
            ["installation_id"]=>
            string(1) "1"
            ["option_id"]=>
            string(1) "1"
            ["option_title"]=>
            string(20) "Disable desktop icon"
            ["option_addSwitch"]=>
            string(14) "/nodesktopicon"
          }
          [2]=>
          array(5) {
            ["installations_options_id"]=>
            string(2) "39"
            ["installation_id"]=>
            string(1) "1"
            ["option_id"]=>
            string(1) "2"
            ["option_title"]=>
            string(4) "Test"
            ["option_addSwitch"]=>
            string(4) "test"
          }
          [3]=>
          array(5) {
            ["installations_options_id"]=>
            string(2) "40"
            ["installation_id"]=>
            string(1) "1"
            ["option_id"]=>
            string(1) "3"
            ["option_title"]=>
            string(5) "test2"
            ["option_addSwitch"]=>
            string(5) "test2"
          }
        }
      }
    }
    ["downloadsites"]=>
    array(1) {
      [1]=>
      array(5) {
        ["applications_downloadsites_id"]=>
        string(1) "1"
        ["application_id"]=>
        string(1) "1"
        ["downloadsite_id"]=>
        string(1) "1"
        ["downloadsite_url"]=>
        string(51) "http://www.mozilla.com/en-US/products/download.html"
        ["scraplocators"]=>
        array(1) {
          [1]=>
          array(5) {
            ["downloadsites_scraplocators_id"]=>
            string(1) "1"
            ["downloadsite_id"]=>
            string(1) "1"
            ["scraplocator_id"]=>
            string(1) "1"
            ["scraplocator_linkLocator"]=>
            string(60) "/html/body/div[2]/div/div[3]/noscript/div[2]/ul/li/a/span/em"
            ["scraplocator_versionLocator"]=>
            string(59) "/html/body/div[2]/div/div[3]/noscript/div[2]/ul/li/a[@href]"
          }
        }
      }
    }
  }
}

If anyone ever finds a better way to do this... Please let me know.
#4

[eluser]Lockzi[/eluser]
The next question that has gotten raised is...

How do you handle the update/insertion of such big queries?
One model, multiple models?

Someone must have an idea!
#5

[eluser]tonanbarbarian[/eluser]
first question: do you really need to get every single field from each table in the join?
removing fields that contain data you do not need would simplify things a bit.

secondly if you try to use a distinct in the query that may eliminate any rows that are exact duplicates

really need to know how much data is in each of the join tables and what it is all used for before you can come up with the best way to do things but there are some tips

It is usually better to get the data with 1 query rather than many if you can
However if your main table has lots of records (thousands) and your join tables contain very few records (hundreds or less) and always will contain few records then you can look at just grabbing the data from the main table and using queries to look up the extra information needed, such as the example linked in above.

However this method of running queries inside a loop has issues as well. The example has 6 queries inside a loop.
If the main loop is retrieving hundreds or thousands of rows then you will have 6 extra queries being run from the database for each row. This can result in hundreds of thousands of queries being run.
If you can limit the number of records you run in the main loop you can reduce the number of separate queries

Or you can prefetch join data.
So you run 6 queries before the loop and retrieve the data that would normally be retrieved in each of the 6 queries inside the loop. However these prefetches grab all of the data from the join tables rather than just those relating to a selected record. You then store this join data in their own separate array structures indexed by the values you would use inside the loop to run the queries.
Then inside the loop you lookup the values in the arrays rather than running separate queries.
This uses more memory but can reduce the number of queries from hundreds or thousands to just 7 in this case.

Ultimately you have to weight up all of the factors to determine whether to run a single joined query, a prefetched process, or the loop query process. They all work but each works best in different situations.

Always try to look ahead and estimate how much data could be in the live system in 1 year or even 5 years and let that guide you in the best way to do things.
#6

[eluser]Lockzi[/eluser]
Thank you for your response!

Here's the structure of the database.
Code:
SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO";




CREATE TABLE applications (
  application_id int(11) unsigned NOT NULL auto_increment,
  application_description longtext collate utf8_bin,
  application_title varchar(255) collate utf8_bin NOT NULL,
  application_searchableName varchar(255) collate utf8_bin default NULL,
  application_versionfinder blob,
  PRIMARY KEY  (application_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;



CREATE TABLE applications_categories (
  applications_categories_id int(11) unsigned NOT NULL auto_increment,
  application_id int(11) unsigned NOT NULL,
  category_id int(11) unsigned NOT NULL,
  PRIMARY KEY  (applications_categories_id),
  UNIQUE KEY `No duplicates` (application_id,category_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;



CREATE TABLE applications_downloadsites (
  applications_downloadsites_id int(11) NOT NULL auto_increment,
  application_id int(11) NOT NULL,
  downloadsite_id int(11) NOT NULL,
  PRIMARY KEY  (applications_downloadsites_id),
  UNIQUE KEY `No duplicates` (application_id,downloadsite_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;



CREATE TABLE applications_installations (
  applications_installations_id int(11) unsigned NOT NULL auto_increment,
  application_id int(11) unsigned NOT NULL,
  installation_id int(11) unsigned NOT NULL,
  PRIMARY KEY  (applications_installations_id),
  UNIQUE KEY `No duplicates` (application_id,installation_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;



CREATE TABLE applications_keys (
  applications_keys_id int(11) NOT NULL auto_increment,
  application_id int(11) NOT NULL,
  key_id int(11) NOT NULL,
  PRIMARY KEY  (applications_keys_id),
  UNIQUE KEY `No duplicates` (application_id,key_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;



CREATE TABLE categories (
  category_id int(11) unsigned NOT NULL auto_increment,
  category_title varchar(255) collate utf8_bin NOT NULL,
  PRIMARY KEY  (category_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;



CREATE TABLE downloadsites (
  downloadsite_id int(11) NOT NULL auto_increment,
  downloadsite_url varchar(255) collate utf8_bin NOT NULL,
  PRIMARY KEY  (downloadsite_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;



CREATE TABLE downloadsites_scraplocators (
  downloadsites_scraplocators_id int(11) NOT NULL auto_increment,
  downloadsite_id int(11) NOT NULL,
  scraplocator_id int(11) NOT NULL,
  PRIMARY KEY  (downloadsites_scraplocators_id),
  UNIQUE KEY `No duplicates` (downloadsite_id,scraplocator_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;



CREATE TABLE installations (
  installation_id int(11) unsigned NOT NULL auto_increment,
  installation_package varchar(70) collate utf8_bin default NULL,
  installation_switches text collate utf8_bin,
  installation_isbeta tinyint(1) default NULL,
  installation_blockingApps mediumtext collate utf8_bin,
  installation_killBeforeLaunch mediumtext collate utf8_bin,
  installation_runBeforeInstaller mediumtext collate utf8_bin,
  installation_betaNewsRssRegExp varchar(255) collate utf8_bin default NULL,
  installation_requiresReboot tinyint(1) NOT NULL default '0',
  installation_url text collate utf8_bin NOT NULL,
  installation_is64bit tinyint(1) NOT NULL default '0',
  installation_version varchar(255) collate utf8_bin default NULL,
  installation_filename varchar(255) collate utf8_bin default NULL,
  PRIMARY KEY  (installation_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;



CREATE TABLE installations_options (
  installations_options_id int(11) unsigned NOT NULL auto_increment,
  installation_id int(11) unsigned NOT NULL,
  option_id int(11) unsigned NOT NULL,
  PRIMARY KEY  (installations_options_id),
  UNIQUE KEY `No duplicates` (installation_id,option_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;



CREATE TABLE `keys` (
  key_id int(11) NOT NULL auto_increment,
  key_UUID varchar(255) collate utf8_bin NOT NULL,
  PRIMARY KEY  (key_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;



CREATE TABLE options (
  option_id int(11) NOT NULL auto_increment,
  option_title varchar(255) collate utf8_bin NOT NULL,
  option_addSwitch varchar(255) collate utf8_bin NOT NULL,
  PRIMARY KEY  (option_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;



CREATE TABLE scraplocators (
  scraplocator_id int(11) NOT NULL auto_increment,
  scraplocator_linkLocator text collate utf8_bin NOT NULL,
  scraplocator_versionLocator text collate utf8_bin NOT NULL,
  PRIMARY KEY  (scraplocator_id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin;


Basically everything inherits application.
Unfortunately everything needs to get pulled while editing.

There may not be any downloadsites for an application (for instance).
If JOIN was used, the whole row would be ignored.

There may be multiple downloadsites for an application.
If LEFT JOIN was used, there would be duplicate results (rows) returned for an individual application.

I might have them mixed up, but this was the reason for why I resulted to my solution.
Did I pick the wrong one? Because I can easily within a couple of months see that the application table consists of > 100 applications.
#7

[eluser]tonanbarbarian[/eluser]
so what you are saying is that in some cases downloadsites have 0 records and in others it has multiple
and that using a left join when there is multiple results in duplicate results?

do you mean the entire row is an exact duplicate? (if so then use distinct in the query)
or do you mean that the application fields data is duplicated?
if so then that is to be expected in this case
why not just detect that in the loop, i.e. record the application_id and check in the loop if you have processed that application_id before. You can still process the data from the other tables in the loop
#8

[eluser]Lockzi[/eluser]
Well, the code posted on postbin is the code I'm using now, and it indeed works flawlessly.

Except for the fact that it has 6 nested foreach loops, which when a lot of data comes along is gonna be inefficient.

The problem with using join was that:

If "Microsoft Office" doesn't have a downloadsite associated, if using JOIN, the whole result row (as in application, downloadsite, scraplocators, keys, categories everything) would be ignored.
It kind of is like when using JOIN, you are saying "if this then join AND if this then join AND if this then join" when using JOIN. If any of the join statements fail, the whole row is neglected.
If an application has all the fields the row is accepted.

Now, if in the other hand... You would use LEFT JOIN, and "Microsoft Office" has 2 downloadsites, the result would look something like:
Code:
["application_id"]=>
      string(1) "123"
      ["application_description"]=>
      string(359) "Microsoft office program"
      ["application_title"]=>
      string(15) "Mozilla Firefox"
      ["application_searchableName"]=>
      string(15) "mozilla firefox"
      ["application_versionfinder"]=>
      string(0) ""


        ["applications_downloadsites_id"]=>
        string(1) "1"
        ["downloadsite_id"]=>
        string(1) "1"
        ["downloadsite_url"]=>
        string(51) "http://www.microsoft.com"

And the following row, as in $row = $this->db->row_array() $row->next_row();

Code:
["applications_downloadsites_id"]=>
        string(1) "3"
        ["downloadsite_id"]=>
        string(1) "8"
        ["downloadsite_url"]=>
        string(51) "http://www.download.com"

If the Application-Has-Multiple downloadsites, the first row will be complete, the second empty except for the additional downloadsite. Presenting such a result is no fun. The ideal would be if MySQL where able to return a result as a nested array, like what I got with my 6-7 foreach loops. Maybe I'm just coming about this all wrong? This amount of data scares me a bit.




Theme © iAndrew 2016 - Forum software by © MyBB