[eluser]codex[/eluser]
I need to import very large CSV files into the DB. Instead of importing the whole thing at once I first chop it into smaller files (1 MB each) and put all those files into their own dir. Using 'for' I then loop over the files and import them one by one. But every time I get a 'Fatal error: Out of memory (allocated ...)' error. Echoing memory_get_usage() shows that the memory increases after each insert, even though I'm unsetting the variables.
I thought that when the loop starts over, memory is freed. This isn't the case. How would you do a large insert like that, but without the memory increasing? I hope what I'm saying makes sense.
This is a small piece of the code. Can you spot anything wrong with it?
Code: for ($i = 0, $count = count($dir); $i < $count; $i++)
{
    // Open the file
    $handle = fopen(APPPATH.'feeds_csv/temp/'. $ar .'/'. ($i + 1) .'.csv', "r") or die ("<fieldset>couldn't open file</fieldset>");
    $arrays = array();

    // While loop with fgetcsv sorts the csv into the data array
    while (($data = fgetcsv($handle, 3000, ";")) !== FALSE)
    {
        $arrays[] = $data;
    }
    unset($data);
    fclose($handle);

    // Remap array keys
    if ($i == 1)
    {
        foreach ($arrays[0] as $key => $val) // 0 => url, 1 => title
        {
            $keys[$val] = $key; // ['url'] => 0
        }
        foreach ($par as $key => $val) // prodId => offerid
        {
            if (in_array($val, $arrays[0]))
            {
                $dez[$key] = $keys[$val];
            }
        }
    }

    $row = 0;
    foreach ($arrays as $a)
    {
        if ($row != 0)
        {
            $pVendor    = $res['merchant_name_urlsafe'];
            $pCategory  = '';
            $productKey = md5($res['myFeedKey'] . $a[$dez['pId']]);
            $feedKey    = $res['myFeedKey'];

            $sql = 'INSERT IGNORE INTO products_temp
                    (productKey, feedKey)
                    VALUES ("'. $productKey .'","'. $feedKey .'")';
            $this->CI->db->query($sql);

            $sql    = NULL;
            $handle = NULL;
            $arrays = NULL;
            $a      = NULL;
        }
        ++$row;
        echo $row ,' - ', memory_get_usage() ,'<br />';
    }
}
[eluser]TheFuzzy0ne[/eluser]
One thing that might be causing the issue is the stack. I'm sure it shouldn't make much difference, but it's certainly something to consider.
Rather than load those rows from the file first, can you not create the SQL query on-the-fly as you read each line from the CSV file?
I'd also suggest writing the SQL query to a file one line at a time instead of storing it in memory, and then loading it when you're done. That should keep memory usage to a minimum. In theory, if you can do both of these, you should be able to parse quite large files, and only need to worry about the allowed execution time.
Hope this helps.
[eluser]codex[/eluser]
Quote:One thing that might be causing the issue is the stack. I'm sure it shouldn't make much difference, but it's certainly something to consider.
I'm gonna sound very noobish with this, but: Stack?
Quote:Rather than load those rows from the file first, can you not create the SQL query on-the-fly as you read each line from the CSV file?
You will have to explain this procedure.
Quote:I'd also suggest writing the SQL query to a file one line at a time instead of storing it in memory, and then loading it when you're done. That should keep memory usage to a minimum. In theory, if you can do both of these, you should be able to parse quite large files, and only need to worry about the allowed execution time.
Small example maybe?
Quote:Hope this helps.
It may ;-)
[eluser]TheFuzzy0ne[/eluser]
OK, I'm going off the top of my head here, so don't expect the code to work, I'm simply trying to illustrate a concept.
Code: <?php
# Open the input file (filenames here are just placeholders)
$in_file = fopen('somefile.csv', 'r');
# Open the output file
$out_file = fopen('somefile.sql', 'a');

if ($in_file && $out_file)
{
    fwrite($out_file, "INSERT IGNORE INTO products_temp (productKey, feedKey) VALUES\n");

    # From here until the start of the loop is just an ugly hack so that we can
    # insert the commas into the file, as MySQL will choke if the query ends with a comma.
    $csv_row = fgetcsv($in_file, 3000, ";");
    # I'm not entirely sure where this data comes from, so adjust the indexes
    # (and escape the values properly) for your own CSV layout...
    $product_key = $csv_row[0];
    $feed_key = md5($csv_row[1]);
    fwrite($out_file, "('". $product_key . "','" . $feed_key ."')\n");

    while (($csv_row = fgetcsv($in_file, 3000, ";")) !== FALSE)
    {
        $product_key = $csv_row[0];
        $feed_key = md5($csv_row[1]);
        # Prepend a comma
        fwrite($out_file, ",\n");
        # Write the insert data to the file.
        fwrite($out_file, "('". $product_key . "','" . $feed_key ."')\n");
    }

    fwrite($out_file, ";");
    fclose($in_file);
    fclose($out_file);
}
# Ta-da! Thank you very much, I'm here all night...
I don't fully understand how you're processing your data, but I'm sure that with a little work, and a lot of shouting, you can get this to work in your set-up.
[eluser]codex[/eluser]
[quote author="TheFuzzy0ne" date="1239049255"]OK, I'm going off the top of my head here, so don't expect the code to work, I'm simply trying to illustrate a concept.[/quote]
But wouldn't that also create a very large file? (haven't tested yet)
[eluser]TheFuzzy0ne[/eluser]
Yes, but only slightly larger than the original input file.
You are encouraged to improve on it; I was just trying to explain how it could work. You could also break the output up into smaller chunks (each INSERT handling 50 rows, for example). The whole point was basically to allow you to do this:
Code: +---------+                        +----------+
      |         |     +------------+     |          |     +-----------+
      | In file |-->--| Processing |-->--| Out file |-->--| DB Insert |
      |         |     +------------+     |          |     +-----------+
      +---------+                        +----------+
           |                                                    |
           <-----<-----<-----<-----<-----<-----<-----<-----<----
The out file is a replacement for your memory: you grab a little data, process it, and put it in the file; grab a bit more data, process it, and put it in the file; and so on, in a loop. When you're done, you can load the file from disk and run the query.
The more rows you can fit into a single INSERT, the better. One query inserting 50 rows performs better than 50 queries inserting one row each.
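One thing to watch out for: if you read the finished out file back into PHP as one big string, you're right back at the original memory problem. A minimal sketch of one way around that, assuming the mysql command-line client is installed on the server and exec() is allowed (the path and credentials below are placeholders, not anything from your set-up):
Code: <?php
# Hand the generated SQL file to the mysql client, which streams it from disk,
# instead of loading the whole file back into PHP memory.
# All of these values are placeholders -- adjust them to your own set-up.
$sql_file = APPPATH . 'feeds_csv/temp/out_file.sql';

$cmd = 'mysql -u ' . escapeshellarg('db_user')
     . ' -p' . escapeshellarg('db_pass')
     . ' ' . escapeshellarg('db_name')
     . ' < ' . escapeshellarg($sql_file);

exec($cmd, $output, $status);

if ($status !== 0)
{
    # Import failed; log it and handle it however suits your app.
    log_message('error', 'Bulk import failed with exit status ' . $status);
}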
[eluser]NogDog[/eluser]
I've used something like the following (though not in CI to date) for inserting lots of rows, breaking the inserts up into manageable "chunks":
Code: <?php
$chunkSize = 20; // number of rows to insert per query
// $handle is assumed to be an already-open CSV file handle
fgetcsv($handle, 3000, ";"); // skip the first (header) line
$counter = 0;
$values = array();

while (($data = fgetcsv($handle, 3000, ";")) !== FALSE)
{
    if (count($data) < 2)
    {
        continue;
    }
    $counter++;
    $values[] = sprintf(
        "('%s', '%s', %d)",
        $this->db->call_function('escape_string', $data[0]),
        $this->db->call_function('escape_string', $data[1]),
        (int) $data[2]
    );
    if ($counter == $chunkSize)
    {
        doInsert($this->db, $values);
        $values = array();
        $counter = 0;
    }
}

if ($counter > 0) // still some values to be inserted
{
    doInsert($this->db, $values);
}

// insert an array of rows ($db is passed in, since $this isn't available in a plain function)
function doInsert($db, $values)
{
    $sql = "INSERT INTO some_table (`col1`, `col2`, `col3`) VALUES\n";
    $sql .= implode(",\n", $values);
    $db->query($sql);
    // add some error-checking/logging here
}
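To plug that into the per-file loop from your first post, something roughly like this might work (a sketch only; the path and loop come from that post, and the chunked-insert code above would go where the comment is):
Code: <?php
// Rough sketch: open each chunk file in turn and run the chunked-insert
// loop above against it. The path and loop are taken from the first post.
for ($i = 0, $count = count($dir); $i < $count; $i++)
{
    $handle = fopen(APPPATH . 'feeds_csv/temp/' . $ar . '/' . ($i + 1) . '.csv', 'r');
    if ($handle === FALSE)
    {
        continue; // or log the failure and bail out
    }

    // ... chunked-insert code from above goes here, reading from $handle ...

    fclose($handle);
    unset($handle);
}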
[eluser]slowgary[/eluser]
Is the high memory usage due to opening so many database connections? I don't know how CI does it, but if it doesn't close your connections automagically then you should probably do it after every query. Why couldn't you do that AND insert ~50 rows per query?
Pseudocode:
Code: $count = 0;
$file_handle = open_file();
$query = 'INSERT INTO `table` (`column1`, `column2`, `column3`) VALUES ';

while (!end_of_file($file_handle))
{
    if ($count == 50)
    {
        $count = 0;
        $query .= '('. fgets($file_handle) .')';
        mysql_query($query);
        mysql_close();
        $query = 'INSERT INTO `table` (`column1`, `column2`, `column3`) VALUES ';
    }
    else
    {
        $query .= '('. fgets($file_handle) .'),';
        $count++;
    }
}
Correct me if I'm wrong (probably), but I don't see why doing it this way should use any significant amount of memory.
[eluser]codex[/eluser]
[quote author="NogDog" date="1239070917"]I've used something like the following (though not in CI to date) for inserting lots of rows, breaking the inserts up into manageable "chunks":
[/quote]
Thanks, I've tried your method, and although memory consumption seems to be lower, it still doesn't work for me. I'm giving Fuzzy's solution a try now.
[eluser]codex[/eluser]
[quote author="slowgary" date="1239072766"]Is the high memory usage due to opening so many database connections? I don't know how CI does it, but if it doesn't close your connections automagically then you should probably do it after every query. Why couldn't you do that AND insert ~50 rows per query?
[/quote]
Closing the DB connection doesn't seem to help, so I guess that's not the problem. Thanks for your response though.