How many cached files is too many?
#1

[eluser]AndyP[/eluser]
We have redeveloped a popular website (10,000+ visitors/day) using CodeIgniter, and have enabled the CI caching module on many of our most popular pages. Over the course of a couple days, the caching module generated over 500,000 cache files consuming about 20G of disk space. Based on the website traffic and the size of the underlying database, the total number of cached files seems appropriate.

These files all reside in the same directory (/system/cache). Are there negative performance implications for having so many cache files in a single directory?
#2

[eluser]Skinnpenal[/eluser]
It becomes a performance issue well before it hits any technical limit.

I read some articles on this topic that I can't find at the moment, but as I recall many people started seeing drastic slowdowns at around 100,000+ files in a single directory, so your site may well be affected by this.

It all comes down to the filesystem used, though.

If I find the articles I'll post links. :)


EDIT:

While searching around now, it seems more people mention a performance threshold around 10,000-15,000 files. If that's the case, I'm really curious how the performance feels now with 500,000 files.
#3

[eluser]mjsilva[/eluser]
Not CodeIgniter-specific, but for example with Squid proxy, which also uses a lot of files, I use reiserFS as the filesystem, as I've read it's better at handling large numbers of files.

But remember that having a folder with that many files can slow down your entire system, not just access to that folder. You could try hacking the cache library/helper to split the cache files into subfolders; IMO that's a better approach for handling a large quantity of files.
#4

[eluser]AndyP[/eluser]
The performance of the website seems acceptable most of the time (with occasional, but probably unrelated, hiccups), but doing operations on that folder at the command line is virtually impossible. It takes several seconds just to count the files (ls -1 | wc).

I'm thinking of modifying the caching class to create a nested directory layer beneath the cache directory. These directories would be named based on the first two characters of the cache filename. So if the cache file is "ab019ba2071cb463aaf0d1861686afaf", then that cache file would be stored in /system/cache/ab/.

This would generate up to 256 directories (16 possible characters in the filename) in the /cache directory, and each individual directory would then contain approximately 2,000 files.

This technique is recommended here: http://serverfault.com/questions/49684/l...ilesystems

EDIT: mjsilva -- Yes, I think that's what I'll try.

EDIT 2: Correction, the hash-based filenames contain only hexadecimal characters, meaning I'll create up to 16*16 (256) directories containing approximately 2000 files each.
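
Roughly, the change I'm picturing looks like this. It's only a sketch, not tested, and the path and variable names are just placeholders for whatever the Output class actually uses:

Code:
// Sketch only: map a cache filename such as "ab019ba2071cb463aaf0d1861686afaf"
// to a two-character subdirectory before reading or writing it.
$cache_path = BASEPATH.'cache/';                    // or whatever cache_path is set to
$filename   = 'ab019ba2071cb463aaf0d1861686afaf';   // the existing hash-style cache name

$subdir = $cache_path.substr($filename, 0, 2).'/';  // e.g. .../cache/ab/

if ( ! is_dir($subdir))
{
    @mkdir($subdir, 0777);                          // create /cache/ab/ on first use
}

$cache_file = $subdir.$filename;                    // use this instead of $cache_path.$filename

The same substr() has to be applied in both the cache-write and cache-read code paths, of course, or the reads will never find the files the writes created.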
#5

[eluser]jedd[/eluser]
You didn't mention what FS you are using there?

reiserFS copes with many files in a single directory much better than ext2/3, for example. Not sure how it compares with ext4, or indeed the jfs/xfs/etc's of the world. ButterFS might be worth a look, though it's not recommended for production systems yet (and is some way off I guess) but OTOH it's stable enough for Linus to use as his rootfs. OTOOH .. Linus really knows what he's doing and he doesn't annoy 10,000 people a day when his laptop goes titsup.

Hitting a file where you know the name is always much, much faster than scanning a directory (hence you're not seeing the web equivalent of your multi-second 'ls | wc' CLI experience), but I still reckon there has to be a break-even point where it's more expensive to find the cache file than to re-generate the results. What options exist for tuning that cache - is it just a matter of age-limiting entries? I don't have much experience with caching at this level.
#6

[eluser]AndyP[/eluser]
Thanks for the ideas -- the filesystem is ext3, and I'd rather not stray too far from my plain-vanilla CentOS 5 installation on Slicehost, so I don't think a different filesystem is an option.

My understanding is that the cache just checks to see if the cache file exists (if not, it creates it) and if it is recent enough (if not, it overwrites it) before it serves the cache file. The CI cache only takes a single parameter: the number of minutes a cache file should be used before it is regenerated.
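
In other words, the serve-or-regenerate decision boils down to something like this (a simplified sketch, not the actual Output class code; the variable names are placeholders):

Code:
// Simplified sketch of the file-cache check. $cache_path, $uri and
// $cache_minutes stand in for whatever the framework actually uses.
$cache_file = $cache_path.md5($uri);        // one cache file per cached URI
$expires    = $cache_minutes * 60;          // the single "minutes" setting, in seconds

if (file_exists($cache_file) && (filemtime($cache_file) + $expires) > time())
{
    readfile($cache_file);                  // still fresh: serve the cached copy and stop
}
else
{
    // missing or stale: build the page normally, then rewrite the cache file
}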

Earlier today I modified the caching functions in the output class to create and use a layer of subdirectories. We will see if this helps the overall performance of the site. Unfortunately, I don't have a single symptom I can point to and say "now it's fixed!", just a suspicion that so many files in a single directory can't be good.
#7

[eluser]bretticus[/eluser]
With a site that active, why use file caching at all? You mentioned Slicehost; do they offer PHP app caching? You ought to get significantly better performance than from 20 GB of file cache!
#8

[eluser]AndyP[/eluser]
We are running eAccelerator already:

Code:
PHP 5.1.6 (cli) (built: Apr  7 2009 08:00:18)
Copyright (c) 1997-2006 The PHP Group
Zend Engine v2.1.0, Copyright (c) 1998-2006 Zend Technologies
    with eAccelerator v0.9.5.2, Copyright (c) 2004-2006 eAccelerator, by eAccelerator

The expense of page generation is really more in the database queries than in PHP anyhow. The database is very active, so built-in MySQL query caching only takes us so far. Perhaps it makes sense to look into a Squid-based reverse proxy.
#9

[eluser]bretticus[/eluser]
[quote author="AndyP" date="1251503836"]We are running eAccelerator already:

Code:
PHP 5.1.6 (cli) (built: Apr  7 2009 08:00:18)
Copyright (c) 1997-2006 The PHP Group
Zend Engine v2.1.0, Copyright (c) 1998-2006 Zend Technologies
    with eAccelerator v0.9.5.2, Copyright (c) 2004-2006 eAccelerator, by eAccelerator

The expense of page generation is really more in the database queries than in PHP anyhow. The database is very active, so built-in MySQL query caching only takes us so far. Perhaps it makes sense to look into a Squid-based reverse proxy.[/quote]

I've never used eAccelerator, and I guessed it was a database bottleneck anyway. We actually use APC, and we actively cache arrays generated from database queries in memory, since many a dynamically generated page has fairly static data. We don't get the traffic you do (these are shared-host virtual machines, though we do pretty much all of the coding for clients), but I still see a significant performance gain from caching data in memory like this. I hear memcache works very well for caching data in this manner. There's also an article on memcache and app scaling floating around (though it should be noted he did his benchmarking on his own computer, with the database engine on the same machine, no doubt). Food for thought.
#10

[eluser]bretticus[/eluser]
For example, here is one of my model methods. It's a bit older code (so it's written PHP 4-style).

Code:
function get_attributes($productid)
{
    $this->db->select('ProductAttributes.AttributeID, Attributes.AttributeName');
    $this->db->select('ProductAttributes.AttributeValue, ProductAttributes.Modifier');
    $this->db->from('Products');
    $this->db->join('Attributes', 'ProductAttributes.AttributeID = Attributes.AttributeID');
    $this->db->where('ProductAttributes.Deleted', 0);
    $this->db->where('ProductAttributes.ProductID', $productid);

    // Get the compiled SELECT statement and hash it to make the cache key
    $compiled_query = $this->db->_compile_select();
    $cache_key = md5($compiled_query);

    $returned = apc_fetch($cache_key);
    if ($returned !== FALSE)
    {
        // Cache hit: clear the pending Active Record select and return the cached rows
        $this->db->_reset_select();
    }
    else
    {
        // Cache miss: run the query, build the result array, and store it in APC
        $returned = array();
        $query = $this->db->get();
        foreach ($query->result_array() as $row)
        {
            $returned[] = $row;
        }
        apc_store($cache_key, $returned, APC_CACHE_TTL);
        $query->free_result();
    }
    return $returned;
}

EDIT: This was some of my first CI code. Now I realize I should have extended Model to avoid DRY infractions. :)
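
Something along these lines would pull the caching out of each model method. It's only a rough sketch (not code from this thread): it assumes you can get a MY_Model base class loaded (e.g. dropped in application/libraries/), plus the same internal _compile_select()/_reset_select() calls and APC_CACHE_TTL constant used above:

Code:
// application/libraries/MY_Model.php (sketch only)
class MY_Model extends Model {

    function MY_Model()
    {
        parent::Model();
    }

    // Run whatever Active Record query has been built on $this->db,
    // caching the result array in APC keyed on the compiled SQL.
    function cached_result_array()
    {
        $compiled_query = $this->db->_compile_select();
        $key = md5($compiled_query);

        $rows = apc_fetch($key);
        if ($rows !== FALSE)
        {
            $this->db->_reset_select();
            return $rows;
        }

        $rows = array();
        $query = $this->db->get();
        foreach ($query->result_array() as $row)
        {
            $rows[] = $row;
        }
        apc_store($key, $rows, APC_CACHE_TTL);
        $query->free_result();
        return $rows;
    }
}

With that in place, get_attributes() would shrink to the select/from/join/where calls plus a single return $this->cached_result_array();.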



