Automatic templating of text files within a directory (with caching)
#1

[eluser]BlkAngel[/eluser]
Greetings everyone!

I'm afraid that I require a little bit of assistance laying down the groundwork for a small personal project of mine. I'm basically trying to dive into PHP OOP, and I already did one small thing that works, so now I'm trying to make a little more ambitious project for myself.

Background: I have a lot of emails I want to save, and forgetting all the high-level Gmail stuff out there, what I have been doing for a while now is a simple copy/paste into text files in various subdirectories.

Synopsis:
I was thinking that it should be possible to use Code Igniter for this task, to build some experience with it (I tried playing with Zend Framework, but for some reason the .htaccess setup didn’t work out too well for calling the correct controller methods).

For the example, let’s say this is our directory structure:


Code:
/myTextFiles
  /sample1
  - a.txt
  - b.txt
  - c.txt
  /sample2
  - d.txt
  - e.txt
  - f.txt

What I would like to know is the best way to index and cache the contents of this directory, at least the filenames and hierarchy. For example, visiting the page responsible for showing the contents of the root directory (/myTextFiles in this case), I would expect to see:

sample1
sample2


Based on the directory names. Then following “sample1” link for example, I would get:

a
b
c


…allowing me to follow any of those links and see the contents of the corresponding .txt file in a templated page.

Problems I need to solve:
1) I don’t want to recurse down the directory tree every time a request comes in for /myTextFiles… if there are several hundred files in there, that is a lot of system calls just to display some links. Ideally, until something gets changed: a new file uploaded or some file deleted, I’d like to cache the directory structure in some way (DB perhaps). Does anyone know the groundwork of what might work best here, since this is somewhat uncharted territory for me at this moment? I have a feeling trees and parent/child relationships will be key here.

2) It would be easier to make it one directory level deep, but ideally I’d like to expand it for infinite recursion, so /myTextFiles -> DirectoryA -> DirectoryB ->… etc. as far as it needs to go (maybe an arbitrary limit designated by a constant or something).


I’d appreciate any suggestions or advice any of you might offer.
#2

[eluser]sophistry[/eluser]
PHP has a wide range of directory accessing functions. They are fast and even do directory recursion for you.

See the very useful PHP manual (in English).
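
For reference, here is a minimal sketch of the built-in recursion mentioned above. It assumes a modern PHP (7+) with the SPL iterators available. The function name list_tree and its $max_depth parameter are made up for illustration and do not come from the thread.

```php
<?php
// Hypothetical helper (not from the thread): list a directory tree up to a
// maximum depth using the SPL iterators that ship with PHP.
function list_tree(string $root, int $max_depth = 10): array
{
    $entries = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::SELF_FIRST   // yield folders as well as files
    );
    // A depth of 0 keeps the walk to the root's direct children.
    $iterator->setMaxDepth($max_depth);

    foreach ($iterator as $path => $info) {
        $entries[] = [
            'path'  => $path,
            'type'  => $info->isDir() ? 'Folder' : 'File',
            'depth' => $iterator->getDepth(),
        ];
    }
    return $entries;
}
```

Calling list_tree('/myTextFiles', 0) would return only the top-level folders, while a larger $max_depth acts as the "arbitrary limit designated by a constant" from the original question.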
#3

[eluser]BlkAngel[/eluser]
Hehe. The PHP manual is nothing new to me, but I was curious if there was a solution to the caching issue, since I don't want to recurse a directory with every page access. That is the real question I have.
#4

[eluser]sophistry[/eluser]
Getting directories really is fast... try the PHP functions and then see if it is too slow for you; it sounds like you haven't yet built the code - I predict you will test and find it plenty fast even without CI's caching functionality.
#5

[eluser]OwanH[/eluser]
[quote author="BlkAngel" date="1184043776"]Hehe. The PHP manual is nothing new to me, but I was curious if there was a solution to the caching issue, since I don't want to recurse a directory with every page access. That is the real question I have.[/quote]

Yup, there is. I would definitely go with the idea of caching the directory hierarchy in a DB. I've written a class for you that does just that, but before delving into the code, let me explain the DB tables it needs. Here's the SQL I used to create them:

Code:
CREATE TABLE directory_tree (
    entry_id int(10) unsigned NOT NULL auto_increment,
    entry_name varchar(255) NOT NULL default '',
    entry_type enum('File','Folder') NOT NULL default 'File',
    parent_id int(10) unsigned NOT NULL default '0',
    PRIMARY KEY  (entry_id)
) ENGINE=MyISAM;

CREATE TABLE ptr_directory_tree_root (
    root_entry_id int(10) unsigned NOT NULL default '0',
    PRIMARY KEY  (root_entry_id)
) ENGINE=MyISAM;

The directory_tree table stores your filename and hierarchy information, whereas the ptr_directory_tree_root table stores a single value: the ID of the row in the directory_tree table that holds the info for the root directory.

Here's a breakdown of the fields in the directory_tree table:

entry_id - Unique ID of a single entry in the directory tree.
entry_name - The entry name (file/folder name)
entry_type - Is it a file or a folder?
parent_id - The parent folder entry id

Looks like I have too many characters in the code to fit in this post, so I'll post the code for the controller class that does the ground work immediately after this post.
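
To make the parent/child idea concrete, here is a small sketch of how rows in directory_tree relate to each other. It uses a plain PHP array in place of the real table. The entry_id values, the bare entry names, and the helper functions children_of and path_of are all assumptions for illustration; the actual controller stores full pathnames for files.

```php
<?php
// Illustrative rows for part of the example layout. The entry_id values are
// made up; in the real table they would come from auto_increment.
$rows = [
    ['entry_id' => 1, 'entry_name' => 'myTextFiles', 'entry_type' => 'Folder', 'parent_id' => 0],
    ['entry_id' => 2, 'entry_name' => 'sample1',     'entry_type' => 'Folder', 'parent_id' => 1],
    ['entry_id' => 3, 'entry_name' => 'a.txt',       'entry_type' => 'File',   'parent_id' => 2],
    ['entry_id' => 4, 'entry_name' => 'b.txt',       'entry_type' => 'File',   'parent_id' => 2],
    ['entry_id' => 5, 'entry_name' => 'c.txt',       'entry_type' => 'File',   'parent_id' => 2],
];

// Array equivalent of: SELECT * FROM directory_tree WHERE parent_id = ?
function children_of(array $rows, int $parent_id): array
{
    return array_values(array_filter($rows, function (array $row) use ($parent_id) {
        return $row['parent_id'] === $parent_id;
    }));
}

// Follow parent_id links upward to rebuild a relative path for an entry.
function path_of(array $rows, int $entry_id): string
{
    $by_id = array_column($rows, null, 'entry_id');   // index rows by entry_id
    $parts = [];
    while ($entry_id !== 0 && isset($by_id[$entry_id])) {
        array_unshift($parts, $by_id[$entry_id]['entry_name']);
        $entry_id = $by_id[$entry_id]['parent_id'];
    }
    return implode('/', $parts);
}
```

Listing a folder is then a single lookup by parent_id, with no recursion needed, and a parent_id of 0 marks the root.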
#6

[eluser]OwanH[/eluser]
OK so as I promised, here's the code for the controller. I hope the documentation embedded in the code is adequate. Note that the code uses CI's Active Record class so make sure it is loaded if you decide to run with this code. Let me just say I tested it out and it works like a charm.

Code:
<?php
class Directory_tree_cache extends Controller {

  var $test_dir;                         // our test directory
  var $tbl_name = 'directory_tree';      // name of the DB table

  // The table 'ptr_directory_tree_root' has at most one (1) row that stores one (1)
  // single value: the entry_id of the row containing the root directory info.
  //
  var $ptr_root_dir_id_tbl = 'ptr_directory_tree_root';

  // Constructor
  function Directory_tree_cache()
  {
    // Call parent constructor
    parent::Controller();
  
    $this->load->helper('url');   // Load URL helper (for 'redirect' and 'site_url')

    $this->load->database();      // Connect to the database
    
    // I'm setting the entire "application" folder as my test directory. Change to suit your needs.
    $this->test_dir = APPPATH;
  }

  function index()
  {
    // Check if cache exists.
    if ($this->db->count_all($this->tbl_name) > 0)
    {
      $query = $this->db->get($this->ptr_root_dir_id_tbl);
      $row = $query->first_row();

      redirect("/directory_tree_cache/show_dir/" . $row->root_entry_id);
    }
    else {
      // This redirect will result in the cache being created and the contents
      // of the root folder will be shown.
      redirect('/directory_tree_cache/show_dir/0');
    }
  } // index
  
  function show_dir($dir_id)
  {
    // STEP 1: Let's see if a cache already exists for the folder hierarchy.
    if ($this->db->count_all($this->tbl_name) > 0)
    {
      // STEP 2: If cache exists, get directory contents from it.
      // Cast to int so a crafted URL segment cannot inject SQL into the WHERE clause.
      $query = $this->db->getwhere($this->tbl_name, "parent_id=" . (int) $dir_id);
      
      foreach ($query->result() as $row)
      {
        if ($row->entry_type == 'Folder')
        {
          $link = site_url('/directory_tree_cache/show_dir/' . $row->entry_id);
          $str = '<a href="' . $link . '">' . $row->entry_name . '</a>';
        }
        else {
          $link = site_url('/directory_tree_cache/show_file/' . $row->entry_id);
          $str = '<a href="' . $link . '">' . basename($row->entry_name) . '</a>';
        }

        echo $str . '<br />';
      }
    }
    else {
      // STEP 3: If cache does not exist, create it before outputting content structure.
      // NOTE: Zero (0) is used as the parent ID value for the info stored on the root
      // directory.
      $this->db->insert($this->tbl_name, array('entry_name' => $this->test_dir,
                        'entry_type' => 'Folder', 'parent_id' => 0));

      // Store this, we'll need it next few lines.
      $insert_id = $this->db->insert_id();

      // Now open the test directory and generate a DB cache of the entire tree structure.
      $this->_create_cache($this->test_dir, $insert_id);
      
      // Store ID of the row with the root directory info. in our reference table.
      $this->db->delete($this->ptr_root_dir_id_tbl, '1');
      $this->db->insert($this->ptr_root_dir_id_tbl, array('root_entry_id' => $insert_id));
      
      // Delete DB caches created by CI for this controller, if you have configured CI to
      // automatically cache your database queries. If you have NOT activated query caching
      // then comment out the following 3 lines.
      $this->db->cache_delete('directory_tree_cache', 'index');
      $this->db->cache_delete('directory_tree_cache', 'show_dir');
      $this->db->cache_delete('directory_tree_cache', 'show_file');
      
      // Redirect and show contents of the root directory.
      redirect("/directory_tree_cache/show_dir/$insert_id");
    }
  } // show_dir
  
  function show_file($file_id)
  {
    // Cast to int so a crafted URL segment cannot inject SQL into the WHERE clause.
    $query = $this->db->getwhere($this->tbl_name, "entry_id=" . (int) $file_id);
    $row = $query->first_row();
    
    // Show the contents of the file.
    echo file_get_contents($row->entry_name);
  } // show_file

  /* @private helper - creates a cache of the directory tree hierarchy. */
  function _create_cache($dir, $parent_id)
  {
    if (substr($dir, strlen($dir)-1, 1) != '/')
      $dir .= '/';

    if ($handle = @opendir($dir))
    {
      // Compare against false explicitly so an entry named "0" does not end the loop.
      while (($file = readdir($handle)) !== false)
      {
        // Ignore reference to 'self' and parent directories.
        if (($file == ".") || ($file == ".."))
          continue 1;

        if (is_dir($dir . $file))
        {
          // Sub-directory found, cache info. and go recursive.
          $this->db->insert($this->tbl_name, array('entry_name' => $file,
                            'entry_type' => 'Folder', 'parent_id' => $parent_id));
          $this->_create_cache($dir . $file, $this->db->insert_id());
        }
        elseif (is_file($dir . $file))
        {
          // File found, simply cache info.
          $this->db->insert($this->tbl_name, array('entry_name' => $dir.$file,
                            'entry_type' => 'File', 'parent_id' => $parent_id));
        }
      } // end while ($file = readdir($handle))

      @closedir($handle);     // Release the directory handle.
    }
  } // _create_cache

} /* End class Directory_tree_cache */

?>

Hope this proves useful, man. Knock yourself out. :)
#7

[eluser]BlkAngel[/eluser]
Thank you! I'll have some fun playing around with this.

Two questions that come to mind immediately are:
1) What is the caching strategy? How do you know when to retrieve a new copy and re-populate the DB?

2) For my specific application, would it be wiser to attempt to convert your controller into an externally loaded library instead? The controller aspect of it would, in line with my app idea, be reserved for templating and parsing directories, it seems.
#8

[eluser]OwanH[/eluser]
[quote author="BlkAngel" date="1184114143"]Thank you! I'll have some fun playing around with this.

Two questions that come to mind immediately are:
1) What is the caching strategy? How do you know when to retrieve a new copy and re-populate the DB?

2) For my specific application, would it be wiser to attempt to convert your controller into an externally loaded library instead? The controller aspect of it would, in line with my app idea, be reserved for templating and parsing directories, it seems.[/quote]

Well, the truth is I wanted to implement a solution to your problem that wasn't overly complicated, just to see if I was on the right track about exactly what you were asking. It definitely could be, and needs to be, more extensive, but hopefully I was on the right track. :)

So, to answer your questions:

1) The caching strategy is pretty simple: each time a request comes in for the root directory or any sub-directory n levels deep (where n >= 1 and is really only limited by storage space and server resource limits), check whether the directory tree structure has been cached. If it has, we get a listing of the directory's contents from the DB cache; otherwise the cache is created on the fly, that is, the DB is populated. As the code stands, there are a couple of caveats, explained below:

1.1) There are no checks in place to determine if any file/folder within the hierarchy has been modified since its info was cached, so the DB is never re-populated after the first cache is created. This could be easily implemented by adding a last_modified timestamp field to the directory_tree table to store the time the file/directory was last modified. That way, if a cache exists when a request is made, a system call can check the time of the most recent modification against the value stored in the last_modified field. If the time reported by the system call is more recent, the cache would be deleted and re-created for that file/directory, and of course recursive checks could be made on sub-directories. Also, full pathnames should be stored in the DB for every cached directory/file (the class currently only stores full pathnames for files, not directories).

1.2) The code currently lacks strong error-checking, for things like cache references to deleted/non-existent files or folders. Like I said earlier, I didn't wanna make the code look too complicated.

2) Yes, I agree 100% that it would be better and wiser to convert this controller into an externally loaded library that has the core routines for managing the directory cache or "index". The library could take care of creating/updating/deleting the cache, and provide interface methods for retrieving/updating directory listings and file content, etc. That way, like you said, your controller could be reserved for templating and parsing directories, and hey, you could even throw in a couple of views to display your hierarchy links if you feel dangerously MVC. :)
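
The two points above (a last-modified check, and moving the logic into a library) could be sketched roughly as follows. This is only an illustration: the class name Directory_index and its methods are hypothetical, not part of CI, and the snapshot is kept in a PHP array as a stand-in for the last_modified column described in 1.1. It also assumes a typical filesystem, where a directory's modification time changes when entries are added, removed, or renamed directly inside it, but not when the contents of an existing file are edited.

```php
<?php
// Hypothetical library class; the name and methods are illustrative only.
class Directory_index
{
    /** @var array<string,int> directory path => mtime recorded at scan time */
    private $snapshot = [];

    // Record the modification time of $root and of every sub-directory below it.
    public function scan(string $root): void
    {
        $this->snapshot = [$root => filemtime($root)];
        $iterator = new RecursiveIteratorIterator(
            new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS),
            RecursiveIteratorIterator::SELF_FIRST
        );
        foreach ($iterator as $path => $info) {
            if ($info->isDir()) {
                $this->snapshot[$path] = $info->getMTime();
            }
        }
    }

    // True when nothing has been scanned yet, or when any recorded directory
    // has disappeared or reports a newer mtime than the snapshot.
    public function is_stale(): bool
    {
        if (empty($this->snapshot)) {
            return true;
        }
        clearstatcache();   // PHP caches stat() results within a request
        foreach ($this->snapshot as $path => $mtime) {
            if (!is_dir($path) || filemtime($path) > $mtime) {
                return true;
            }
        }
        return false;
    }
}
```

A controller would call is_stale() before serving a listing and rebuild the cache only when it returns true. Note that this still costs one stat call per cached directory, and that timestamps with one-second resolution can miss a change made in the same second as the scan.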
#9

[eluser]sophistry[/eluser]
now, are you ready to do some tests that show that this is an optimization over simply reading the directory tree each time?

i'd bet that you'd have to have some pretty deep trees and thousands of files to see any benefit to building out this whole set of code.
#10

[eluser]OwanH[/eluser]
[quote author="sophistry" date="1184140403"]now, are you ready to do some tests that show that this is an optimization over simply reading the directory tree each time?

i'd bet that you'd have to have some pretty deep trees and thousands of files to see any benefit to building out this whole set of code.[/quote]

sophistry, you are right: there'd have to be some pretty deep trees and a significant number of files to see any noticeable benefit from building out this set of code. And as you said, getting directories is really fast in PHP because PHP simply maps its functions to the appropriate system calls.

Having said that, the reason I threw together this set of code was to simply help show BlkAngel a solution to what he/she was asking. I do not know the scope of his/her specific application and so I have no grounds on which I can say whether to use my solution (or a modification of it) or simply go with PHP's directory/filesystem functions. BlkAngel would have to determine which is a more suitable solution for the application.

Also, keep in mind that there is nothing new about a solution such as this. I know some operating systems (like Windows XP, UNIX, Linux) provide an implementation of some kind of directory "index" builder that allows you to create a cache of the contents and hierarchy tree of a given directory. But, to my knowledge, it's up to the user to enable such a feature on these OSes. It's all simply a matter of context. So I am in no way forcing my solution on BlkAngel; it's his/her call.

One more thing I would add is that I agree some tests should indeed be done to determine if this kind of solution would provide an optimization within the scope of the target application.
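
As a starting point for such tests, here is a rough timing sketch for the "just read the directory every time" side of the comparison. The function name time_directory_walk and the run count are assumptions for illustration. Results will depend heavily on the OS file cache, so repeated runs on a warm cache will look faster than a first cold read.

```php
<?php
// Hypothetical timing helper: average wall-clock seconds for one full
// recursive walk of $root, measured over $runs repetitions.
function time_directory_walk(string $root, int $runs = 20): float
{
    $runs = max(1, $runs);
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        clearstatcache();   // drop PHP's own stat cache between runs
        $count = 0;
        $iterator = new RecursiveIteratorIterator(
            new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS),
            RecursiveIteratorIterator::SELF_FIRST
        );
        foreach ($iterator as $info) {
            $count++;       // touch every entry so the walk is not skipped
        }
    }
    return (microtime(true) - $start) / $runs;
}
```

The same pattern (microtime(true) before and after) can be wrapped around the DB-backed listing, so both approaches are measured on the same tree before deciding whether the cache is worth the extra code.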



