storing markup, again (or... html into database)
#1

[eluser]slowgary[/eluser]
Yo guys (and girls?),

I know this one's been mentioned before, but the posts I read didn't really satisfy my requirements. I'm putting together a basic CMS and need to store the markup for each page in a table. I'm using markItUp as the editor, which I think is a wonderful piece of software.

What I'm trying to figure out, though, is the best way to escape markup before putting it into the database. Does it even need to be escaped? I've seen people recommend htmlspecialchars() or htmlentities(), but the problem I see comes when pulling the markup back out of the database. The reverse of those functions will turn not only &lt; into <, but also &amp; into &. I come from a world of 100% markup validation, and I'd hate to be unable to store valid HTML entity names in the database.

What do I do?
#2

[eluser]Colin Williams[/eluser]
Clean/format it on the way out, not the way in. The only concern you should have on the way into the database is SQL attacks, which CI's db class will clean for you.
#3

[eluser]slowgary[/eluser]
But what would I even format on the way out? Do I only need to worry about stripping slashes?
#4

[eluser]TheFuzzy0ne[/eluser]
You'd also need to escape HTML entities, wouldn't you?
#5

[eluser]xwero[/eluser]
Markup is not going to do anything to your database; you can't view the field value without querying it. So escaping HTML entities is not needed. You don't even have to be afraid of XSS attack code at that point. Escaping HTML entities only adds more characters, which means more storage space is required.

The only thing you have to be afraid of is SQL attacks, for example ending the statement and inserting a new one to delete data. But if you use the AR (Active Record) methods you are already safe; otherwise you need to use the escape methods.
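
As a rough illustration of that point, here is a minimal sketch that stores raw markup through a bound parameter and reads it back unchanged. It uses plain PDO with an in-memory SQLite database rather than CI's db class, purely so it runs on its own (it assumes the pdo_sqlite extension is available); the `pages` table and `body` column are made-up names.

```php
<?php
// Sketch only: PDO with in-memory SQLite stands in for the real database.
// The `pages` table and its `body` column are hypothetical.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE pages (id INTEGER PRIMARY KEY, body TEXT)');

// Markup containing quotes, an entity, and a crude injection attempt.
$markup = "<p class=\"intro\">Fish &amp; chips</p>'); DROP TABLE pages; --";

// The placeholder keeps the value out of the SQL text, so the markup
// itself is stored without any HTML escaping.
$insert = $pdo->prepare('INSERT INTO pages (body) VALUES (:body)');
$insert->execute([':body' => $markup]);

// Read the row back and compare it with what went in.
$stored = $pdo->query('SELECT body FROM pages')->fetchColumn();
var_dump($stored === $markup);
```

The entity in the markup is never touched on the way in, which is the behaviour the original question was after.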
#6

[eluser]TheFuzzy0ne[/eluser]
I meant on the way out. You'd need to escape it so it doesn't break any forms.
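
A small sketch of what escaping for a form could look like, assuming the stored markup is being dropped back into a textarea for editing. The `$stored` value and the field name are invented for the example.

```php
<?php
// Hypothetical page body, exactly as it came out of the database.
$stored = '<p>Fish &amp; chips</p></textarea><b>oops</b>';

// Escape only at the moment the markup is written into the form field.
$safe = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
echo '<textarea name="body">' . $safe . '</textarea>', "\n";

// Decoding once (as a browser does with the field value) gives back the
// original string: the existing &amp; was encoded to &amp;amp; first, so
// it returns as &amp; and not as a bare &.
var_dump(htmlspecialchars_decode($safe, ENT_QUOTES) === $stored);
```

Because the escaping happens once on output and is undone once by the browser, valid entities in the stored markup should survive the edit cycle.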
#7

[eluser]slowgary[/eluser]
Is there an industry standard way this is done? What do the big popular CMS packages do?

If markup is fine to store in a database, I'll still need to filter for html entities that are NOT part of an html tag, and I would probably do this on the way into the database or as part of the html editor so that it only needs to be done once, as opposed to on every page load.

Other questions still linger in my head, like searching pages. If the database stores the page with markup, how would you go about a user search function? If a user types <p> into the search, is it going to return all pages because it searches the page's markup too?

What about compression? Is storing this markup in the database a bad idea to begin with? I did some Google searching and a few people suggested storing pages outside the database as individual flat files. That seems stupid to me, though, because they're both stored on a hard drive, except with a flat file you lose all the benefits of SQL.

I know this might seem like a stupid topic but I'd really love to figure out the 'BEST' method.
#8

[eluser]xwero[/eluser]
Quote: I'll still need to filter for html entities that are NOT part of an html tag
Why? If the characters are used in plain text the parser will not process them.

If you want to fulltext search the html snippets you could strip all the tags and store it as plain text with a fulltext flag.
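
One possible way to build that plain-text copy, as a sketch: strip the tags, decode the entities, and collapse the whitespace. The result could then go into a separate column (say a hypothetical `body_text`) that carries the fulltext index; the sample markup is invented.

```php
<?php
// Hypothetical stored markup for one page.
$body = '<h1>Opening hours</h1><p>We serve <em>fish</em> &amp; chips until 9pm.</p>';

// Add a space before every tag so words in adjacent elements do not
// run together once the tags are gone.
$spaced = str_replace('<', ' <', $body);

// Remove the tags, turn entities back into characters, tidy whitespace.
$text = strip_tags($spaced);
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
$text = trim(preg_replace('/\s+/', ' ', $text));

echo $text, "\n";
```

A search for <p> against this copy would no longer match every page. The trade-off of the added spaces is that a tag in the middle of a word would split that word in two.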

Saving the snippets as files will increase the performance as you bypass the database querying.

You're asking multiple questions now which makes it harder to find a 'best' method as you add more factors to the equation. So you want to build a cms where parts of your pages are static code.
My advice is to keep everything in the database to create a well-functioning product in a 'short' timeframe. If you start mixing static files with database objects to create the pages, you have to develop two different ways of doing things. The first thing that comes to mind is revisions. Start small to end up large.
#9

[eluser]Colin Williams[/eluser]
Here's what "big" CMSs do. They sanitize the input for SQL attacks. Then they put in the code as is. Then when they display it, they use "input filters" (actually "output filters") to modify the stored code. This can do a number of things: Strip out script tags, php code, strip img tags maybe, etc. Whatever is desired (nothing at all even). If you are going to put the contents in a form input, then of course you need to escape the characters that would cause the form field to break. That's a no-brainer.
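
A very small sketch of such an output filter, to make the idea concrete. The function name `filter_output` is made up, the regular expressions are deliberately simple, and this is nowhere near a complete XSS defence; it only shows the "store as is, modify on display" flow.

```php
<?php
/**
 * Hypothetical output filter: drops script elements and PHP blocks from
 * stored markup before display. Illustrative only.
 */
function filter_output(string $markup): string
{
    // Remove script elements together with their contents.
    $markup = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $markup);
    // Remove PHP open/close tag pairs and whatever sits between them.
    $markup = preg_replace('#<\?(?:php|=)?.*?\?>#is', '', $markup);
    return $markup;
}

// The stored value stays untouched in the database; only the output changes.
$stored = '<p>Hello</p><script>alert(1)</script><?php echo "x"; ?><p>Bye &amp; thanks</p>';
echo filter_output($stored), "\n";
```

Since the filter runs on the way out, the rules can be changed later without rewriting anything that is already stored.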



