Sanitising user submitted HTML
#1

[eluser]TheFuzzy0ne[/eluser]
Hi, all.

I need to allow users to submit HTML, and I need to confirm that the HTML is valid. There is a PECL Tidy extension, but unfortunately I don't have access to it on the destination server. I don't think BBCode is a good fit for what I need, as the client wants to be able to paste Word documents into the textarea. I just need to validate the code in a similar manner to the Tidy extension, to ensure that all of the tags are closed correctly and that the markup isn't going to break my site.

Any suggestions welcomed.
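
To make the "all of the tags are closed correctly" requirement concrete, here is a minimal sketch of a tag-balance check. It is written in Python's standard library purely for illustration (the thread itself targets PHP), and the names `TagBalanceChecker` and `find_tag_errors` are hypothetical. It is deliberately stricter than a browser: it flags end tags that HTML allows you to omit, such as `</p>` or `</li>`, and it repairs nothing.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Tracks open tags on a stack and records mismatched end tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open (non-void) tags
        self.errors = []      # human-readable problems found so far

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")


def find_tag_errors(html):
    """Return a list of problems; an empty list means the tags balance."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.errors + [f"unclosed <{t}>" for t in checker.open_tags]


if __name__ == "__main__":
    print(find_tag_errors("<p><b>fine</b></p>"))
    print(find_tag_errors("<p><b>broken</p>"))
```

An empty result only says that the tags nest properly. It does not make the markup safe to display, so a whitelist-based sanitiser would still be needed.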
#2

[eluser]Colin Williams[/eluser]
Ran across this before: http://www.phpclasses.org/browse/file/21153.html

Haven't tried it out myself.
#3

[eluser]TheFuzzy0ne[/eluser]
Hi, Colin. Thanks for the link, but unfortunately it requires the PECL Tidy extension.
#4

[eluser]Dam1an[/eluser]
What about
http://htmlpurifier.org/ -- PHP5 only though, not sure if that's a problem for you
http://simonwillison.net/2003/Feb/23/safeHtmlChecker/ -- Seems very lightweight, single file

I searched for the word "pecl", and it didn't come up on either site, so they should be OK :)
#5

[eluser]slowgary[/eluser]
Just a quick note: I've seen content pasted from Word go into a CMS, and it usually includes a bunch of proprietary Word tags that will screw with IE (but only IE). There are usually TONS of them and they look like this:
Code:
<mso:somethingSomething/>

So watch your back.
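
A rough sketch of stripping those tags before the content is stored, in Python for illustration only. It assumes the Word-specific tags all use the `mso:` prefix shown above; the function name `strip_mso_tags` is hypothetical. A regular expression like this is only a stopgap: it drops the tags but keeps any text between them, and it would be confused by a `>` inside an attribute value.

```python
import re

# Matches opening, closing and self-closing tags in the "mso:" namespace,
# e.g. <mso:foo>, </mso:foo>, <mso:foo/>.
MSO_TAG = re.compile(r"</?mso:[^>]*>", re.IGNORECASE)


def strip_mso_tags(html):
    """Remove mso-prefixed tags, leaving the surrounding content untouched."""
    return MSO_TAG.sub("", html)


if __name__ == "__main__":
    print(strip_mso_tags("<p>Hello<mso:somethingSomething/> world</p>"))
```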
#6

[eluser]Thorpe Obazee[/eluser]
[quote author="Dam1an" date="1243563976"]What about
http://htmlpurifier.org/ -- PHP5 only though, not sure if that's a problem for you
http://simonwillison.net/2003/Feb/23/safeHtmlChecker/ -- Seems very lightweight, single file

I searched for the word "pecl", and it didn't come up on either site, so they should be OK :)[/quote]

I'd recommend HTML Purifier. I've used it and tested it, coupled with an XSS filter.
#7

[eluser]Dam1an[/eluser]
[quote author="slowgary" date="1243573097"]Just a quick note: I've seen content pasted from Word go into a CMS, and it usually includes a bunch of proprietary Word tags that will screw with IE (but only IE). There are usually TONS of them and they look like this:
Code:
<mso:somethingSomething/>

So watch your back.[/quote]

I know what you mean. I've written blog posts in Word because I hate the WordPress new post page, and when you copy the text in, it looks fine, but in the HTML view you get all the extra tags. It doesn't seem to cause any harm (checked in Firefox 3 and IE7), but I'm sure it can't be good... there's 5 times as much 'junk' as there is content lol
#8

[eluser]TheFuzzy0ne[/eluser]
Good idea. I think I'm also going to have to strip out some of the style information, as just about every element has a style attribute containing the following at minimum:

Code:
style="border-width: 0px; margin: 0px; padding: 0px; vertical-align: baseline; font-family: inherit; font-weight: inherit; font-style: inherit; font-size: 100%; outline-width: 0px;"

It's no wonder MS Word documents are so big! All of that data is pretty much redundant.
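
A minimal sketch of stripping those inline styles, again in Python just to illustrate the idea. The function name `strip_style_attributes` is hypothetical, and the pattern assumes the attribute value is always quoted, as in the sample above. Because it is a plain regular expression, it would also remove text that merely looks like a style attribute outside of a tag, so a proper sanitiser would be the safer route.

```python
import re

# Matches a quoted style attribute together with the whitespace before it,
# e.g. ' style="margin: 0px;"'. The leading \s+ keeps attributes such as
# data-style from matching.
STYLE_ATTR = re.compile(r"""\s+style\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_style_attributes(html):
    """Remove every quoted style="..." attribute from the markup."""
    return STYLE_ATTR.sub("", html)


if __name__ == "__main__":
    sample = '<p style="border-width: 0px; margin: 0px; padding: 0px;">Hi</p>'
    print(strip_style_attributes(sample))
```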

I'm thinking of switching over from nicEdit to FCKeditor, as FCKeditor appears to convert a pasted Word document into valid HTML, without all of those silly inline CSS styles. Pity... I really like nicEdit.

Thanks for your input.

Also, Dam1an, thanks for those links. I can't decide which one is better. I have access to PHP 5, but this will potentially be used on servers that may not support it.

EDIT: Hmmm, it doesn't look like FCKeditor degrades gracefully...
#9

[eluser]garymardell[/eluser]
I'm sure it wouldn't take much to make FCKeditor degrade gracefully. Just have a plain textarea as normal and use a simple JavaScript call to replace it with FCKeditor, so users without JavaScript still get the textarea.
#10

[eluser]xwero[/eluser]
Why not let the user upload a Word document and, after processing it, display the content in the editor field? I think you'd get better results that way.



