Regex to remove html tags in a block

#1
[eluser]bugboy[/eluser]
Hi all

I'm trying to remove blocks of HTML from strings of text.
Once they are removed I place a marker where they were, run some code, and then add them back in.

This regex does it for most HTML tags apart from image tags (<img />):

Code:
"|<[^>]+>(.*)</[^>]+>|U"

I also want to remove whole blocks of HTML from a string.


So for example, say I have a string that looks like this. It contains links, an image and a YouTube embed.

Code:
Vestibulum enim wisi, viverra nec, fringilla in, laoreet vitae, risus. This is <a href="http://example.com/" title="Optional Title Here">title for this link reference-style link. This is a blockquote with two paragraphs. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aliquam hendrerit mi posuere lectus. This is <a href="http://example.com/" title="Title">an example</a> inline link.  Vestibulum enim wisi, viverra nec, fringilla in, laoreet vitae, risus.

<img src="/media/display/Photo_10.jpg" alt="Photo_10.jpg" />

<object height="350" width="425">
    <param name="movie" value="http://www.youtube.com/v/bvWQNa1czG4" />
    <param name="wmode" value="transparent" />
    <embed src="http://www.youtube.com/v/bvWQNa1czG4" type="application/x-shockwave-flash" height="350" wmode="transparent" width="425" />
</embed></param></param></object>

I would like it to be output like this.


Code:
Vestibulum enim wisi, viverra nec, fringilla in, laoreet vitae, risus. This is {*0} title for this link reference-style link. This is a blockquote with two paragraphs. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aliquam hendrerit mi posuere lectus. This is {*1} inline link.  Vestibulum enim wisi, viverra nec, fringilla in, laoreet vitae, risus.

{*2}

{*3}


I can't seem to figure it out. I get so far with it and then it breaks.

Any help would be greatly appreciated.

Thanks for your time

#2
[eluser]Sbioko[/eluser]
Try using this:
Code:
/\<(.*)\>(.*)\<(.*)\/\>/u

#3
[eluser]bugboy[/eluser]
Cheers for that

That didn't seem to work.

However this is doing the trick. Can this be optimised?

Code:
$regex = "(<[^>]+>.+?</[^>]+>|<[^>]+/>)si";
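
For anyone following along, here is a rough sketch of how that pattern could drive the marker idea from the first post: swap each match for {*n}, keep the removed HTML in an array, and put it back later. The function names extract_html and restore_html and the shortened sample string are made up for illustration; only the regex comes from the post above.

```php
<?php
// Hypothetical helper: replaces each matched HTML block with a {*n} marker
// and appends the removed block to $store.
function extract_html(string $text, array &$store): string
{
    $regex = "(<[^>]+>.+?</[^>]+>|<[^>]+/>)si"; // pattern from the post above
    return preg_replace_callback($regex, function (array $m) use (&$store) {
        $store[] = $m[0];                        // remember the removed block
        return '{*' . (count($store) - 1) . '}'; // numbered marker in its place
    }, $text);
}

// Hypothetical helper: swaps every {*n} marker back for the stored block.
function restore_html(string $text, array $store): string
{
    return preg_replace_callback('/\{\*(\d+)\}/', function (array $m) use ($store) {
        return $store[(int) $m[1]] ?? $m[0];     // unknown markers are left alone
    }, $text);
}

// Shortened sample input (an assumption, not the full string from post #1).
$input = 'This is <a href="http://example.com/" title="Title">an example</a> inline link.'
       . "\n\n"
       . '<img src="/media/display/Photo_10.jpg" alt="Photo_10.jpg" />';

$store    = [];
$marked   = extract_html($input, $store);   // "This is {*0} inline link.\n\n{*1}"
$restored = restore_html($marked, $store);  // identical to $input again
```

Keeping the blocks in a plain array means the marker number is just the array index, so nothing else is needed to put them back in the right place.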

#4
[eluser]Sbioko[/eluser]
What do you mean? What do you need to optimize here? I'm not really a regex master, but I know a bit about it, and I think the s and i modifiers should not be there.
Code:
/<[^>]+>.+?<\/[^>]+>|<[^>]+\/>/u

#5
[eluser]bugboy[/eluser]
Ahh, you're right about the (i).

I added the s so that the dot also matches newlines, and it does the trick on the YouTube code.

I say optimised as someone may have a more efficient way of doing it that's less CPU heavy.

Not to worry though, for the time being it's working.
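
To show what the s modifier changes, here is a small sketch that runs the same pattern with and without it on a cut-down, multi-line version of the YouTube block. The variable names and the shortened sample are assumptions for illustration only.

```php
<?php
// Cut-down multi-line block, loosely based on the YouTube embed in post #1.
$block = "<object height=\"350\" width=\"425\">\n"
       . "    <param name=\"movie\" value=\"http://www.youtube.com/v/bvWQNa1czG4\" />\n"
       . "</object>";

// Without s, the dot stops at newlines, so the first alternative cannot
// span lines and only the self-closing <param ... /> tag is matched.
preg_match("(<[^>]+>.+?</[^>]+>|<[^>]+/>)i", $block, $withoutS);

// With s, the dot also matches newlines, so the match runs from <object ...>
// through to </object>.
preg_match("(<[^>]+>.+?</[^>]+>|<[^>]+/>)si", $block, $withS);

$paramOnly  = $withoutS[0]; // just the <param ... /> tag
$wholeBlock = $withS[0];    // the entire $block
```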

#6
[eluser]Phil Sturgeon[/eluser]
http://stackoverflow.com/questions/17323...54#1732454

Quote: 2587 votes


You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

#7
[eluser]Sbioko[/eluser]
I don't know if this is what you need, but try Googling "htmlSQL".

#8
[eluser]bugboy[/eluser]
Well, I'm not parsing HTML as such, I'm just removing tags.

That code works so I'm happy.

:)

#9
[eluser]Sbioko[/eluser]
What code? :-) htmlSQL? If you just want to remove tags and that's all, use:
Code:
strip_tags($html);

#10
[eluser]bugboy[/eluser]
Yeah, I needed to do more than that.

I needed to get the HTML tags and store them.

Otherwise strip_tags would have been the way to go.
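
A quick sketch of the difference, assuming a shortened sample string: strip_tags throws the markup away and keeps the inner text, while preg_match_all with the regex from post #3 hands the markup back so it can be stored. The variable names are only for illustration.

```php
<?php
// Shortened sample string (an assumption, not the full text from post #1).
$html = 'This is <a href="http://example.com/" title="Title">an example</a> inline link.';

// strip_tags() drops the tags but keeps the text between them.
$plain = strip_tags($html); // "This is an example inline link."

// preg_match_all() with the pattern from post #3 collects the whole
// element, tags and inner text together, in $matches[0].
preg_match_all("(<[^>]+>.+?</[^>]+>|<[^>]+/>)si", $html, $matches);
$blocks = $matches[0]; // one entry: the complete <a ...>an example</a> element
```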

