
09-21-2009, 06:25 AM

[eluser]bugboy[/eluser]
Hi all

I need to scan a block of body copy and replace certain words within it with other words.

I can do this and have got this part working.

However, I need it to replace only words outside of any HTML tags: <.*> </.*> < />

So if the word appears in a link, it's ignored.

Any ideas on how to do this?

my code so far

Code:
$patterns[] = '/'.$tag.'(?![^<]*>)/i'; 

$replacements[] = '<a href="/'.$tag.'">'.$tag.'</a>';

Thanks

09-21-2009, 07:19 AM

[eluser]sophistry[/eluser]
three things which won't solve your issue, but may help you move forward more swiftly:

- if you want to only replace tags OUTSIDE of html tags, then why does the replacements var show $tag IN an anchor tag?
- which HTML tags? is it only text in anchor tags you need to protect? or is there some other tag?
- why do you use a negative assertion ?! and then a negated character class [^<] in the patterns var? it's confusing to have a double-negative in a pattern. you could probably simplify that. EDIT: also, incidentally, the star after [^<] means zero or more characters, so that part can also match nothing at all.

cheers.

09-21-2009, 07:50 AM

[eluser]bugboy[/eluser]
Hi sophistry

1. Sorry, my mistake. It should be:

Code:
$patterns[] = '/'.$tag.'(?![^<]*>)/i'; 

$replacements[] = '<a href="/'.$title_tag.'">'.$title_tag.'</a>';

this is just a snippet of code found within a foreach loop

EDIT: due to it removing elements

2. it needs to ignore

Code:
links: <a> and ignore in here</a>

images <img  />

and within them for example the src=" " and href=" "

3. i'm not too hot on regex, as you can see, so i suppose i'm muddling through, but i'm willing to learn

If you can see any better way then that would be great.

This isn't easy to understand and i have always found this difficult.

09-21-2009, 08:10 AM

[eluser]sophistry[/eluser]
ok, that's clearer: you just need to ignore "a" and "img" tags.

so, even though you changed tag to title_tag, the replacements var still shows the a href= being replaced. i am pretty sure you are saying you don't want to replace the part inside the href= so i am confused why it is in there...

please clarify.

regex is easy if you just treat it like a super compact programming language. just learn it like you'd learn any other!

cheers.

09-21-2009, 08:20 AM

[eluser]bugboy[/eluser]
yeah

i want to replace $tag with a link to $title_tag

Code:
$string = preg_replace($patterns, $replacements, $string, 1);

This would then replace the first instance of that word with a link to a page about that word.

(a really simple example to show what i mean)

Code:
$tag = 'hello';

$title_tag = "hello_page";

$patterns[] = '/'.$tag.'(?![^<]*>)/i'; 

$replacements[] = '<a href="/'.$title_tag.'">'.$title_tag.'</a>';

The thing is, if there is already a link in the copy that has hello in it, then i'd get nested links
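To make the nested-link problem concrete, here is a minimal sketch that runs the pattern on two sample strings. The sample strings are made up for illustration, and the preg_quote() call and the \b word boundaries are additions that are not in the posted code (they escape regex metacharacters in the tag and stop "hello" matching inside a longer word).

```php
<?php
$tag       = 'hello';
$title_tag = 'hello_page';

// preg_quote() and \b are additions, not part of the posted snippet.
// (?![^<]*>) rejects a match when the next angle bracket ahead is a ">",
// i.e. when the word sits inside a tag such as src="..." or href="...".
$pattern     = '/\b' . preg_quote($tag, '/') . '\b(?![^<]*>)/i';
$replacement = '<a href="/' . $title_tag . '">' . $title_tag . '</a>';

// Hypothetical sample inputs.
$in_attribute = '<img src="hello.png" />';
$in_link_text = '<a href="/hello.html">say hello</a>';

// The word inside the src attribute is skipped, so this string is unchanged.
$a = preg_replace($pattern, $replacement, $in_attribute, 1);

// The word inside the href is skipped too, but the link text is not
// protected, so a second anchor ends up nested inside the first.
$b = preg_replace($pattern, $replacement, $in_link_text, 1);

echo $a, "\n", $b, "\n";
```

So the lookahead on its own covers attributes such as src and href, but not the text between an opening and closing anchor tag.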

09-21-2009, 08:42 AM

[eluser]sophistry[/eluser]
ok, i get it now.

you have a string that is a mix of text and HTML and you just want to replace certain "tags" (each a single word made of only alphanumeric characters) in the text part with an HTML anchor tag (link), but if the "tag" (again, just a single word of alnum chars) appears inside an href or src attribute you want to skip it (i.e., not replace).

Code:
$pattern = '/\s([a-z0-9]+)[.,;:\s]/i';

what about starting with something like that? this regex will ignore any text inside an href or src attribute because those (by definition) will not be bounded by space characters.
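A quick way to see what that pattern captures is to run it through preg_match_all(). This is only a sketch on a made-up sample string. One thing it shows: because each match consumes the whitespace on both sides, a word that directly follows a matched word is skipped. The second pattern is an assumed variant (not from the thread) that uses lookarounds so the surrounding characters are checked but not consumed.

```php
<?php
// Hypothetical sample text, not from the thread.
$text = ' hello world, foo ';

// The pattern as posted: the leading \s and the trailing delimiter are
// both consumed by each match.
preg_match_all('/\s([a-z0-9]+)[.,;:\s]/i', $text, $consuming);

// Assumed variant with lookarounds: the neighbouring characters are
// tested but left in place, so adjacent words can each match.
preg_match_all('/(?<=\s)([a-z0-9]+)(?=[.,;:\s])/i', $text, $lookaround);

print_r($consuming[1]);   // hello, foo ("world" is skipped)
print_r($lookaround[1]);  // hello, world, foo
```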

it will find text inside the anchor open and close however. an easy way to avoid that is to do a three-stage replace where the first stage finds and protects the text inside the anchors:

Code:
<a href="path/to/resource">find and protect this text</a>

the second stage does the pattern above and the third stage reverses whatever you did to protect the text inside the anchor tags.
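The three stages could be sketched roughly like this. It is one possible reading of the idea, not tested against real content. The function name link_tag_outside_anchors() and the <!--protected-N--> placeholder are made up for illustration, the closures assume PHP 5.3 or later, and stage 2 reuses the lookahead pattern from earlier in the thread rather than the word pattern above.

```php
<?php
/**
 * Hypothetical helper: link the first occurrence of $tag that is outside
 * any existing <a>...</a> element and outside any other tag.
 * Assumes the HTML does not already contain <!--protected-N--> comments
 * and that $title_tag contains no "$" or "\" (preg_replace would treat
 * those as backreferences in the replacement string).
 */
function link_tag_outside_anchors($html, $tag, $title_tag)
{
    $protected = array();

    // Stage 1: swap each whole anchor element for a placeholder comment.
    $html = preg_replace_callback(
        '/<a\b[^>]*>.*?<\/a>/is',
        function ($m) use (&$protected) {
            $key = '<!--protected-' . count($protected) . '-->';
            $protected[$key] = $m[0];
            return $key;
        },
        $html
    );

    // Stage 2: replace the word. The lookahead skips anything inside
    // angle brackets, which also covers the placeholders from stage 1.
    $pattern = '/\b' . preg_quote($tag, '/') . '\b(?![^<]*>)/i';
    $link    = '<a href="/' . $title_tag . '">' . $title_tag . '</a>';
    $html    = preg_replace($pattern, $link, $html, 1);

    // Stage 3: put the original anchors back. The empty-array check is
    // there because older PHP versions return false from strtr() when
    // given an empty replacement array.
    return $protected ? strtr($html, $protected) : $html;
}

$html = 'hello there, <a href="/hello.html">say hello</a> and <img src="hello.png" alt="hello" />';
$out  = link_tag_outside_anchors($html, 'hello', 'hello_page');
echo $out, "\n";
```

Using an HTML comment as the placeholder is a design choice: since it is wrapped in angle brackets, the stage 2 pattern ignores it without any extra logic.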

cheers.

09-21-2009, 09:16 AM

[eluser]bugboy[/eluser]
yeah i think i get you

so it would go something like this.

1. find all links and protect their internal parts
2. find and replace specific words
3. return links back to normal

how would i use this code with the tag? i'm confused about how it fits together.

Code:
$pattern = '/\s([a-z0-9]+)[.,;:\s]/i';

Also, sometimes the tag to replace may be a couple of words. Does that throw any spanners into the works?

Also thanks for all your help on this. It's great.

09-21-2009, 09:39 AM

[eluser]sophistry[/eluser]
yes, tags with spaces make it a lot more complicated! ;-)

basically, it will be easier for you if you approach it in a multi-step way rather than trying to stuff everything into a giant regex.

another approach is to strip out the "a" and "img" tags, do the replacement and stuff the a and img tags back in where they were removed. this could get pretty devilish too.
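A rough sketch of that second approach, using preg_split() to cut the markup out, replace only in the text pieces, and join everything back together. The function name link_tags_in_text() and the array shape (tag => page name) are assumptions for illustration, and closures need PHP 5.3 or later. Sorting the tags longest-first is one way to let multi-word tags win over the single words they contain.

```php
<?php
/**
 * Hypothetical helper: $tags maps each word or phrase to its page name.
 * Links the first occurrence of each tag found in plain text, leaving
 * existing <a>...</a> elements and all other tags untouched.
 */
function link_tags_in_text($html, array $tags)
{
    $map   = array_change_key_case($tags, CASE_LOWER);
    $words = array_keys($map);

    // Longest phrases first, so "hello world" is tried before "hello".
    usort($words, function ($a, $b) { return strlen($b) - strlen($a); });
    $quoted  = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    // Split on whole anchors and on any other tag. With DELIM_CAPTURE the
    // result alternates: even indexes are text, odd indexes are markup.
    $parts = preg_split(
        '/(<a\b[^>]*>.*?<\/a>|<[^>]+>)/is',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    $seen = array();
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // markup piece: leave as-is
        }
        $parts[$i] = preg_replace_callback(
            $pattern,
            function ($m) use ($map, &$seen) {
                $key = strtolower($m[0]);
                if (isset($seen[$key])) {
                    return $m[0]; // only link the first occurrence
                }
                $seen[$key] = true;
                return '<a href="/' . $map[$key] . '">' . $map[$key] . '</a>';
            },
            $part
        );
    }

    return implode('', $parts);
}

$html = 'Say hello world or just hello. <a href="/hello">hello</a> <img alt="hello" />';
$out  = link_tags_in_text($html, array(
    'hello'       => 'hello_page',
    'hello world' => 'hello_world_page',
));
echo $out, "\n";
```

Doing all tags in a single pass also avoids one tag matching inside a link that an earlier replacement has just inserted.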

experiment and post your code and results.

09-21-2009, 09:43 AM

[eluser]bugboy[/eluser]
Yeah, that also makes sense.

So how would i use the pattern with the tag? i can't get my head around this.

Something like this?

Code:
$patterns[] = '/'.$tag.'\s([a-z0-9]+)[.,;:\s]/i';

09-21-2009, 10:17 AM

[eluser]sophistry[/eluser]
no. not like that. back to the regex tutorials! i like http://regular-expressions.info

sorry, in a rush and i sort of left a few things out! specifically, your tag list. the pattern i sent will find *every* word and allow you to handle it as a backreference in your replacement routine. but, of course, that's not exactly what you are looking for is it?

last year, stensi and i rewrote the word_censor() function. you can see the final version with documentation here:

http://ellislab.com/forums/viewthread/91437/P15/#464484

the last post shows something similar to what you are after... it lists a bunch of words and shows them being replaced with an anchor tag.

why don't you try putting that to work and see how it goes?