01.27.09

How to match and replace content between two html tags using regular expressions

Posted by ryan in php, regex


Here's the problem. For this blog, I want to be able to put code (inside pre tags) right inside of my articles, which are stored in the database. When I display the articles, I want to run the PHP function htmlentities() on everything between my pre tags so that all the code is properly escaped.
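As a quick illustration of what "properly escaped" means, here is a minimal sketch of htmlentities() applied to a single line of code. The sample string and the variable names are made up for this example, and the ENT_QUOTES flag and 'UTF-8' charset are passed explicitly only so that the result does not depend on the defaults of whichever PHP version you run.

```php
<?php
// Illustrative input: one line of code as it might sit inside a pre tag.
$raw = "<?php echo 'hi'; ?>";

// Flags and charset are spelled out so the result does not depend on
// version-specific defaults.
$escaped = htmlentities($raw, ENT_QUOTES, 'UTF-8');

// The angle brackets come back as &lt; and &gt;, so a browser shows the
// code as text instead of trying to interpret it as markup.
echo $escaped, "\n";
```

The browser then displays the escaped string as the original code, which is the behavior we want inside the pre tags.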

Here's our imaginary article source code (pulled from the database):

<h1>An article with PHP code</h1>

<pre>
  <?php echo 'hi'; ?>
  <br/>
  <?php echo 'oh hey there'; ?>
</pre>

<p>
  The above and below pre tags will be
  rendered as code on the screen
</p>

<pre>
  <?php echo 'hello for a second time'; ?>
  <br/>
  <?php echo 'yep, here we are again'; ?>
</pre>

<p>
  Thanks for reading!
</p>

To make the code (everything between pre tags) appear correctly when output, we need to run it through the htmlentities function. In other words, we need to isolate all content that comes between the <pre> and </pre> tags. Here's how we do it.

// $content holds your raw content

$content_processed = preg_replace_callback(
  '#\<pre\>(.+?)\<\/pre\>#s',
  create_function(
    '$matches',
    'return "<pre>".htmlentities($matches[1])."</pre>";'
  ),
  $content
);

Your $content_processed variable now holds the processed version of your article. That's it!

How it all Works

The above code runs everything between each pair of pre tags through the htmlentities function. This is just one way you may need to process your content. Let's look more closely at how this works.

The pattern that matches our pre tags and their content is:

#\<pre\>(.+?)\<\/pre\>#s

If you're relatively new to regular expressions, the two pound (#) symbols may look strange, but they're harmless. Perl-compatible regular expressions (the preg_ functions we're using here) must always start and end with a delimiter. The # symbol is used here, but / is probably the most common delimiter. Delimiters mark the start and end of the pattern you want to match. Any characters appearing after the closing delimiter are pattern modifiers, which change how the pattern behaves. In this case, the 's' after the final # delimiter makes the dot (.) match newline characters as well, so a match can span multiple lines. Without it, the content inside of our pre tags would all need to be on the same line to match.
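To see the delimiter and the 's' modifier at work, here is a small sketch. The $html sample and the variable names are invented for this example; the pattern is the one from the article.

```php
<?php
// Hypothetical sample: a pre block whose content spans several lines.
$html = "<pre>\n  line one\n  line two\n</pre>";

// Without the s modifier the dot refuses to match a newline, so the
// pattern cannot get from <pre> to </pre> and nothing matches.
$without_s = preg_match('#\<pre\>(.+?)\<\/pre\>#', $html);

// With the s modifier the dot matches newlines too.
$with_s = preg_match('#\<pre\>(.+?)\<\/pre\>#s', $html);

// Same pattern with / as the delimiter. The slash inside the closing
// tag has to be escaped here, which is one reason to prefer #.
$with_slash = preg_match('/\<pre\>(.+?)\<\/pre\>/s', $html);

var_dump($without_s); // int(0)
var_dump($with_s);    // int(1)
var_dump($with_slash); // int(1)
```

preg_match returns 1 when the pattern matches and 0 when it does not, so the first call is the one that fails.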

One more very important piece of our match is the (.+?) portion. Apart from the question mark (?), this is straightforward regex, which basically says to match 1 or more of any character. This is the portion of our pattern that captures the contents between our pre tags. The question mark (?) is very important. Normally, regular expression quantifiers are "greedy". This means that .+ grabs as much text as it can, so the match runs all the way to the LAST closing pre tag in your string. In this case, if you have multiple pre tags and leave out the question mark, it'll match the entire string between the first pre tag and the last pre tag:

#\<pre\>(.+)\<\/pre\>#s matches ALL of the following:
<pre>
  <?php echo 'hi'; ?>
  <br/>
  <?php echo 'oh hey there'; ?>
</pre>

<p>
  We should make one more code tag just to make sure
  we've got everything right:
</p>

<pre>
  <?php echo 'hello for a second time'; ?>
  <br/>
  <?php echo 'yep, here we are again'; ?>
</pre>
but #\<pre\>(.+?)\<\/pre\>#s matches the 2 following pieces individually
<pre>
  <?php echo 'hi'; ?>
  <br/>
  <?php echo 'oh hey there'; ?>
</pre>
AND
<pre>
  <?php echo 'hello for a second time'; ?>
  <br/>
  <?php echo 'yep, here we are again'; ?>
</pre>

Obviously, the second result is what we want because the first (without the '?') matches all of the text in between the 2 pre blocks in addition to the pre blocks themselves. Placing the '?' after the + tells the regex engine to match in a non-greedy fashion, meaning that it'll stop at the first closing pre tag it finds, not the last. In other words, if you neglect the '?', you'll match too much (greedy). The 's' modifier is a separate matter: it only lets the dot match across line breaks.
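The difference is easy to check for yourself. The following sketch runs a greedy and a non-greedy version of the pattern over the same input; the $html sample and the variable names are made up for this example.

```php
<?php
// Hypothetical sample with two pre blocks separated by a paragraph.
$html = "<pre>first</pre>\n<p>text</p>\n<pre>second</pre>";

// Greedy: .+ runs to the LAST closing pre tag, so both blocks and the
// paragraph between them come back as a single match.
preg_match_all('#\<pre\>(.+)\<\/pre\>#s', $html, $greedy);

// Non-greedy: .+? stops at the FIRST closing pre tag, so each block
// is matched on its own.
preg_match_all('#\<pre\>(.+?)\<\/pre\>#s', $html, $lazy);

echo count($greedy[1]), "\n"; // 1
echo count($lazy[1]), "\n";   // 2
print_r($lazy[1]);            // contains "first" and "second"
```

Both patterns keep the 's' modifier, so the only thing that changes between the two calls is the question mark.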

$content = preg_replace_callback(
  '#\<pre\>(.+?)\<\/pre\>#s',
  create_function(
    '$matches',
    'return "<pre>".htmlentities($matches[1])."</pre>";'
  ),
  $content
);

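The rest of the function is fairly simple. In order to process the code between our pre tags, we create a callback using create_function, which takes the argument list and the function body as two strings. preg_replace_callback calls it once per match, passing an array in which $matches[0] is the whole match and $matches[1] is the text captured by (.+?). The syntax is a little confusing, but the above code simply replaces each pre block with a pre block whose contents have been run through the htmlentities function.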
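One caveat for newer PHP versions: create_function was deprecated in PHP 7.2 and removed in PHP 8.0. An anonymous function (available since PHP 5.3) should do the same job. The following is a sketch of that variation rather than the code this site runs, and the sample $content string is invented for the example.

```php
<?php
// Sample input in the spirit of the article's example.
$content = "<h1>Title</h1>\n<pre>\n  <?php echo 'hi'; ?>\n</pre>";

$content_processed = preg_replace_callback(
  '#\<pre\>(.+?)\<\/pre\>#s',
  // Anonymous function in place of create_function. It receives the
  // match array and returns the replacement for the whole match.
  function ($matches) {
    return '<pre>' . htmlentities($matches[1]) . '</pre>';
  },
  $content
);

echo $content_processed, "\n";
```

Everything outside the pre block (the h1 here) passes through untouched, because preg_replace_callback only rewrites the parts of the string that the pattern matched.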

Summing Up

If you need a regular expression that will match the content in between two html tags, use the following:

  preg_match('#\<pre\>(.+?)\<\/pre\>#s', $html_content, $matches);
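Keep in mind that preg_match stops after the first match. If your content holds several blocks and you want all of them, preg_match_all is the function to reach for. Here is a short sketch with a made-up $html_content value:

```php
<?php
// Hypothetical sample with two pre blocks.
$html_content = "<pre>one</pre><p>gap</p><pre>two</pre>";

// preg_match returns as soon as it has found one match.
preg_match('#\<pre\>(.+?)\<\/pre\>#s', $html_content, $first);

// preg_match_all keeps going and collects every match.
preg_match_all('#\<pre\>(.+?)\<\/pre\>#s', $html_content, $all);

echo $first[1], "\n";               // one
echo implode(', ', $all[1]), "\n";  // one, two
```

In both cases index 1 holds what the (.+?) group captured; with preg_match_all it is an array with one entry per block.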
Posted by Sriram on 2009-04-10
Wonderful article ryan :)
Posted by Sriram on 2009-04-21
But how to parse if inner tags>


some contents


It will stop parsing as follows

some contents
Posted by Ryan on 2009-04-21
@Sriram - I think my comment box cut off some characters in your comment - I'll shoot you an email and see if we can work out your situation.

-Ryan
Posted by Yuras on 2010-02-07
Very nice and useful article:) Thanks
Posted by Roi on 2010-03-01
Great article.
How could I replace any text that is located between HTML tags?
For example, if I want to replace the word "php" with asp in the following text:
myphp best php website PhP!!! php and myphp or phpme - php!!
How can I create the following result:
myphp best asp website asp!!! aspand myphp or phpme - asp!!

?
thank you in advance,
Roi
Posted by Matt on 2010-04-18
This helped me out a bunch. Thanks!
Posted by bingo on 2010-07-09
Good stuff. very well-written article!
Posted by John on 2010-08-05
I am looking for something similar to what you have done here. I just couldn't figure it out. I am looking to search and replace text that is between html tags without having to identify the tags. Is that possible?
Posted by Ryan on 2010-08-05
Hey John-

Yes, you could do that with a regex that looks something like this:

#\<([^\>]+)\>(.+?)\<\/([^\>]+)\>#s

The problem you're going to run into is if you have any embedded html tags inside the tag you're trying to match. The above expression will stop at the first closing tag that it finds.
Posted by John on 2010-08-05
Cool! Now, I guess if this also voided any characters that were in quotes, then it actually might be solid. But, thats not possible is it?
Posted by kode on 2010-11-23
I use your regex on my web log but something appear to be wrong. Can you post functions for saving and editing also?
Posted by Frederik on 2010-12-21
Nice article, just what I was looking for! Thnx!
Posted by teerex on 2011-01-18
Thanks, Ryan!
Used your code for search-replace function of tag contents:
function tagreplace($content,$tag,$search,$replace){
  $content_processed = preg_replace_callback(
    '#\<'.$tag.'\>(.+?)\<\/'.$tag.'\>#s',
    create_function(
      '$matches',
      'return "<'.$tag.'>".str_replace("'.$search.'","'.$replace.'",$matches[1])."</'.$tag.'>";'
    ),
    $content
  );
  return $content_processed;
}
Posted by Maninder Dhiman on 2011-09-19
Error : Warning: preg_replace_callback() [function.preg-replace-callback]: No ending delimiter '#' found in

Please help i used your above code
Posted by Ryan on 2011-09-19
@Maninder

Check your regular expression - you're probably just missing the closing "#" at the end of it. All regular expressions must open and close with the same "delimiter". Like

#foo#

or

/foo/
Posted by Vincent on 2011-12-06
Thanks a lot for this simple piece of code. It helped me a lot!
Posted by George on 2012-04-08
Just wanted to say thank you.
Posted by Fergal Andrews on 2012-06-01
Thank you Ryan. A very helpful article.