09.18.07

Breaking a Simple CAPTCHA

Posted by ryan in php


Not being much of an artist, I find the GD library in PHP more naturally suited to deconstructing images than to constructing them. So when a client essentially asks me to break a simple CAPTCHA, I'm all but ready to start cracking.

Disclaimer: I realize that, for the obvious reasons of SPAMMING and similar activities, the breaking of CAPTCHAS is not at all a popular activity amongst the developing community. In fact, when I proudly told my friend that I had broken this simple CAPTCHA, his look was not one of disappointment, but rather betrayal. I do, therefore, want to say that this CAPTCHA was not broken for SPAMMING purposes, but rather to automate a process in a very specific industry. That being said, I cannot condone the breaking of CAPTCHAS for the purpose of spamming or performing any other illegal or immoral activity.

Still, breaking a CAPTCHA is actually quite fine (assuming it's as simple as this one). It's a bit of code breaking in the simplest sense, and it takes a little thought. Here's an example of the culprit CAPTCHA:

captcha_og.jpg

And here's the plan. As this is a basic CAPTCHA, I basically am going to try to separate each letter, and compare it with an "alphabet" of the CAPTCHA letters. The process looks a little like this:

  1. Remove the background shading on the image
  2. Paint the remaining pixels (the letters themselves) fully black
  3. Slice the image up into 4 pieces, one letter per piece
  4. Crop the whitespace out of the image

Additionally, we will need to create a library of the CAPTCHA "alphabet" so that we can compare our processed letters with our library. So, after step 4, we'll need to add a temporary step of outputting those images and assembling a library from them, so that we have an image file for each letter and number used by the CAPTCHA system. Finally, with our library intact, we can repeat the above process and add the permanent last step of matching the processed letter with our library to output the true letter.

Step 1: Removing the background shading

First, note that the background is not always shaded with the same colors. Sometimes the shade has more of a red tint, other times it's more or less gray. To remove those pixels, we want to grab a sample of the background, then eliminate all the pixels within a certain range of them. First, let's load up our image into the GD library.

	$img_loc = '/home/ryan/captcha_example_00.jpg';
	$img = imagecreatefromjpeg($img_loc);
	if (!$img)	die ('Unable to open image');

Now, since the background is pretty simple, and in any case it contrasts well with the actual letters, let's simply grab a few background pixels and find the "average" color value of the background for our CAPTCHA.

	/* Find a sample of the background color */
	$sample_colors = array();
	$sample_colors[] = imagecolorat($img, 0, 0);
	$sample_colors[] = imagecolorat($img, 1, 0);
	$sample_colors[] = imagecolorat($img, 0, 1);
	$sample_colors[] = imagecolorat($img, 1, 1);
	$red = 0;
	$green = 0;
	$blue = 0;
	foreach($sample_colors as $indx) {
		$red += ($indx >> 16) & 0xFF;
		$green += ($indx >> 8) & 0xFF;
		$blue += $indx & 0xFF;
	}
	$red = ($red/count($sample_colors));
	$green = ($green/count($sample_colors));
	$blue = ($blue/count($sample_colors));

Basically, we use the imagecolorat function to find the color value of the 4 pixels in the upper left corner. This is a pretty blunt attempt at getting the average background color, but this is also a pretty basic CAPTCHA, so it works out. The assumption is that no letter extends into those 4 upper-left pixels and that they more or less conform to the rest of the background. In this case, it seems to work out. For a truecolor image such as this JPEG, the imagecolorat function returns the color as a single packed integer, which I then chose to change into the traditional RGB numbers with which I feel more comfortable. The 3 lines with the 0xFF in them represent this extraction from the packed color value to RGB values. In the end we have our "average" background in terms of its red, green and blue value.
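
As a side note, the bit-shifting described above can be pulled into a small helper so it isn't repeated for every pixel. This is only a sketch: the names unpackRgb and averageRgb are mine, not part of the original code, and it assumes the packed values come from a truecolor image.

```php
<?php
// Hypothetical helper (not from the original post): split a packed
// truecolor value, as returned by imagecolorat(), into its channels.
function unpackRgb(int $color): array
{
    return [
        'red'   => ($color >> 16) & 0xFF, // bits 16-23
        'green' => ($color >> 8) & 0xFF,  // bits 8-15
        'blue'  => $color & 0xFF,         // bits 0-7
    ];
}

// Hypothetical helper: average a list of packed colors channel by channel.
function averageRgb(array $colors): array
{
    $sum = ['red' => 0, 'green' => 0, 'blue' => 0];
    foreach ($colors as $color) {
        foreach (unpackRgb($color) as $channel => $value) {
            $sum[$channel] += $value;
        }
    }
    $n = count($colors);
    return [
        'red'   => $sum['red'] / $n,
        'green' => $sum['green'] / $n,
        'blue'  => $sum['blue'] / $n,
    ];
}
```

With this in place, the sampling step would reduce to something like averageRgb($sample_colors).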

Next, let's go through and remove all of the pixels in our image that fall within some range of our average background color. In other words, if all three of the RGB values are within say, 30 points of our average, we consider it a background pixel and delete it. First, let's set our background remove level variable so that we can fool with it later.

	$bg_remove_level = 30; // represents the range around the average background level where pixels will be considered to be part of the background

Next, we'll do something that we'll eventually do over and over again. It's actually kind of neat really (for the completely un-artistic analytical thinkers). We iterate through the image pixel by pixel in order to determine if the pixel is background or part of a real letter.

	/* remove the background color from the image */
	$x_total = imagesx($img);
	$y_total = imagesy($img);
	for($i=0; $i< $x_total; $i++) {
		for($j=0; $j<$y_total; $j++) {
			$loc_color = imagecolorat($img, $i, $j);
			$loc_red = ($loc_color >> 16) & 0xFF;
			$loc_green = ($loc_color >> 8) & 0xFF;
			$loc_blue = $loc_color & 0xFF;
			if ((abs($loc_red-$red)<=$bg_remove_level || $loc_red>$red) && (abs($loc_green-$green)<=$bg_remove_level || $loc_green>$green) && (abs($loc_blue-$blue)<=$bg_remove_level || $loc_blue>$blue)) {
				/* color falls within background range */
				imagesetpixel($img, $i, $j, imagecolorallocate($img, 255, 255, 255));
			}
		}
	}

Now, I use the imagesx and imagesy functions to return the size of the image in both the x and y direction. I then iterate first in the x direction, and then in the y. Think of scanning the columns of the image, up and down, from left to right across the letter. Again, I use the imagecolorat function to return the color value at the current pixel and translate it to its RGB levels. Now, the next line is a little bit more complicated. We want to do 2 things. First, if the current pixel has red, green and blue values within our $bg_remove_level of our average background color, then we regard it as a background color. So, the line originally looks like this:

	if (abs($loc_red-$red)<=$bg_remove_level && abs($loc_green-$green)<=$bg_remove_level && abs($loc_blue-$blue)<=$bg_remove_level) {

BUT, this really doesn't cover the whole story. Since the actual CAPTCHA text is always darker than the background image, we automatically know that any pixels "lighter" in color should be regarded as background pixels (even if they fall outside of the $bg_remove_level range). So, when we add the appropriate $loc_red>$red (aka "lighter") to the if statement, it becomes the final:

	if ((abs($loc_red-$red)<=$bg_remove_level || $loc_red>$red) && (abs($loc_green-$green)<=$bg_remove_level || $loc_green>$green) && (abs($loc_blue-$blue)<=$bg_remove_level || $loc_blue>$blue)) {

Finally, if the current pixel is either lighter than the average background color OR darker but within the $bg_remove_level range, then we consider it to be a background pixel and want to "white it out". The imagesetpixel function sets the color of the given pixel in the given image. The only caveat is that you must use the imagecolorallocate function when inputting the color that you want the pixel to be (it returns a color identifier that is valid for your image). So, when you want to make a pixel white, you would use imagecolorallocate($img, 255, 255, 255).
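
To make that long if statement easier to read, the per-channel test can be factored out. The following is a minimal sketch rather than the code I actually ran: the function names and the array layout (keys red, green, blue) are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: one channel counts as background when it is
// lighter than the average, or within $level of the average.
function channelIsBackground(int $value, float $average, int $level): bool
{
    return $value > $average || abs($value - $average) <= $level;
}

// Hypothetical helper: a pixel is background only if all three
// channels pass the test above.
function isBackgroundPixel(array $pixel, array $average, int $level): bool
{
    foreach (['red', 'green', 'blue'] as $channel) {
        if (!channelIsBackground($pixel[$channel], $average[$channel], $level)) {
            return false;
        }
    }
    return true;
}
```

Note that the test is applied channel by channel, so a pixel with even one channel much darker than the average is kept as part of a letter.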

That's it for step 1, the results are below:

captcha_crappy.gif

But the images aren't completely clean: some background pixels aren't being removed. Not to worry, by playing around with our $bg_remove_level variable, we can turn up the intensity on our removal. The magic number for this application was more or less $bg_remove_level = 80; This value produced these very clean results:

captcha_white.gif

Step 2: Paint the letters black

Now being experts in iterating through the image and painting pixels, we can get through this next step quickly.

	/* Paint the remaining pixels perfectly black */
	for($i=0; $i< $x_total; $i++) {
		for($j=0; $j<$y_total; $j++) {
			$loc_color = imagecolorat($img, $i, $j);
			if ($loc_color != 16777215) {
				/* pixel is not white, so it is part of a letter */
				imagesetpixel($img, $i, $j, imagecolorallocate($img, 0, 0, 0));
			}
		}
	}

The only real trick is that this time, for simplicity, instead of translating my imagecolorat output to RGB levels, I used it directly. The value 16777215 represents the image index value for the color white. In other words, if the current pixel isn't white, set the pixel black. We've now not only removed the background, but painted each letter perfectly black.
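
If the number 16777215 looks arbitrary, it is just 255 for each of red, green and blue packed into one integer. Here is a small sketch that shows where it comes from. The packRgb function and the CAPTCHA_WHITE constant are names I made up for this example, and the comparison only holds for truecolor images (which is what imagecreatefromjpeg gives you).

```php
<?php
// Hypothetical helper: pack 8-bit channels into the integer layout
// that imagecolorat() uses for opaque pixels of a truecolor image.
function packRgb(int $red, int $green, int $blue): int
{
    return ($red << 16) | ($green << 8) | $blue;
}

// Hypothetical named constant for pure white: 0xFFFFFF is 16777215.
const CAPTCHA_WHITE = 0xFFFFFF;
```

Using a named constant in place of the bare number would make the loops in this step and the next a little easier to follow.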

captcha_white_blackened.gif

Step 3: Slice the image into 4 letters

Another advantage of this simple CAPTCHA is that the original CAPTCHA image is always 60 pixels wide and each letter falls somewhere within the appropriate 15px range. That is, the first letter is between the 1st and 15th pixel, the second between the 16th and 30th pixel and so on. So, all we need to do to separate the letters is slice the image into 4 equal parts.

	$img_1 = imagecreatetruecolor(15, 20);
	$img_2 = imagecreatetruecolor(15, 20);
	$img_3 = imagecreatetruecolor(15, 20);
	$img_4 = imagecreatetruecolor(15, 20);
	imagecopyresampled($img_1, $img, 0, 0, 0, 0, 15, 20, 15, 20);
	imagecopyresampled($img_2, $img, 0, 0, 15, 0, 15, 20, 15, 20);
	imagecopyresampled($img_3, $img, 0, 0, 30, 0, 15, 20, 15, 20);
	imagecopyresampled($img_4, $img, 0, 0, 45, 0, 15, 20, 15, 20);

First, we need to create 4 new image objects to hold our new pieces. We do that by using the imagecreatetruecolor function, which takes the parameters "x size" and "y size". For this CAPTCHA, each image will be 15 pixels wide and 20 pixels high (which is the full height of the original CAPTCHA). Once these objects have been created, we can move on to slicing the letters out of the original image and into the 4 new images. To do this, we use the imagecopyresampled function, which takes in the following parameters:

  • the destination image object where the new image will be sampled to
  • the original source image
  • the destination x coordinate, where the image will be sampled to
  • the destination y coordinate, where the image will be sampled to
  • the source x coordinate, where the image will be sourced from
  • the source y coordinate, where the image will be sourced from
  • the destination x size
  • the destination y size
  • the source x size
  • the source y size

Obviously, this function can do a lot more than we're going to use it for (i.e., resize images). We simply use the function to cut a piece out of the original image and paste it in the new image without resizing it.
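
Since no resizing is involved, the four imagecopyresampled calls above could also be written as a loop around GD's plainer imagecopy function. This is a sketch under the same assumption as the rest of this step (equal-width letter slots); the sliceLetters name is hypothetical.

```php
<?php
// Hypothetical helper: cut an image into $count equal-width vertical
// slices and return them as an array of new truecolor images.
function sliceLetters($img, int $count): array
{
    $width  = intdiv(imagesx($img), $count);
    $height = imagesy($img);
    $slices = [];
    for ($n = 0; $n < $count; $n++) {
        $slice = imagecreatetruecolor($width, $height);
        // imagecopy(dst, src, dst_x, dst_y, src_x, src_y, src_width, src_height)
        imagecopy($slice, $img, 0, 0, $n * $width, 0, $width, $height);
        $slices[] = $slice;
    }
    return $slices;
}
```

For the 60x20 image here, sliceLetters($img, 4) would return four 15x20 images in left-to-right order.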

E_nocrop.gif I_cropup.gif e_nocrop.gif Q_nocrop.gif

Step 4: Crop the whitespace out of the image

Because the letters can appear anywhere within their 15x20 block, we want to crop out the whitespace so that we're left simply and beautifully with just the CAPTCHA letter. Since we are now dealing with 4 images, and thus any process must be repeated for each, we create a function to do our cropping.

	private function autoCrop($img)
	{
	 	$x_start = 0;
		$x_end = imagesx($img);
		$y_start = 0;
		$y_end = imagesy($img);
		$x_set = FALSE;
		$y_set = FALSE;
		
		for($i=0; $i< imagesx($img); $i++) {
			for($j=0; $j<imagesy($img); $j++) {
				$loc_color = imagecolorat($img, $i, $j);
				if ($loc_color != 16777215 && !$x_set) {
					/* scanning horizontally, we've found a black pixel */
					$x_start = $i;
					$x_set = true;
				}
				if ($loc_color != 16777215) {
					/* scanning horizontally, we've found a black pixel */
					$x_end = $i+1;
				}
			}
		}
		for($i=0; $i< imagesy($img); $i++) {
			for($j=0; $j<imagesx($img); $j++) {
				$loc_color = imagecolorat($img, $j, $i);
				if ($loc_color != 16777215 && !$y_set) {
					/* scanning vertically, we've found a black pixel */
					$y_start = $i;
					$y_set = true;
				}
				if ($loc_color != 16777215) {
					/* scanning vertically, we've found a black pixel */
					$y_end = $i+1;
				}
			}
		}

		$img_tmp = imagecreatetruecolor($x_end-$x_start, $y_end-$y_start);
		imagecopyresampled($img_tmp, $img, 0, 0, $x_start, $y_start, $x_end-$x_start, $y_end-$y_start, $x_end-$x_start, $y_end-$y_start);
		return $img_tmp;
	}

This function is a bit confusing, and probably not the best way to do this, so I'll spare all the details. Essentially, it scans vertical lines in a horizontal direction, looking to see if there are any non-white pixels (letters). When it finds one, it knows that it can only crop in the x direction from the beginning of the image until that spot. It then continues its scan, marking each time the scan finds a black pixel. When it's done, it also knows the location of the last vertical line that contains a non-white pixel, so it knows that it can crop from the end of the image to that spot in the x direction. This process is then repeated to find the y-coordinates for cropping. Once we know how much we can crop, we again use the imagecopyresampled function to copy only part of the image (the "middle") to a new image.
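
For what it's worth, the same crop box can be found in a single pass by tracking the smallest and largest x and y at which a non-white pixel appears. The sketch below is an alternative to the function above, not what I actually used. The findLetterBounds name is hypothetical, and it assumes a truecolor image whose background is exactly white.

```php
<?php
// Hypothetical single-pass variant: returns [x_start, y_start, x_end, y_end]
// for the non-white area, with the end coordinates exclusive (as in autoCrop),
// or null when the image contains no non-white pixel at all.
function findLetterBounds($img, int $white = 0xFFFFFF): ?array
{
    $xMin = $yMin = PHP_INT_MAX;
    $xMax = $yMax = -1;
    for ($x = 0; $x < imagesx($img); $x++) {
        for ($y = 0; $y < imagesy($img); $y++) {
            if (imagecolorat($img, $x, $y) !== $white) {
                $xMin = min($xMin, $x);
                $xMax = max($xMax, $x);
                $yMin = min($yMin, $y);
                $yMax = max($yMax, $y);
            }
        }
    }
    if ($xMax < 0) {
        return null; // blank image, nothing to crop to
    }
    return [$xMin, $yMin, $xMax + 1, $yMax + 1];
}
```

The returned values could be fed straight into the imagecreatetruecolor and imagecopyresampled calls at the end of autoCrop. Returning null for a blank slice also avoids trying to create a zero-sized image.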

E_full.gif I_full.gif e_full.gif Q_full.gif

Temporary Step: Creating an image library

This step is tedious but important. Since our algorithm will give us individual letters, we need a library of these individual letters so that we can compare the 2 and output the true letter. In other words, if the above function outputs a picture of the letter "h", we want to have our own letter "h" from the CAPTCHA so that we can compare the 2 and determine that it is in fact the letter "h". The good news is that our code is ready and primed to start creating this library for us, at least in part.

		imagegif($img_1, '/home/ryan/captchas/1.gif');
		imagegif($img_2, '/home/ryan/captchas/2.gif');
		imagegif($img_3, '/home/ryan/captchas/3.gif');
		imagegif($img_4, '/home/ryan/captchas/4.gif');

This code outputs the cleaned up and separated CAPTCHA "letters" to gif files. We use the gif format here because we only have 2 colors in our images, black and white, which is much more appropriate for gif (especially in comparison with jpeg, which would blend the white background with the black letters and distort all our hard work). When you run your code now, you should get, in gif form, an output of 4 letters (hopefully 3-4 of which are unique). When I did this, I created a sub-folder called library and started renaming the images to whatever letter or number they truly hold and putting them in that directory. Note that this required refreshing the page with the CAPTCHA on it and downloading 40 or so of the original images. Then, one by one, I let the code sort through the lot, renamed the cropped letters/numbers, and eventually created a full library of letters/numbers that the CAPTCHA could output. So, if the code produces an image of the letter "n", we can compare it with our library and verify that it is in fact an "n".

Final Step: Comparing the CAPTCHA with the library

Finally, we can delete the above imagegif output code and replace it with a much more useful call to the new function match_character. In my code, it looks like this:

		/* Match each image with a real character */
		$char_1 = $this->match_character($img_1);
		$char_2 = $this->match_character($img_2);
		$char_3 = $this->match_character($img_3);
		$char_4 = $this->match_character($img_4);

The match_character function is where the real work is done. It looks like this:

	private function match_character($img)
	{
		$chars = array('0', '2', '3', '4', '5', '6', '7', '8', 'A', 'B', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'P', 'Q', 'S', 'T', 'U', 'V', 'Y', 'Z', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'm', 'n', 'q', 's', 't', 'u', 'v', 'x', 'y', 'z');
		$char_dir = '/home/ryan/captchas/letters/';
		$char_report = array();
		
		foreach($chars as $char) {
			$img_char_temp = imagecreatefromgif($char_dir.$char.'.gif');
			if (imagesx($img_char_temp) == imagesx($img) && imagesy($img_char_temp) == imagesy($img)) {
			
				$img_char = imagecreatetruecolor(imagesx($img_char_temp), imagesy($img_char_temp));
				/* resample gif image to have jpeg properties */
				for($i=0; $i< imagesx($img_char); $i++) {
					for($j=0; $j<imagesy($img_char); $j++) {
						if (imagecolorat($img_char_temp, $i, $j)==0) {
							imagesetpixel($img_char, $i, $j, imagecolorallocate($img_char, 0, 0, 0)); //black
						} else {
							imagesetpixel($img_char, $i, $j, imagecolorallocate($img_char, 255, 255, 255)); //white
						}
					}
				}

				$char_report[$char] = 0; // initialize color difference to zero
				for($i=0; $i< imagesx($img_char); $i++) {
					for($j=0; $j<imagesy($img_char); $j++) {
						if (imagecolorat($img, $i, $j) != imagecolorat($img_char, $i, $j)) {
							/* pixels do not match */
							$char_report[$char] += $this->colorDiff(imagecolorat($img, $i, $j), imagecolorat($img_char, $i, $j));
						}
					}
				}
			}
		}
			
		$best_char = '';
		$low_score = 10000;
			
		foreach($char_report as $char=>$score) {
			if ($score < $low_score) {
				$low_score = $score;
				$best_char = $char;
			}
		}
		return $best_char;
	}

Now, for the run through. In theory, it's quite simple: look through the library, find the most similar image, and output its real character value. What I first do is define what my full library is. Notice that my array $chars doesn't contain every character; some apparently were not used by the CAPTCHA generator. Next, I iterate through my $chars and load each image from my character library. Since the code does such a good job of removing the background, I know from trial and error that a letter will always have the exact same dimensions each time it comes through the code. Thus, the first thing I check for is image size: is the loaded image from the library the same size as the image that has been processed from the CAPTCHA? If it is, then we keep going.

Since we'll be comparing color values, I found it helpful to convert the library image (in gif format) to the jpeg color scheme (which matches that of the input jpeg format from the CAPTCHA). Basically, I iterate through the library image, and paint a new image pixel by pixel, painting each pixel either black or white. The approach is a little archaic, but it gets the job done.
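
On newer PHP versions (5.5 and up, so well after this code was written), GD has a built-in imagepalettetotruecolor function that may be able to replace the pixel-by-pixel repaint. The sketch below is an assumption about how that could look, not something I have run against this CAPTCHA. The toTruecolor wrapper name is mine. One difference to keep in mind: my loop treats palette index 0 as black and everything else as white, while this conversion keeps whatever colors the gif palette actually holds.

```php
<?php
// Hypothetical wrapper: convert a palette-based image (for example one
// loaded with imagecreatefromgif) to truecolor in place, so that
// imagecolorat() returns packed RGB values instead of palette indexes.
function toTruecolor($img)
{
    if (!imageistruecolor($img)) {
        imagepalettetotruecolor($img);
    }
    return $img;
}
```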

Next, I want to know how well a letter from the library and the letter from the CAPTCHA match. To do this, I decided to total up the differences in color from each pixel of the 2 images. Therefore, I went pixel by pixel, added the color difference up, and kept track of the result. In this case, I used a function that I wrote called colorDiff which basically averages the difference in the red, green and blue colors for each input pixel. In reality, this is unnecessary because both images contain only black and white pixels, so the difference is either 0 or 255.
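
The colorDiff function itself isn't listed in this post. Going only by the description above (the average of the red, green and blue differences between two pixels), a reconstruction might look like the following. Treat it as a sketch of that description rather than the original function. It is written as a standalone function here, whereas the code above calls it as a class method via $this->colorDiff.

```php
<?php
// Reconstruction from the description in the text, NOT the original code:
// average the absolute per-channel difference between two packed
// truecolor values (as returned by imagecolorat()).
function colorDiff(int $colorA, int $colorB): float
{
    $red   = abs((($colorA >> 16) & 0xFF) - (($colorB >> 16) & 0xFF));
    $green = abs((($colorA >> 8) & 0xFF) - (($colorB >> 8) & 0xFF));
    $blue  = abs(($colorA & 0xFF) - ($colorB & 0xFF));
    return ($red + $green + $blue) / 3;
}
```

For the pure black and white images used here, this returns 0 for matching pixels and 255 for mismatched ones, which agrees with the note above that the score effectively counts differing pixels.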

Finally, we should have an array of "color scores" that represents how similar the CAPTCHA letter is to all the letters in the library that share its exact dimensions. The lowest score obviously wins, and is returned by the function.

Conclusion

Breaking a CAPTCHA is more of an art than a science, especially if you're still a big GD library novice like I am. This is a very simple CAPTCHA and although this explanation is lengthy, the process of breaking it was straightforward. More difficult CAPTCHAS (especially those involving different fonts) would be much more difficult to break. In the world of letter recognition (the process of recognizing letters from written text), algorithms are used to measure the distance between the source letter (handwritten) and the example letter. Statistical techniques, like least squares, are then used to determine the best fit. While we do some of that here, this method cuts some corners due to the fact that there is only one font, and if everything works as it should, the output letter will match exactly with its partner in the library (meaning that my $char_report and colorDiff were unnecessary, because some letter will always score zero unless something went astray). Furthermore, the breaking of each CAPTCHA is unique, and this method would not work directly on ANY other generator. That being said, I found this exercise to be both educational and entertaining. It's code breaking for the common man.

Posted by Shaheer on 2010-06-18
You just wrote a captcha breaking script while you have a captcha implementation of your own on this website :D
Posted by errorisme on 2011-05-30
i just do like that, but my target captcha is randomly rotated. im confuse what next to do.. hmm..
Posted by Adriano C. de Moura on 2011-08-12
unable to work, had a problem with function colorDiff, you can send me all the code ?

thanks
Posted by matt on 2011-09-21
thanks so much for this! one problem at the end though you don't show us the colorDiff function.. Nothing works without it