
Finding (near) duplicate JPG images

Status
Not open for further replies.

kmomberg

Has anyone come across a program or link to VB source code (preferably with no DLL, OCX or temp files used) which can identify duplicate or near-duplicate image files? I have already searched the web but find these are mostly shareware programs which do not give me control over the images. I have numerous JPG files, either downloaded from the web or converted from a digital camera, which are taking up a large amount of disk space. Before I burn these to CD, I want to weed out any duplicates. If duplicates are found, I want to keep the image with the most colors or the higher resolution. I have downloaded Intel's IPP library but have not investigated deeply whether it is capable of easily performing this task.
 
I think you will find that this is a lot more difficult than you may think it is.
 
I have no illusion this is a simple undertaking, hence the search for source code to get me started. I originally started by just comparing files of the same size and then calculating a CRC on them. Of course, I readily found this ignores similar files (different resolutions, comments in the file, same resolution but different file sizes, annotations on one image but not the other, etc.). I have looked at the JPG standard but found it complicated enough in itself, with the quantization tables and different block types to consider. Having a shell to work from would allow me to focus on the actual code to handle what to do when duplicates are found.
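
For reference, the size-then-CRC check described above can be sketched as follows. This is a minimal illustration in Python rather than VB (chosen only so it is self-contained); the function names `crc32_of_file` and `find_exact_duplicates` are made up for this example. It only finds byte-identical files, and a matching CRC is a strong hint rather than proof, so a byte-by-byte compare would still be wise before deleting anything.

```python
import os
import zlib
from collections import defaultdict


def crc32_of_file(path, chunk_size=65536):
    """Return the CRC-32 of a file, read in chunks to keep memory use flat."""
    crc = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def find_exact_duplicates(paths):
    """Group paths whose file size and CRC-32 both match."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a byte-identical twin
        by_crc = defaultdict(list)
        for path in same_size:
            by_crc[crc32_of_file(path)].append(path)
        groups.extend(sorted(g) for g in by_crc.values() if len(g) > 1)
    return groups
```

As noted, this misses every near duplicate: a resized copy or a file with an extra comment block has different bytes and therefore a different CRC.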
 
Who cares about the JPEG standard and all that that entails? That's not the difficult bit. The difficult bit is the comparison of two images of potentially different resolutions and different colour depths and trying to decide whether they are similar enough to be considered duplicates. In other words, image recognition.
 

You can get a lot of this information from the Wang/Kodak controls. One way I can think of to do this: you will need a base reference or common ground to which you convert one or both of the images, and then you compare them, coming up with some type of, possibly weighted, algorithm.

Good Luck

 
It would seem as though I need to start by dissecting an existing JPG image using the JPG standard so that I know the dimensions of the image, the color depth, whether compression is used, etc., and can convert one image to the same characteristics as another in order to compare them. Certainly this would be much more effort than using a standard control as vb5prgrmr suggests, if the control is flexible enough to perform these manipulations on a large scale. The Wang/Kodak controls may be worth a look. Can you provide a link where I might find these with sample code?
 

With Win9x through Win2K they come with the MS OS (.NET, you're out of luck). With Win 95 there was a twain/samples folder (or something like that) with a sample program that showed how to use the controls. If you search this site for Wang/Kodak or imageedit you should find some decent examples. Unfortunately the Wang/Kodak imaging tools do not allow you to save files as jpg or jpeg, even though they allow you to use their compression standards (rename the file maybe???). You can also get some of the information you are looking for (width and height) with a picture box or an image control, although the Wang/Kodak controls will give you more info.

Good Luck

 
The problem with comparing images that are not exactly the same, in any form, is that you have to decide what your tolerance is. So if you have 2 images of different size and color depth, you could bring the larger image down to the smaller image's size with dithering. Then take the larger color depth and reduce it to the lower color depth. Then, IF they are both actually the same picture, you can start a scan and compare areas with some matrix calculations: compare groups of pixels on both images within the same area, and if they are within tolerance it's OK. Continue doing this for the whole image and eventually you'll sum up a tolerance difference between the 2 images. If this is within your tolerance then you can say they may be the same. This whole process gets A LOT harder if the 2 images are slightly different in position, such as when the smaller image is actually a crop of the larger image.
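
A rough sketch of that normalise-then-sum idea, in Python for brevity rather than VB. Everything here is an assumption made for illustration: images are plain lists of rows of (r, g, b) tuples, the downscale is a simple box average rather than true dithering, colour depth is reduced by dropping low bits, and the names `downscale`, `quantize`, `total_difference` and `probably_same` and the default tolerance are invented for the example.

```python
def downscale(pixels, new_w, new_h):
    """Box-average an image (list of rows of (r, g, b)) down to new_w x new_h."""
    old_h, old_w = len(pixels), len(pixels[0])
    out = []
    for y in range(new_h):
        y0 = y * old_h // new_h
        y1 = max(y0 + 1, (y + 1) * old_h // new_h)
        row = []
        for x in range(new_w):
            x0 = x * old_w // new_w
            x1 = max(x0 + 1, (x + 1) * old_w // new_w)
            block = [pixels[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(tuple(sum(p[c] for p in block) // len(block) for c in range(3)))
        out.append(row)
    return out


def quantize(pixels, bits=4):
    """Keep only the top `bits` bits per channel (a crude colour-depth reduction)."""
    shift = 8 - bits
    return [[tuple((v >> shift) << shift for v in p) for p in row] for row in pixels]


def total_difference(a, b):
    """Sum of absolute per-channel differences between two same-sized images."""
    return sum(abs(ca - cb)
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b)
               for ca, cb in zip(pa, pb))


def probably_same(a, b, tolerance_per_pixel=16, bits=4):
    """Bring both images to the smaller size and depth, then compare the summed difference."""
    w = min(len(a[0]), len(b[0]))
    h = min(len(a), len(b))
    na = quantize(downscale(a, w, h), bits)
    nb = quantize(downscale(b, w, h), bits)
    return total_difference(na, nb) <= tolerance_per_pixel * w * h
```

This does nothing about crops or shifted images, which is exactly the harder case mentioned above.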

What you are asking for is some pretty detailed image processing and number crunching... something that honestly I wouldn't try to do in VB. There must be better ways to find duplicates in your por... errr, image collection than writing your own processor.
 
>some type of, possibly weighted, algorithm

And this is where it gets difficult. All the rest - JPEG sources, different resolutions, different colour depths, etc can all be more-or-less eliminated through a simple LoadPicture.
 
The problem with LoadPicture is that, on a 32-bit display, loading a picture of a given color depth does not change its color depth. It still holds the same spectrum of colors. If it had only 256 colors (not many people use the greyscale capability in JPEG), then loading it into a 32-bit environment doesn't mean it now has 4 billion colors.

Comparisons using "some type of, possibly weighted, algorithm" cannot be implemented effectively at the file level; they need to be implemented at the level of clusters of pixels. File-level computations will only be effective for 2 images that are exactly the same.

We had a client that wanted to do some advanced processing on LANDSAT images, comparing multiple images taken at different times to automatically estimate cloud cover and determine which images were the best candidates for use. Given that the images are about 6000x6000 pixels with 7 bands, this was an interesting prospect that proved far too difficult and not worthwhile.

It's amazing how much we take for granted that we can do at the macro level. Look at character recognition, which is essentially an easier task than what is desired here.
 
You're quite correct, Semper. For an image comparison, I would first look to the edge detection algorithms and proceed from there by comparing the relative similarity of the edges. This is not by any means a simple task. Good Luck
--------------
As a circle of light increases so does the circumference of darkness around it. - Albert Einstein
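
To make the edge-based idea a little more concrete, here is one possible sketch in Python (not VB, and not necessarily what the poster had in mind). It assumes both images have already been brought to the same size and are stored as lists of rows of (r, g, b) tuples. It uses a Sobel gradient with a fixed threshold, which is only one of many edge detectors, and scores the two edge maps by their overlap; the names `to_gray`, `edge_map` and `edge_similarity` and the threshold value are illustrative assumptions.

```python
def to_gray(pixels):
    """Convert (r, g, b) rows to integer luminance using the usual 0.299/0.587/0.114 weights."""
    return [[(r * 299 + g * 587 + b * 114) // 1000 for r, g, b in row] for row in pixels]


def edge_map(gray, threshold=64):
    """Mark a pixel as an edge when the Sobel gradient (|gx| + |gy|) exceeds threshold.

    Border pixels are left as non-edges because the 3x3 window does not fit there.
    """
    h, w = len(gray), len(gray[0])
    edges = [[False] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            edges[y][x] = abs(gx) + abs(gy) > threshold
    return edges


def edge_similarity(a, b):
    """Overlap of edge pixels (intersection / union); 1.0 means identical edge maps."""
    ea, eb = edge_map(to_gray(a)), edge_map(to_gray(b))
    both = sum(1 for ra, rb in zip(ea, eb) for p, q in zip(ra, rb) if p and q)
    either = sum(1 for ra, rb in zip(ea, eb) for p, q in zip(ra, rb) if p or q)
    return 1.0 if either == 0 else both / either
```

Because the gradient is a difference of neighbouring pixels, adding a constant brightness offset to every pixel leaves the edge map unchanged, which is one reason edges are attractive here. A shifted or cropped copy would still score poorly with this naive overlap.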
 
CajunCenturion, that is an interesting thing that I didn't consider. I should have thought of it too. I always try to break down a problem in a natural way. Cats handle much of their image processing via edge detection. This cuts the problem down a lot... still a bit more than I'd try in VB though.
 
Semper, sure, but once we've loaded our 256 colour image and our 24-bit image we can work on the actual pixels - a red pixel being red, whatever the depth of the source.

It is only the final analysis of whether the two images can be considered "the same" that is the tricky bit. Here, for example, is a short piece of code that provides a graphical comparison of two source images that can be of different resolutions and different color depths just to illustrate:

Dim lWidth As Long
Dim lHeight As Long

Picture1.BackColor = RGB(0, 0, 0) ' Baseline colour

' Choose your own source images...
Image1 = LoadPicture("c:\jpegs\a.gif")
Image2 = LoadPicture("c:\jpegs\test2.jpg")

' Use the larger of the two images as the common size
lWidth = Image1.Width
lHeight = Image1.Height

If Image2.Width > Image1.Width Then lWidth = Image2.Width
If Image2.Height > Image1.Height Then lHeight = Image2.Height

Picture1.Width = lWidth
Picture1.Height = lHeight

' vbSrcInvert XORs the source with what is already in the picture box.
' The first paint onto black leaves Image1 unchanged; the second XORs Image2
' against it, so matching pixels end up black and differing pixels do not.
Picture1.PaintPicture Image1, 0, 0, lWidth, lHeight, , , , , vbSrcInvert
Picture1.PaintPicture Image2, 0, 0, lWidth, lHeight, , , , , vbSrcInvert
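
For anyone who wants to see the XOR step outside of VB, the same pixel logic can be sketched in a few lines of Python. This is only an illustration under assumptions: both images are taken to be the same size already (PaintPicture does the scaling in the VB version), they are stored as lists of rows of (r, g, b) tuples, and the names `xor_difference` and `count_differing_pixels` are made up here.

```python
def xor_difference(a, b):
    """XOR each channel of two same-sized images; (0, 0, 0) means the pixels match."""
    return [[tuple(ca ^ cb for ca, cb in zip(pa, pb))
             for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def count_differing_pixels(a, b):
    """Count pixels that are not black in the XOR image, i.e. pixels that differ."""
    return sum(1 for row in xor_difference(a, b) for p in row if p != (0, 0, 0))
```

Note that XOR is an exact test: two pixels that differ by a single level in one channel count as different, just as in the VB version.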
 
I did not imagine I would get such a lively discussion. Thank you all.

I was thinking along the same lines as SemperFiDownUnda in that I would convert both pictures to a common size then divide up each picture into a grid of an arbitrary size (say 100x100 pixels). Compare each region in the grid for a match, add up all of the hits vs. all of the misses then calculate a hit ratio as a percentage. The higher the hit ratio, the more similar the files. Adjust the grid size smaller to say 10x10 to increase the chance of a hit for that block in the grid. I originally tried to use the LoadPicture and scale it to the size of the other picture but for some reason which escapes me at the moment, this did not seem to produce the result I was looking for. I am still trying to find the Wang/Kodak control on my W2K machine. If it is not in the default install script, it may have to be installed manually.
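
The grid-and-hit-ratio idea might look something like the following. It is a sketch only, written in Python rather than VB so that it stands alone. It assumes the two images are already the same size and held as lists of rows of (r, g, b) tuples, it treats a cell as a "hit" when the average colours of the two cells agree within a tolerance, and the names `cell_mean` and `grid_hit_ratio` and the default values are invented for the example.

```python
def cell_mean(pixels, x0, y0, block):
    """Average (r, g, b) of the block x block cell whose top-left corner is (x0, y0)."""
    cell = [p for row in pixels[y0:y0 + block] for p in row[x0:x0 + block]]
    return tuple(sum(p[c] for p in cell) / len(cell) for c in range(3))


def grid_hit_ratio(a, b, block=10, tolerance=8):
    """Fraction of grid cells whose mean colours agree within `tolerance` per channel."""
    h, w = len(a), len(a[0])
    hits = cells = 0
    for y0 in range(0, h, block):
        for x0 in range(0, w, block):
            mean_a = cell_mean(a, x0, y0, block)
            mean_b = cell_mean(b, x0, y0, block)
            cells += 1
            if all(abs(u - v) <= tolerance for u, v in zip(mean_a, mean_b)):
                hits += 1
    return hits / cells
```

Comparing cell averages is just one way to define a hit; any per-cell test could be dropped in. What ratio counts as "the same picture" is still a judgment call.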
 
Frankly, I think that trying to use the Wang/Kodak controls will give you no more than the 10-line program I gave above, which does not worry about the particular format of the source images (e.g. you can compare gifs against jpegs), scales correctly, and happily deals with different colour depths. It produces a result that consists of ONLY the differences between the two images (marred only by the fact that any rescaling will introduce artifacts).

Here's a minor variant that results in an image which is a 'negative' of the differences between any two source images. All white pixels in the final image indicate that the pixels in both source images are the same, and all black pixels indicate where the source pixels differ. It would be pretty straightforward to do the simplistic subdivision analysis against this negative. But I'll reiterate that a proper analysis isn't easy. For example, say I have two images, an original and one that is modified only slightly in that it has had its brightness increased by, say, about 1%. The difference between these two images to the human eye will be zero, but the negative mask produced by the following code will be completely (or almost completely) black, indicating that every single pixel differs between the two images.

You'll need a form with two Image controls, a Picturebox, an ImageList control and a command button:
[tt]
Option Explicit

Private Declare Function DrawIconEx Lib "user32" (ByVal hdc As Long, ByVal xLeft As Long, ByVal yTop As Long, ByVal hIcon As Long, ByVal cxWidth As Long, ByVal cyWidth As Long, ByVal istepIfAniCur As Long, ByVal hbrFlickerFreeDraw As Long, ByVal diFlags As Long) As Long
Private Const DI_MASK = &H1

Private Sub Command1_Click()
Dim lWidth As Long
Dim lHeight As Long


Picture1.BackColor = RGB(0, 0, 0) ' Baseline colour
Picture1.AutoRedraw = True

' Choose your own source images...
Image1 = LoadPicture("c:\jpegs\a.bmp")
Image2 = LoadPicture("c:\jpegs\blob.bmp")

lWidth = Image1.Width
lHeight = Image1.Height

If Image2.Width > Image1.Width Then lWidth = Image2.Width
If Image2.Height > Image1.Height Then lHeight = Image2.Height

Picture1.Width = lWidth
Picture1.Height = lHeight

Picture1.PaintPicture Image1, 0, 0, lWidth, lHeight, , , , , vbSrcInvert
Picture1.PaintPicture Image2, 0, 0, lWidth, lHeight, , , , , vbSrcInvert

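' Let the ImageList build a mask from the XOR result: black (matching) pixels
' become the transparent part of the mask, everything else the opaque part
ImageList1.UseMaskColor = True
ImageList1.MaskColor = RGB(0, 0, 0) ' Match baseline so we can produce a negative mask
ImageList1.ListImages.Clear
ImageList1.ListImages.Add 1, , Picture1.Image

' Draw only the mask: white where the sources match, black where they differ
DrawIconEx Picture1.hdc, 0, 0, ImageList1.ListImages(1).ExtractIcon.Handle, 0, 0, 0, 0, DI_MASK

End Sub
[/tt]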
 
Your point about subtle variations, invisible to the naked eye, is exactly why I suggested edge detection.

However, if you really want to proceed down the pixel comparison approach, I would suggest that you take each pixel, break it down into its separate color components, then compare the color values, with a +/- factor to determine whether or not you have a match.

But as you and I both know, if you have two identical images, but let's say that one image is shifted 5 pixels to the left, then these two seemingly identical pictures will fail miserably with this type of pixel comparison approach. Good Luck
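
Both points can be seen in a small experiment. The Python sketch below is illustrative only: the names `pixels_match`, `match_ratio` and `shift_left`, the tolerance of 5 and the synthetic gradient image are all assumptions made for the example. It compares each colour component with a +/- factor, then tries the comparison on a slightly brightened copy and on a copy shifted 5 pixels to the left.

```python
def pixels_match(p, q, tolerance=5):
    """True when every colour component of p is within +/- tolerance of q."""
    return all(abs(u - v) <= tolerance for u, v in zip(p, q))


def match_ratio(a, b, tolerance=5):
    """Fraction of pixel positions at which two same-sized images match."""
    pairs = [(p, q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b)]
    return sum(pixels_match(p, q, tolerance) for p, q in pairs) / len(pairs)


def shift_left(pixels, n, fill=(0, 0, 0)):
    """Shift every row n pixels to the left, padding the right edge with `fill`."""
    return [row[n:] + [fill] * n for row in pixels]


# A synthetic 20x20 test image: a horizontal grey gradient.
base = [[(x * 12, x * 12, x * 12) for x in range(20)] for _ in range(20)]
brighter = [[tuple(v + 2 for v in p) for p in row] for row in base]
shifted = shift_left(base, 5)

print(match_ratio(base, brighter))               # tolerant compare, brightened copy
print(match_ratio(base, brighter, tolerance=0))  # exact compare, brightened copy
print(match_ratio(base, shifted))                # tolerant compare, shifted copy
```

With this particular gradient the brightened copy matches everywhere under the tolerance and nowhere under an exact compare, while the shifted copy matches nowhere at all. Real photographs will be less clear-cut, but the trend is the same.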
 
Quite. And I'm definitely not suggesting a pixel comparison approach. On the contrary, I'm trying to demonstrate with code that that approach only works in the very, very simplest of scenarios. As I said right at the beginning of this thread (and as you and Semper have reinforced), this exercise is a lot harder than it superficially appears.
 
It sounds more and more like the pixel comparison is not the way to go. CajunCenturion: Can you briefly describe the theory behind edge detection?
 