Tek-Tips is the largest IT community on the Internet today!


Finding (near) duplicate JPG images

Status
Not open for further replies.

kmomberg

MIS
Jul 19, 2002
18
0
0
US
Has anyone come across a program, or a link to VB source code (preferably with no DLL, OCX, or temp files used), that can identify duplicate or near-duplicate image files? I have already searched the web, but what I find is mostly shareware that does not give me control over the images. I have numerous JPG files, either downloaded from the web or converted from a digital camera, which are taking up a large amount of disk space. Before I burn these to CD, I want to weed out any duplicates. Where duplicates are found, I want to keep the image with the most colors or the higher resolution. I have downloaded Intel's IPP library but have not yet investigated whether it can easily perform this task.
 
Have I opened a can of worms? :-D

The basic theory behind edge detection is that you look for a substantial change in color as you move from one pixel to the next, and then link these changes together. Think of it as trying to draw an outline of the image, much as you would have in a coloring book before the image is colored. You are trying to define the edges, or the various shapes that make up the image.

What defines "substantial" is subjective, and I would suggest making it a parameter: if the threshold is too fine, you get too many edges to be useful, and if it is not fine enough, you won't get enough edges.

Due to the proprietary nature of the code, I cannot post any of it, but to get started, try to implement the following algorithm.

Firstly, convert the image to BMP format, where you can address each pixel individually. Get the dimensions of the image, then

for each line of the image
    read the first pixel
    split into individual RGB components
    set to current R, G, and B values
    for the rest of the pixels on that line
        read that pixel
        split into RGB components
        compare to the current color values
        if any color change > some threshold
            write black pixel to file
        else
            write white pixel to file
        end if
    loop
loop
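
For the first step (getting at each pixel individually and reading the image dimensions), here is a minimal Python sketch. It assumes the JPG has already been saved as an uncompressed 24-bit BMP by some other tool; the function name read_bmp_24 is made up for illustration, and other BMP variants (8-bit, 32-bit, compressed) are not handled.

```python
import struct

def read_bmp_24(data: bytes):
    """Parse an uncompressed 24-bit BMP; return (width, height, rows).

    rows[y][x] is an (r, g, b) tuple, with y = 0 at the top of the image.
    """
    if data[:2] != b"BM":
        raise ValueError("not a BMP file")
    pixel_offset = struct.unpack_from("<I", data, 10)[0]
    width, height = struct.unpack_from("<ii", data, 18)
    bits_per_pixel = struct.unpack_from("<H", data, 28)[0]
    compression = struct.unpack_from("<I", data, 30)[0]
    if bits_per_pixel != 24 or compression != 0:
        raise ValueError("only uncompressed 24-bit BMPs are handled here")

    top_down = height < 0            # negative height: rows stored top to bottom
    height = abs(height)
    stride = (width * 3 + 3) & ~3    # each row is padded to a multiple of 4 bytes

    rows = []
    for y in range(height):
        start = pixel_offset + y * stride
        line = data[start:start + width * 3]
        # BMP stores each pixel as blue, green, red
        rows.append([(line[i + 2], line[i + 1], line[i])
                     for i in range(0, width * 3, 3)])
    if not top_down:
        rows.reverse()               # bottom-up storage: flip so rows[0] is the top line
    return width, height, rows
```

Typical use would be to read the whole file in binary mode and pass the bytes in, e.g. read_bmp_24(open(path, "rb").read()), where path is whatever BMP you produced.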

I'm trying to give you a general idea of a fairly simple approach. You'll need to tweak your threshold value to get the fineness of your detected edges, but what you should end up with is a black/white representative image.

Once you're satisfied with that, you can move on to the next stage of comparing the two B/W shape representations.

Now, there are more sophisticated edge detection filters, but they (at least the couple that I'm somewhat familiar with) require calculus, specifically finding the min and max values of the 1st derivative, or dealing with 2nd derivative gradients. If you need to go in this direction, we'll deal with that later on.
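
The post does not say how the two B/W images should be compared, so the following Python sketch is only one possible scoring, not the poster's method. It assumes both edge maps have the same dimensions and use 0 for black (edge) and 255 for white, and it scores the overlap as shared edge pixels divided by edge pixels in either map. The name edge_similarity is hypothetical.

```python
def edge_similarity(map_a, map_b):
    """Return a score from 0.0 to 1.0 for two equally sized black/white edge maps.

    Assumed convention: 0 means black (edge), anything else means no edge.
    Score = edge pixels present in both maps / edge pixels present in either.
    """
    if len(map_a) != len(map_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(map_a, map_b)
    ):
        raise ValueError("edge maps must have the same dimensions")

    shared = 0
    either = 0
    for row_a, row_b in zip(map_a, map_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            edge_a = pixel_a == 0
            edge_b = pixel_b == 0
            if edge_a and edge_b:
                shared += 1
            if edge_a or edge_b:
                either += 1
    # Two maps with no edges at all are treated as identical.
    return shared / either if either else 1.0
```

A cutoff for "near duplicate" (say, a score above some value you pick by experiment) would be another tunable parameter, and images of different sizes would first need to be scaled to a common size.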
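
The post does not name a particular filter. As one widely used first-derivative example (my choice, not necessarily what the poster had in mind), here is a Python sketch of the Sobel operator on a grid of brightness values. Leaving the one-pixel border white and the name sobel_edges are assumptions made for illustration.

```python
import math

def sobel_edges(gray, threshold):
    """gray: rows of 0-255 brightness values.

    Returns a grid of 0 (black, edge) and 255 (white, no edge).
    Border pixels are left white because the 3x3 window does not fit there.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    out = [[255] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal derivative: right column minus left column
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            # Vertical derivative: bottom row minus top row
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            if math.hypot(gx, gy) > threshold:
                out[y][x] = 0
    return out
```

Unlike a left-to-right scan, this looks at changes in both directions at once, so horizontal edges are picked up as well as vertical ones.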

It's also quite possible that you might be able to find the source code for an edge filter by doing a Google search. Good Luck
--------------
As a circle of light increases so does the circumference of darkness around it. - Albert Einstein
 

strongm, yours, because the RGB values of the two pictures (the same picture at different color depths) will differ even if the eye cannot tell the difference.

After deleting a long dissertation on this subject yesterday (not posting it), I can see that everyone makes some good points and is suggesting various weighted algorithms, but let me throw out the short version of what I was going to say.

Pattern matching: look for very similar blocks of x by y size, scaled to each picture's size (roving, so as not to simply say area1width = width / 10...). Yes, this would need to be weighted (averaged, given programmed-in tolerances, however you want to say it), along with relative location zoning. It is similar to facial or thumbprint image recognition, with a bit more leeway.

Now, one of the easiest ways to tell whether two pictures are exactly the same (one picture copied to another location) is the MD5 message digest algorithm. I know that you are looking for similar pictures, but you could use this as a first pass to delete the absolute duplicates.
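
To make the block idea concrete, here is a deliberately simplified Python sketch. It uses a fixed grid scaled to each picture's size rather than the roving blocks described, so treat it as a starting point only. The names block_averages and block_match_ratio, the grid size, and the tolerance value are all assumptions for illustration.

```python
def block_averages(rows, grid=4):
    """Average (r, g, b) of each cell in a grid x grid partition of the image.

    rows[y][x] is an (r, g, b) tuple. Cells scale with the image, so two
    pictures of different sizes yield the same number of cells.
    """
    height = len(rows)
    width = len(rows[0]) if height else 0
    if height < grid or width < grid:
        raise ValueError("image is smaller than the grid")
    cells = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            block = [rows[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            count = len(block)
            cells.append(tuple(sum(p[c] for p in block) / count for c in range(3)))
    return cells

def block_match_ratio(rows_a, rows_b, grid=4, tolerance=16):
    """Fraction of grid cells whose average color differs by <= tolerance per channel."""
    cells_a = block_averages(rows_a, grid)
    cells_b = block_averages(rows_b, grid)
    close = sum(
        all(abs(ca[c] - cb[c]) <= tolerance for c in range(3))
        for ca, cb in zip(cells_a, cells_b)
    )
    return close / len(cells_a)
```

Because cells are compared position by position, this already includes a crude form of location zoning; the tolerance plays the role of the leeway mentioned above.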
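
A first pass along these lines might look like the following Python sketch, using the standard hashlib module. The function names and the choice to look only at .jpg/.jpeg files are assumptions for illustration. Note that this only finds byte-for-byte copies; a re-saved or resized picture will hash differently.

```python
import hashlib
import os
from collections import defaultdict

def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_exact_duplicates(folder):
    """Group .jpg/.jpeg files under folder by digest; return groups of 2 or more."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            if name.lower().endswith((".jpg", ".jpeg")):
                path = os.path.join(dirpath, name)
                by_digest[md5_of_file(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists files with identical contents, so all but one file in a group can be dropped before moving on to the near-duplicate comparison.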

Good Luck

 
Upon reading over the algorithm again, after the first cup of coffee this morning, I see where I left out an important item.

for each line of the image
    read the first pixel
    split into individual RGB components
    set to current R, G, and B values
    for the rest of the pixels on that line
        read that pixel
        split into RGB components
        compare to the current color values
        if any color change > some threshold
            write black pixel to file
        else
            write white pixel to file
        end if
        reset the color values
    loop
loop
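
As a rough Python sketch of this scan: the version below takes pixels as (r, g, b) tuples already in memory and returns a grid instead of writing to a file. Marking the first pixel of each line white, resetting to the pixel just read, and the name detect_edges are assumptions made for illustration.

```python
def detect_edges(rows, threshold):
    """Scan each line left to right and mark substantial color changes.

    rows[y][x] is an (r, g, b) tuple. Returns a grid of 0 (black, edge)
    and 255 (white, no edge) with the same dimensions as the input.
    """
    result = []
    for row in rows:
        if not row:
            result.append([])
            continue
        current = row[0]
        # Assumption: the first pixel has nothing to compare against, so it is white.
        out_row = [255]
        for pixel in row[1:]:
            changed = any(abs(new - old) > threshold
                          for new, old in zip(pixel, current))
            out_row.append(0 if changed else 255)
            current = pixel   # reset the color values to the ones just read
        result.append(out_row)
    return result
```

For example, with a threshold of 30 the line (10,10,10), (12,10,10), (200,10,10), (200,10,10) gives white, white, black, white: only the jump in red from 12 to 200 counts as an edge.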

Resetting the color values is also a somewhat subjective process. The simplest approach is to simply reset the color values to the ones just read. Another approach is to average the color values of the last 2, 3, 4, or some other number of pixels. Averaging helps reduce the impact of noise in the picture and helps to find edges that are more gradual in nature. Good Luck
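
The averaging variant could be sketched in Python as below, for a single line of (r, g, b) pixels. The window parameter (how many previous pixels to average), the white first pixel, and the name detect_edges_averaged are assumptions; a window of 1 behaves like resetting to the pixel just read.

```python
from collections import deque

def detect_edges_averaged(row, threshold, window=3):
    """Compare each pixel to the mean of the previous `window` pixels on the line.

    Returns a list of 0 (black, edge) and 255 (white, no edge) values.
    """
    if not row:
        return []
    recent = deque([row[0]], maxlen=window)   # oldest pixels drop off automatically
    out = [255]                               # assumption: first pixel is white
    for pixel in row[1:]:
        count = len(recent)
        average = [sum(p[c] for p in recent) / count for c in range(3)]
        changed = any(abs(pixel[c] - average[c]) > threshold for c in range(3))
        out.append(0 if changed else 255)
        recent.append(pixel)                  # reset: fold this pixel into the window
    return out
```

On a gray ramp of 0, 20, 40, 60, 80 with a threshold of 25, a window of 1 finds no edge because each step is only 20, while a window of 3 marks the last three pixels, since the running average lags behind the ramp. On a line of 100, 100, 100, 130, 100, a window of 1 marks both the spike and the drop back, while a window of 3 marks only the spike.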