How to determine file type of a file, e.g. test.doc

StuckInTheMiddle

Programmer
Mar 3, 2002
269
US
Hi Guys,

Did a forum search and couldn't find precisely what I needed, so I'm hoping someone can help me out.

I would like to check/validate that the files my users submit to my ASP.NET website are valid MS Office documents.

Currently I am checking the file extension of the filename, which is easy enough, and I'm also making sure the file size is not something silly, but now my boss wants us to verify that a file called test.doc really is a Word document (and test.docx, I guess, for Office 2007).

I couldn't find anything on the FileInfo object that would let me get to the file type of the file.

I did see an old API example to do this but was hoping to avoid API calls if possible. Any help appreciated.

"If you can stay calm, while all around you is chaos...then you probably haven't completely understood the seriousness of the situation."
 
You could use System.Text.Encoding.ASCII.GetString to get the first x number of bytes from the file which may give an indication of what it is. I don't think you'll get any 100% foolproof method though.
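
As a rough sketch of this suggestion, the helper below reads the first few bytes of a file and renders them as text with System.Text.Encoding.ASCII.GetString. The names HeaderPeek, ReadHeader and ToAsciiPreview are made up for illustration, and nothing here is specific to Word; it only makes the leading bytes visible so they can be compared with what a genuine file starts with.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the .NET Framework.
public static class HeaderPeek
{
    // Reads up to 'count' bytes from the start of the file.
    // Returns fewer bytes when the file is shorter than 'count'.
    public static byte[] ReadHeader(string path, int count)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            byte[] buffer = new byte[count];
            int total = 0;
            int read;
            while (total < count && (read = fs.Read(buffer, total, count - total)) > 0)
            {
                total += read;
            }
            Array.Resize(ref buffer, total);
            return buffer;
        }
    }

    // Decodes the bytes as ASCII. Control characters are shown as '.';
    // bytes above 0x7F come back from the ASCII decoder as '?'.
    public static string ToAsciiPreview(byte[] bytes)
    {
        string raw = Encoding.ASCII.GetString(bytes);
        StringBuilder sb = new StringBuilder(raw.Length);
        foreach (char c in raw)
        {
            sb.Append(c < ' ' || c > '~' ? '.' : c);
        }
        return sb.ToString();
    }
}
```

Calling HeaderPeek.ToAsciiPreview(HeaderPeek.ReadHeader(path, 8)) on an uploaded file gives a short, printable view of its first bytes. A binary .doc file will mostly show dots and question marks here, which is part of why this approach is only an indication.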


-------------------------------------------------------

Mark,
http://aspnetlibrary.com
http://mdssolutions.co.uk - Delivering professional ASP.NET solutions
http://weblogs.asp.net/marksmith
 
Thanks ca8msm.

I was hoping there'd be something that would tell me 'Word.Document' or 'Microsoft Word Doc', as I can see these strings clearly in the file (at the end) when it is opened in a text editor. Your suggestion would at least allow me to check, but I agree it doesn't sound very foolproof.

thanks again, tho

 
I've decided to just use the API call to 'SHGetFileInfo' in shell32.dll. I figured that's much better than trying to parse through a string of binary data from the Word doc.

A,

 
It returns the text string 'Microsoft Word Document' which I see embedded in .doc files. Not sure if that's the MIME type, but it's good enough for my needs for now.

 
Would you mind sharing the code you have to make the API call? I have never done anything like that, and I am curious. Thanks
 
Sorry guys, been on vacation and didn't see the post. Will hunt down the code for you from work tomorrow and post then.

 
File extensions are only meaningful to Windows. Linux and Macs do not require file extensions; instead, the file headers contain information about which program the file relates to.

Maybe Linux/Mac files are not an issue in this scenario, but it's worth noting.

Either way, the most accurate method is the API calls.

Jason Meckley
Programmer
Specialty Bakers, Inc.
 


Here's the code


Code:
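            // Ask the shell for the type name registered for this file.
            SHFILEINFO shinfo = new SHFILEINFO();
            IntPtr result = Win32.SHGetFileInfo(myfilepath, 0, ref shinfo,
                (uint)Marshal.SizeOf(shinfo), Win32.SHGFI_TYPENAME);
            // With SHGFI_TYPENAME, a zero return value means the call failed.
            string s = result != IntPtr.Zero ? shinfo.szTypeName.Trim() : string.Empty;
            textBox2.Text = s;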

You need to put the declarations below in a namespace called 'GetFileTypeAndDescription' and add 'using GetFileTypeAndDescription;' (plus 'using System.Runtime.InteropServices;' for Marshal) to the page using the above code.

The code for the GetFileTypeAndDescription namespace is as follows


Code:
using System;
using System.Runtime.InteropServices;

namespace GetFileTypeAndDescription
{
    // Managed mirror of the Win32 SHFILEINFO structure.
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    public struct SHFILEINFO
    {
        public IntPtr hIcon;          // HICON: icon handle
        public int iIcon;             // int: image list index (not pointer-sized)
        public uint dwAttributes;     // DWORD: attribute flags
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string szDisplayName;  // TCHAR[MAX_PATH]
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 80)]
        public string szTypeName;     // TCHAR[80]
    }

    public static class Win32
    {
        public const uint SHGFI_ICON = 0x100;
        public const uint SHGFI_DISPLAYNAME = 0x200;
        public const uint SHGFI_TYPENAME = 0x400;
        public const uint SHGFI_LARGEICON = 0x0; // large icon
        public const uint SHGFI_SMALLICON = 0x1; // small icon

        // CharSet must match the struct above so the string buffers line up.
        [DllImport("shell32.dll", CharSet = CharSet.Auto)]
        public static extern IntPtr SHGetFileInfo(string pszPath, uint dwFileAttributes,
            ref SHFILEINFO psfi, uint cbSizeFileInfo, uint uFlags);
    }
}

This works great for what I wanted. Using the API call on a Word document returns 'Microsoft Word Document', which allows my application to verify that what the user uploaded was genuinely a Word doc. It will work on other Office file types, and I'm assuming other files too, although I haven't attempted to do so. This is much better than just checking the extension of the file, which could be wrong. I don't recall the original source of this code.

A,




 
Just thought I'd follow up this post with a WARNING: the above API call does not check the contents of the file; it merely tells you the file type associated with the extension.

We discovered this during our unit testing. We gave the API call a GIF file renamed as a Word doc, and the above incorrectly verified it as a Word doc. Not good at all.

So if anyone actually knows a way to really tell the file type, please do share. I'm continuing my efforts to find something and will share what I find.
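
One way to catch the renamed-GIF case is to compare the start of the file with the bytes that genuine files of that type begin with. The sketch below is an illustration under stated assumptions rather than a complete validator: the class and method names are hypothetical, and it relies only on the fact that binary .doc files are OLE compound files beginning with D0 CF 11 E0 A1 B1 1A E1, while .docx files are ZIP packages beginning with 'PK' 03 04. Those signatures are shared with other formats (.xls and .ppt, or any ZIP archive), so a match means "plausible", not "proven".

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only.
public static class WordSignatureCheck
{
    // First 8 bytes of an OLE compound file (also used by .xls and .ppt).
    static readonly byte[] Ole2 = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Local file header signature at the start of a non-empty ZIP archive.
    static readonly byte[] Zip = { 0x50, 0x4B, 0x03, 0x04 };

    // True when 'data' begins with every byte of 'signature'.
    static bool StartsWith(byte[] data, byte[] signature)
    {
        if (data.Length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }

    // Checks that the leading bytes are consistent with the extension
    // of 'fileName'. Extensions other than .doc and .docx are rejected.
    public static bool ExtensionMatchesContent(string fileName, byte[] header)
    {
        string ext = Path.GetExtension(fileName).ToLowerInvariant();
        switch (ext)
        {
            case ".doc": return StartsWith(header, Ole2);
            case ".docx": return StartsWith(header, Zip);
            default: return false;
        }
    }
}
```

In an upload handler, the first 8 bytes of the submitted file would be read and passed in together with the submitted file name. A GIF renamed to test.doc starts with the ASCII text "GIF8", so it fails the .doc check even though its extension looks right.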

 
File type in a Windows context is essentially based on the extension of the filename and the application that extension is associated with. This results in the Windows shell knowing how to "describe" the file in Explorer, and which application to invoke when a file of that type is, say, double-clicked. As has been observed in this thread, this gives no guarantee about what "real" type the file actually is.

As far as I can tell, the only way to convincingly know the real type of a file is to read the file contents. However, without a complete set of internal file formats or signatures to verify the contents against, you cannot be certain of the real type. At this stage, Windows does not provide any facility for enumerating file signatures or internal formats that could be used for this.
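
To make the point about signatures concrete, here is a minimal lookup sketch. The table is deliberately tiny and hand-written (an assumption for illustration, not something Windows supplies), the names SignatureTable and Describe are hypothetical, and anything not in the table comes back as null, which is exactly the limitation described above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lookup for illustration only.
public static class SignatureTable
{
    // A few well-known leading byte sequences; a real list would be far longer.
    static readonly KeyValuePair<string, byte[]>[] Known =
    {
        new KeyValuePair<string, byte[]>("OLE compound file (.doc, .xls, .ppt)",
            new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }),
        new KeyValuePair<string, byte[]>("ZIP archive (.docx, .xlsx, .zip)",
            new byte[] { 0x50, 0x4B, 0x03, 0x04 }),
        new KeyValuePair<string, byte[]>("GIF image",
            new byte[] { 0x47, 0x49, 0x46, 0x38 }),
        new KeyValuePair<string, byte[]>("PNG image",
            new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }),
        new KeyValuePair<string, byte[]>("PDF document",
            new byte[] { 0x25, 0x50, 0x44, 0x46 })
    };

    // Returns the description of the first matching signature,
    // or null when the header matches nothing in the table.
    public static string Describe(byte[] header)
    {
        foreach (KeyValuePair<string, byte[]> entry in Known)
        {
            byte[] sig = entry.Value;
            if (header.Length < sig.Length) continue;
            bool match = true;
            for (int i = 0; i < sig.Length && match; i++)
            {
                match = header[i] == sig[i];
            }
            if (match) return entry.Key;
        }
        return null;
    }
}
```

Even a match is only as specific as the signature itself: an OLE compound file could be a Word, Excel or PowerPoint document, and telling them apart means reading further into the file.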
 