Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

How to import plain text from Word doc to SQL 1

Status
Not open for further replies.

bmacbmac

IS-IT--Management
Jan 26, 2006
392
US
Hello... I have about 25,000 Word Docs where I would like to import just the text data from each file. Reading other forums, I have tried:

Code:
create table Documents
(DocName nvarchar(40), 
DocContent Varbinary(max))


INSERT INTO Documents
SELECT N'D:\Brian\CW\Impact_Update_02162022\AllLGLToProcess\40273576.doc', BulkColumn
FROM OPENROWSET(BULK  N'D:\Brian\CW\Impact_Update_02162022\AllLGLToProcess\40273576.doc', SINGLE_BLOB) blob
But my result is just a long binary string.

I tried extracting text from the binary:

Code:
select convert(varchar(max), convert(varbinary(max),doccontent)) FROM documents

But I just end up with this:
ÐÏࡱá


Any idea how I can import the plain text from each of these Word docs into an SQL table?

 
I would just write a simple loop in VBA:
[ul]
[li]Start Word[/li]
[li]Open Word Document[/li]
[li]Save As a text file[/li]
[li]Close Word Doc[/li]
[li]Repeat above 3 steps 25,000 times[/li]
[li]Close Word[/li]
[/ul]You would end up with 25,000 text files [wiggle]

---- Andy

"Hmm...they have the internet on computers now"--Homer Simpson
 
I can't suppress the imagination someone ordered data in doc files and the programmer exporting that data thought:
"Why doc files? Okay, I'll program a VBA macro to convert all the 250000 CSV text files into docs, they'll have their reasons."

Your code example points out the goal to store the files themselves, not their text content as csv imports. That's possible and today should be done using FileStream and FileTables in SQL Server, available since a few versions. Nevertheless, it needs some preparations on the database administration level to enable Filestreams and FileTables. Also, FileTables have a defined structure not determined by you, they are not simply a Varbinary(Max) field.

So, if you really just want to store the doc files, then this would help you:

There you find your OPENROWSET reference, but before all that works, you need to administer your database to enable it, as is described in lengths here, for example

And then you don't define a file table with a BLOB or VARBINARY(MAX) field, as the article also says: a) FileStream isn't just a new data type that's a more complex BLOB type, it's a new feature/service of the database server and b) FileTables are a new specific table type to store files or a directory of files.

Then you also have further options, like RBS (remote blob storage) - see a later newer feature.

In the end, it still could be much better to extract the texts out of the docs, if they actually, as Andy and I also imagine resemble CSV (commma separated values) data, then that can be read into tables and you can make better use of the data in queries, than just fetching doc files from the database.

Chriss
 
Chriss,
Unless I've missed something (which is quiet possible), bmacbmac never mentioned anything about CSV, but 'just the text data from each file' and 'import the plain text'.

---- Andy

"Hmm...they have the internet on computers now"--Homer Simpson
 
Hello Andy,

oh, okay, I could have missed that point. But indeed OPENROWSET() is one of the possibilities to query the file system and insert them into a FileTable.
The case of simple text extraction is also just a special case of CSV with only one row and value - a text. But wouldn't Varchar(MAX) or Nvarchar(MAX) be more to the point to store texts than VarBinary(MAX), as bmacbmac is using?
Edit: I looked it up, and varbinary will also work for fulltext indexing. I think I remember, that this can then be used to work on many more encodings, as binary of course allows anything, but you'd be better off with actual unicocde in nvarchar(max), wouldn't you? If you still think about the aspect of taking advantage of having these texts in SQL Server for more than just the storage/backup cycle and fetching through a database connection instead of needing a UNC path and LAN access.

Anyway, it can't hurt to have all the options on the table.

Chriss
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top