Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

VFPxWorkbookXLSX takes some time to open XLSX with around 1800 rows 1

Status
Not open for further replies.

Rajesh Karunakaran

Programmer
Sep 29, 2016
549
MU
Hi friends,

I am reading an XLSX file using VFPxWorkbookXLSX class. A file with around 1800 rows took around 100 seconds to open.
Means, I checked time before and after the
Code:
lnWb = loExcel.OpenXlsxWorkbook(mySourceFile)
call

However, after that it reads the rows in lightning spead.
I am not sure if the XLSX itself has some problem!
(This is the same file which I was discussing in my earlier post on this XLSX class but for another issue. Those who have gone through must have understood already)

Does the class load full content of the file onto memory before setting it ready for further processing?
I mean, is this behavior known and the reason?

Rajesh

 
Hi Rajesh,

100 seconds is a long time. As you suspect this may be because a problem of the XLSX.
It takes some time to unzip the XLSX before getting at the single files in it, but that should not take 100 seconds.

To see what takes so long I'd use logging of execution time with coverage profiling. Mainly just SET COVERAGE TO somefile.log right before calling OpenXlsxWorkbook() and SET COVERAGE TO after it. Then you get a log of every line of code executed with it's time.

VFPs coverage profiler can read in such a csv file and has two modes, code coverage (where the name comes from) that marks which lines are executed and the profiler mode, which displays total and average excution time. As you're not the developer of the class, it'll be less easy to find the bottleneck for you, but if you provide such a log file here, I guess Greg could use it, you can also read the CSV into a table or cursor and make group by queries summing the execution time and telling you which line or lines contribute to the 100s the most. It's likely no single code line takes that long (except the OpenXlsxWorkbook line, but there likely is a loop that needs way too many iterations because of bugs in one of the XML files withing the XLSX file.

Chriss
 
Hi Mike,

Yes, I have QuickHeal installed and that could possibly cause this.
Let me check that as well.

Thanks,
Rajesh
 
Mike,

I disabled my virus scanner and tried this. It didn't make any difference.

Rajesh
 
The time could also be spent trying to resolve schemaLocations given in the xlsx document.

I'd try to load the xlsx in Excel and save under a new file name, then see if Excel mends the internal xml files, ie if that xlsx can be read in fast.

Still a coverage log will surely point out where the time is spent.

Chriss
 
Chriss,

I opened the XLSX in MS Excel and saved onto another filename.
Unfortunately, the new file also shows the same problem.

By the way, when I added
Code:
loExcel      = CREATEOBJECT("VFPxWorkbookXLSX", "VFPxWorkbookXLSX.vcx")

SET COVERAGE TO emailin.log
lnWb         = loExcel.OpenXlsxWorkbook(pSourceFile)
SET COVERAGE TO

loSheets     = loExcel.GetWorkbookSheets(lnWb)
lnSheetCount = loSheets.Count

it created emailin.log of 306,100 KB !

Rajesh
 
Yes, it's not rare. When something takes so long, it also usually means a long log file. It mustn't be that much lines of code, but a loop can be iterated thousands or hundreds of thousand times, and each execution creates a line in the log file, the log is not summarized.

Therefore the analysis has to group by same method and line no and sum the exec times to verify the execution time. It's very likely you find such a few lines, say 10-20 lines of an iteration being repeated. Each of them may not take more than just a ms or lower, but they can sum up to the 100 seconds, and when you know the location in code you can set a break point and debug the behavior, see what's done, how there are so many iterations. Look at actual data processed perhaps wrongly.

Or another variant is some lines really just execute for a very long time, waiting for a result, ie blocking everything else to go on, until they reach a timeout. Again, you or Greg can find out.

Of course you don't skim through the log line by line, this is data analysis time. Pay attention to the CSV description of the log file in the help: I also already mentioned the coverage profiler tool in VFP.

Chriss
 
Chriss,

I had created a XLSX with only 4 rows. I just copied those rows to more rows to make the file to around 1900 rows.
I tried with that file. That takes only around 14 seconds to finish the 'OpenXlsxWorkbook' method.

Rajesh
 
Good, that points out it's an issue with the data.

The answer were and why your other XLSX file needs 100 second still is in the coverage log.

Of course the number of iterations or whatever causes it to take long in one case and not the other will need analysis, it's the code in conjunction with the data, that leads to likely unnecessary often looping or any other mishap that causes the overall time.

The length of the log files, especially now in the new case that runs faster, could help find a difference. 14 seconds still isn't that good, but it's not as bad.

Chriss
 
Perhaps one more thought to make clear what I think:

In such cases where some files work nice, some okayish and some bad, it's obvioulsy a combination of data and code.
Loggin itself won't reveal a single line of code to fix, there isn't an error thrown, some loops just take long as data may be something the developer never expected to encounter.

Greg surely tested a few XLSX files, likely those coming from Excel itself and not some library.

What I suspect is that after analysis it will turn out what to avoid when creating the xml files that compose the xlsx.

One more thing you should do for comparison is create a file with Excel natively from scratch and get a reference how long it takes to read in.
Another thing you said in detail is, that after the load time you can get at single nodes very fast. Well, the one may have to do with the other, ie the fast processing of nodes is the result of a slow preprocessing.

To be clear also, I'm not at analyzing this, don't know if Greg still follows this thread, but one thing that could play a role too is your specific setup hardware, network, any bottleneck.

Chriss
 
Just coming back to this thread -- I do not check everyday...

The class when reading the sheet.xml files have to parse the rows and then the columns to finally retrieve the cell values. I first tried to use MSXML.DOM but found it to be slow; then I tried using string functions in VFP -- it was better but still slow. I finally created a class that is referred to as FastStrings (also available on GitHub site with the workbook class) and it uses SYS(2600) function calls to handle the strings with pointers. This last method turned out to be the fastest in my testing (both for small and large spreadsheets). When I parse the xml file for the cell values, I have to determine the data type. For a string value this is readily available as an attribute with the cell; however, for the other data types, I have to determine based on the format being specified. So, this determination of the data type can be a bit time consuming.

Hope the above can shed light on what takes a long time.
 
Dear Greg,

Yes, it's clear and understood.

Now, before saying anything, a big thanks from my side also, for gifting our community with such a fantastic library!

Yes, I also spent time to look inside the class by adding a SET STEP ON just before it starts opening the file. Yes, I saw that you use SYS(2600) in the process. I found that, with each XML file, you're advancing through its content, character by character looking for Tag expressions, Data type identifiers etc., and then comparing it with the expression it just found earlier, if any. I didn't go more deeper.

Anyway, I have a problem in a particular project but I have been using the class for many other projects successfully.
Thank you once again!

Rajesh


 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top