Quote:
Originally Posted by Squashman
Where could I put in an echo statement to see what file it is working on at the moment? A lot of the files I work with are really big. Sometimes millions of records. Was hoping I could see the progression of files so I know how far it is along.
You can add an Echo command right after the DO portion of the For loop if you want to see the For variable as it progresses. Just add
@Echo %%X &&
after the DO part (be sure to leave a space after the DO).
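As a rough illustration of where the Echo goes, here is a minimal sketch. The *.txt mask and the Find /C line count are placeholders I made up for the example; they are not taken from any of the scripts in this thread.

```batch
@Echo off
:: Hypothetical loop: the *.txt mask and the Find /C command are placeholders
:: The Echo right after Do prints the current file name before it is processed
For %%X In (*.txt) Do @Echo %%X && Find /C /V "" "%%X"
```

Each file name is printed just before the command that works on that file, so the last name on screen is the file currently being processed.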
Quote:
Originally Posted by Squashman
Houston we have a problem.
I ran Outcaste's batch file on some data I have here at work and it took an eternity to run. Roughly about 3 hours for it to run. It then didn't output all the records from the input files. I should have had roughly 572,458 lines but I only ended up with 409,667. Not sure how I can troubleshoot this.
The other weird thing is that the software I use to view large files is having a heck of a time handling the output file. It takes forever to open it and this software is designed to open large files. It chunks them into smaller sections and shows you one chunk at a time. It was taking about 15 seconds to go from chunk to chunk when it should only take about 2.
I ran the data thru my script with the unix utilities and it doesn't seem to have any problems handling the output file from that and I also got all the output records.
Outcaste, what can we do to debug your batch file? I unfortunately can't send you our customer data to test your batch file with so we are going to have to do it all on my end. Still hoping I can use your batch file or Devil's to do this.
I am going to test Devil's batch file next. Will let you know how that one comes out.
I was going to say if the files are large, or there are a large number of them, Devil's code will be much more efficient. Mine is so slow because it basically reads each file twice: once to generate a numbered list of all records, then a second time to remove the header line. I started that way to cover a more generic situation where the header line may be repeated every XXX number of records, such as a file ready to print with the header on each page. I just modified that to exclude the line that is numbered "one" -- should have just skipped the first line and not used the Find and findstr statements.
Plus I read the entire first file just to get the header line (once with find, then with findstr), then read it again to actually process the file, so it was getting read 4 times.
I can change those, but then the only difference between devil's file and mine would be our choice of variable names.
The missing lines are a typo in my file -- there is a missing ] in the 3rd line from the end:
Code:
For /F "usebackq skip=2 tokens=1* delims=]" %%B In (`Find /V /N "" "%%A" ^|Findstr /I /V /B /C:"[1"`) Do @Echo %%C:"%%A">>%temp%\_f{0}
should be
For /F "usebackq skip=2 tokens=1* delims=]" %%B In (`Find /V /N "" "%%A" ^|Findstr /I /V /B /C:"[1]"`) Do @Echo %%C:"%%A">>%temp%\_f{0}
The editor makes it look like there is a space between the ] and the " because I colored it red -- there is no space.
I was using find to number lines (it adds [number] to the start of each line), then findstr to exclude line 1. Without the closing bracket, findstr excludes every line whose number starts with 1, so lines 10-19, 100-199, 1000-1999, etc. were dropped. In a single file of 200,000 or more lines that comes to 10 + 100 + 1,000 + 10,000 + 100,000 = 111,110 lines, which would account for 111,110 of the 162,791 missing lines (572,458 expected minus 409,667 written). Not sure about the rest of the missing lines.
I'd specifically created files with 20 lines to check that, but never noticed the change in the output file when the ] got dropped. You could add that in and see if you get all the lines, but it will take even longer.
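If you want to see the effect of the missing bracket for yourself, something like the following sketch should show it. The test file name _prefixtest.txt and the "lineN" contents are made up for the example; only the Find and Findstr switches come from the lines above.

```batch
@Echo off
:: Build a hypothetical 20-line test file (name and contents are placeholders)
Set _test=%temp%\_prefixtest.txt
If EXIST "%_test%" Del "%_test%"
For /L %%N In (1,1,20) Do >>"%_test%" Echo line%%N

:: Pattern without the closing bracket: drops every line whose number starts with 1
Echo With /C:"[1":
Find /V /N "" <"%_test%" | Findstr /V /B /C:"[1"

:: Pattern with the closing bracket: drops only line 1
Echo With /C:"[1]":
Find /V /N "" <"%_test%" | Findstr /V /B /C:"[1]"

:: Clean up the test file and the variable
Del "%_test%"
Set _test=
```

With 20 lines, the first listing should contain only lines 2-9 and 20 (line 1 and lines 10-19 are gone), while the second should contain lines 2-20.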
This will be much more efficient, which is the same way devil's file processes the records:
Code:
For /F "usebackq skip=1 delims=" %%B In ("%%A") Do Echo.%%B:"%%A">>%temp%\_f{0}
Quote:
Originally Posted by Squashman
Well here is some more results. Devil's batch file took about 32 minutes to run thru those 500,000 lines.
Then I ran my script. I put time stamps into a log file when each batch file started and stopped.
Devil's
Mon 05/05/2008 21:07:53.31
Mon 05/05/2008 21:39:11.28
Squashman's
Mon 05/05/2008 21:51:33.20
Mon 05/05/2008 21:51:43.56
I really can't explain why mine only takes 10 seconds. It is beyond my comprehension.
I ran it with the data on the Network drive vs my hard drive and it took about 4 minutes.
A Command Prompt (aka DOS) is going to be much slower. It was never really meant to deal with the contents of files, just the files themselves. The batch file commands have to be interpreted, and external commands like find have to be called and passed parameters, whereas SED will have machine language routines to do its manipulation all internally, which can easily be hundreds of times faster, as you can see from your timings (about 31 minutes versus about 10 seconds).
I'm also just guessing that your software may be taking so long with the output file because DOS uses Carriage Return/LineFeed (CR/LF) to end lines. Most *nix systems just use LF. Find and For will read in lines terminated with just LF, but when the filename is added to each record, and the line written to the combined output file, each line will end with CR/LF instead of just the LF. If your software has to convert the CR/LF to just LF before displaying each chunk, it will slow it considerably.
If you can hard code the header line in the batch file, devil's code above or the one I show below will be about the fastest you can get in a batch script.
Devil's method of reading the header only takes about 0.55 to 0.60 seconds for a 700,000 line (avg 88 char/line) file on my system. That shouldn't change on a per file basis, so hard coding the header would only shave about one minute (100 x 0.55 to 0.60 seconds) off the time to process about 100 large files.
A visual basic script might be a bit faster, but I'm not at all proficient at writing those.
Use one of the two outer For lines below, depending on whether you want to process all files except the batch file or only the one extension; the alternative is commented out.
Code:
@Echo off
::Set Output file name here
Set _f{1}=Combined.txt
If EXIST "%_f{1}%" Del "%_f{1}%"
::Output Header to temp file
>%temp%\_f{0} Echo."Name":"Street Address":"City":"St":"Zip":"Filename"
::Read lines from each file excluding the batch file and excluding the header line
::Output to temp file adding :"filename" to end of line
::To process only files with a .CHR extension, use this line in place of the outer For line below
::For /F "tokens=*" %%A In ('dir /b /a-d /o:n "*.chr"') Do (
::This line processes every file in the folder except this batch file
For /F "tokens=*" %%A In ('dir /b /a-d /o:n ^|Find /I /V "%~nx0"') Do (
For /F "usebackq skip=1 delims=" %%B In ("%%A") Do Echo.%%B:"%%A">>%temp%\_f{0}
)
Move %temp%\_f{0} "%_f{1}%"
::Clear the variable set above
Set _f{1}=
I'm running a test with this using the "more efficient" line shown above with a sample file with 600,000 lines to see how long it takes.
Will then try this script to see the difference by hard coding the header.
Then will try devil's file.
Running on a 3.0 GHz Pentium D
Jerry