Search a file with findstr


Squashman's Avatar
Distinguished Member with 12,328 posts.
 
Join Date: Apr 2003
Location: 1265 Lombardi Ave
17-May-2008, 12:49 AM #16
Another stupid question.
echo.%%a>>%_OF0%

I thought echo. sends a blank line to output. Why do we need the period?
devil_himself's Avatar
Distinguished Member with 4,794 posts.
 
Join Date: Apr 2007
Location: India
Experience: Advanced
17-May-2008, 01:00 AM #17
I use period to avoid the potential "ECHO is off."

Check out these two examples:

With Period
Code:
 @echo off
  setlocal
  set var=This is a test
  echo.%var%
  set var=
  echo.%var%
  echo Test Over
  endlocal
Without Period

Code:
 @echo off
  setlocal
  set var=This is a test
  echo %var%
  set var=
  echo %var%
  echo Test Over
  endlocal
Squashman's Avatar
29-May-2008, 11:17 PM #18
Quote:
Originally Posted by TheOutcaste View Post
This one is much faster, as it only does the findstr once per line

Code:
@echo off
Echo.%time%>time1.txt
setlocal enabledelayedexpansion
Set _WL=search.txt
Set _OF0=Profanity.txt
Set _OF1=Not_Profanity.txt
Set _SF=file 1.txt
If EXIST %_OF0% del %_OF0%
If EXIST %_OF1% del %_OF1%
For /f "usebackq tokens=*" %%a In ("%_SF%") Do (
  Set _T0=%%a & echo.%%a|findstr /g:"%_WL%">nul & call :_Output!errorlevel!
  )
Echo.%time%>>time1.txt
Goto:EOF
:_Output0
echo.!_T0!>>%_OF0%
Goto:EOF
:_Output1
echo.!_T0!>>%_OF1%
Goto:EOF
But your original idea using if then else is the fastest:
Code:
@echo off
setlocal enabledelayedexpansion
Set _WL=search.txt
Set _OF0=Profanity.txt
Set _OF1=Not_Profanity.txt
Set _SF=file 1.txt
If EXIST %_OF0% del %_OF0%
If EXIST %_OF1% del %_OF1%
For /f "usebackq tokens=*" %%a In ("%_SF%") Do (
  echo.%%a|findstr /g:"%_WL%">nul
  If !errorlevel!==0 (echo.%%a>>%_OF0%) else echo.%%a>>%_OF1%
  )
On a relative scale, times for the 1st, 2nd, and 3rd versions are:
  1. 1.000
  2. 0.556
  3. 0.500

Not too surprising that the first one takes twice as long, as it does the findstr twice.
Doesn't seem to make any difference if you use the /v switch on findstr.

Jerry
Well, I finally got around to testing it on a rather large file with over 2 million records. The batch file immediately chokes on it, and I am not sure why.

Quote:
C:\8343>profanity.bat
Not enough storage is available to process this command.
Out of memory.
C:\8343>
Now I know this will run if I use grep. I originally used a simple one-line grep command, but it would only output the records that had profanity in them. I would then run grep again with a reverse search to get the non-profanity records, which is what brought me to start this thread: was there a way to do this natively with a batch file? Apparently it can't handle the large file size or something. The grep batch file that I tested with handled over 10 million records, each 900 bytes long.

I got the same error using Devil's code as well.

So now I am back to square one again.
__________________
I hate asking the same question twice!
How to ask questions the smart way!
Microsoft MVP - Windows Shell/User
TheOutcaste's Avatar
Senior Member with 1,538 posts.
 
Join Date: Aug 2007
Location: Oregon, USA
Experience: Intermediate
03-Jun-2008, 01:08 AM #19
Yeah, I don't think the command prompt was really designed to handle large files. I just did a test with a file that was only 411094 KB in size (a list of addresses) using just two search strings. Watching memory usage for cmd.exe in Task Manager, it jumped to 827 MB while the batch file was running, and it paused for quite a bit before it actually did the echo|findstr part. I'm guessing it's loading the entire file into memory to work on, so your 9GB+ file would choke it (unless you have, say, 32 or 64 GB of RAM on your system).

Running it now on a 51387 KB file, and memory usage is 106804 KB. Seems to be about 2x file size plus about 4 MB. When I first open a prompt, it's using about 2480 KB.
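
To check that estimate against the figures quoted above, here is a quick sketch. The variable names are made up for illustration, and set /a only handles 32-bit signed integers, so it works for KB figures of this size but not for byte counts of multi-GB files.

```batch
@echo off
rem Figures quoted in the post, in KB. Variable names are hypothetical.
set /a _FILE_KB=51387
set /a _CMD_KB=106804
rem What is left after subtracting twice the file size.
set /a _OVERHEAD_KB=_CMD_KB - 2 * _FILE_KB
echo Overhead beyond 2x file size: %_OVERHEAD_KB% KB
```

This prints 4030 KB, so the fixed overhead on top of twice the file size is roughly 4 MB.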

And after running on this 51387KB file for an hour, I'm guessing it will take another 22 hours to finish.
That's using this version:
Code:
@echo off
Echo.%time%>time.txt
Set _WL=search1.txt
Set _OF1=Profanity.txt
Set _OF2=Not_Profanity.txt
Set _SF=big 1.txt
If EXIST %_OF1% del %_OF1%
If EXIST %_OF2% del %_OF2%
For /f "usebackq tokens=*" %%a In ("%_SF%") Do (
  echo.%%a|findstr /i /g:"%_WL%">>%_OF1%
  echo.%%a|findstr /i /v /g:"%_WL%" >>%_OF2%)
For %%A In (WL OF1 OF2 SF) Do Set _%%A=
Echo.%time%>>time.txt
The version using the IF THEN ELSE format shows the same memory usage.

Surprisingly, just using 2 findstr statements took 45.5 seconds for the 411094 KB file, and memory usage peaked at 2776 KB. Increased the search strings to 10 and it didn't make much difference.
However, findstr.exe peaked at about 411000 KB. I then created a 4,110,0938 KB file, and findstr.exe peaked at 2288 KB, so it seems it will use RAM if available, if not, it uses a buffer and reads in chunks.

But it seems to have crashed. CPU usage was about 20-25% (40-50% of one core), then usage hit 49% (98% on one core) and while Task Manager still shows I/O reads, I/O writes have stopped, and the output file is not growing. Wasn't paying attention to see when this happened though, checked it about 10 minutes after I started it.
Hit CTRL+C, got prompted to Terminate, chose NO, and it started on the 2nd portion, so it seems it hung after finishing the first findstr statement.
2nd portion finished after about 2-3 minutes, and hung again. CTRL+C, said no, and the batch finished. Not sure why.
Ran findstr with a junk search string just to count the number of records in one of the result files (28,799,994) and it hung when it finished. Had to CTRL+C to get the prompt back. Seems findstr doesn't like large files, so that probably won't be a solution for you either, unless someone has an idea how to stop it from hanging -- course that may be something quirky with my system as well, as it does have issues with long running processes, so you might want to test it on yours.

Much Much faster than a For loop though.

So, try this:
Code:
@echo off
Echo.%time%>time4.txt
Set _WL=search1.txt
Set _OF1=Profanity.txt
Set _OF2=Not_Profanity.txt
Set _SF=big3.txt
If EXIST %_OF1% del %_OF1%
If EXIST %_OF2% del %_OF2%
findstr /i /g:"%_WL%" "%_SF%">>%_OF1%
findstr /i /v /g:"%_WL%" "%_SF%">>%_OF2%
For %%A In (WL OF1 OF2 SF) Do Set _%%A=
Echo.%time%>>time4.txt
Jerry
__________________
Of course I know all the answers ; I just don't always match the answers to the right questions

Warning -- Windows spoken here. (Rated R for Strong Language and Violence -- When your Windows PC flies through a window, that's violent, right?)
Squashman's Avatar
07-Jun-2008, 01:02 AM #20
That was much faster. Kind of sad that we have to pass the data twice though. Your new script took:
23:46:49.24
23:50:18.27
That was on a 1.88GB file with 2.5 million records.
Weird thing is that I can't figure out why some of the records ended up in my Profanity output file.

I may have found a work around using SED though. Will test it to see if it is faster than yours.

Why do you have that for statement in your last set of code?

Last edited by Squashman : 07-Jun-2008 01:12 AM.
Squashman's Avatar
07-Jun-2008, 01:22 AM #21
Well the SED script isn't working much better.
5 minutes in and it has only processed about 400MB of the file.

I think I will test it using 2 GREPS.
Squashman's Avatar
07-Jun-2008, 02:27 AM #22
I ran it with a double grep and the number of output records was the same, which is a good thing.
Grep was a bit faster.
0:59:29.08
1:02:35.54
Then Findstr.
1:06:49.53
1:10:19.66
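
For anyone wanting to compare record counts between two output files without extra tools, a minimal sketch using the find /c /v "" idiom, which counts the lines in a file. The two demo files and their contents are hypothetical stand-ins for the grep and findstr outputs.

```batch
@echo off
setlocal
rem Hypothetical demo files standing in for the two tools' output files.
set "_A=%TEMP%\count_a.txt"
set "_B=%TEMP%\count_b.txt"
>"%_A%" (echo one& echo two& echo three)
>"%_B%" (echo uno& echo dos& echo tres)
rem find /c /v "" counts every line, because no line contains the empty string.
for /f %%N in ('find /c /v "" ^< "%_A%"') do set "_NA=%%N"
for /f %%N in ('find /c /v "" ^< "%_B%"') do set "_NB=%%N"
if "%_NA%"=="%_NB%" (echo Same count: %_NA%) else echo Different: %_NA% vs %_NB%
endlocal
```

The demo files are left in %TEMP%. Matching counts only show that both tools selected the same number of records, not that they selected the same ones.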

The weird thing about both is that I can't figure out why they chose some records. There are plenty of records that I am looking at that don't seem to have any profanity in them. I am not sure what they are matching.

My SED script took almost 24 minutes and for some reason the match output came out looking funny. I am not sure what happened, but sed tacked on an extra linefeed at the end. It didn't do it on my smaller test files, but for some reason it did it this time.
TheOutcaste's Avatar
07-Jun-2008, 04:55 PM #23
About 11% faster. *nix wins again
Quote:
Originally Posted by Squashman View Post
Why do you have that for statement in your last set of code?
For %%A In (WL OF1 OF2 SF) Do Set _%%A=

This is just to clean up the variables used. Equivalent to 4 separate Set statements:
Set _WL=
Set _OF1=
Set _OF2=
Set _SF=


Just a habit in case I may want to use the value in a variable in a different batch file. I'd just have to remove that one variable name from the For loop.
This also lets me comment out the For statement and then be able to use set after the batch ends to look at the variable values. You can use setlocal to do the same, but changes to a variable after the setlocal statement are not saved.

Another variation is to use numbered variables like _tX
Then you can use For /L %%A in (0,1,5) do Set _t%%A= to clear them, just adjusting the start and end numbers (the order is start,step,end).
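
A small sketch of both cleanup styles, for comparison. The variables _t0 through _t3 and _scratch are hypothetical names used only for this demo.

```batch
@echo off
rem Hypothetical scratch variables _t0.._t3, created only for this demo.
for /L %%A in (0,1,3) do set "_t%%A=value%%A"
set _t
rem For /L takes (start,step,end), so (0,1,3) visits 0, 1, 2 and 3.
for /L %%A in (0,1,3) do set "_t%%A="
if not defined _t0 if not defined _t3 echo _t0 through _t3 cleared

rem The setlocal alternative: everything set after setlocal vanishes at endlocal.
setlocal
set "_scratch=temporary"
echo Inside setlocal: _scratch=%_scratch%
endlocal
if not defined _scratch echo _scratch is gone after endlocal
```

The For loop style lets you choose which variables survive the batch. setlocal/endlocal discards every change made in between.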

I'm guessing you have more than 1.88GB of RAM. I still wonder if findstr will hang if you test a file that is larger than your physical RAM.

As for why it would match some records with no obvious words, you might try putting some of the records it found in a file by themselves and do the findstr again just to see if they match again. This will verify that it's not some weird glitch (not likely)
I'd also look closely for typos. For example, if you transpose the f and t in shifting, it would be flagged but might be real hard to spot by eye.
Or put those records in a word processor and search for the words in your list. Will help find matches embedded in the middle of words.
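
One way to narrow down which search word is responsible is to test the words one at a time against a small file of suspect records. A rough sketch, assuming hypothetical file names and made-up sample data (darn, heck, and crud stand in for the real word list). Search words containing quotes or other special characters would need extra care.

```batch
@echo off
setlocal
rem Hypothetical demo files; substitute your own word list and suspect records.
set "_WL=%TEMP%\which_words.txt"
set "_SUSPECT=%TEMP%\which_suspect.txt"
>"%_WL%" (
  echo darn
  echo heck
  echo crud
)
>"%_SUSPECT%" (
  echo 1002 Darnell Jones
  echo 1007 Ann Checkley
)
rem Test each search word on its own and show the records it hits.
for /f "usebackq delims=" %%W in ("%_WL%") do (
  findstr /i /l /c:"%%W" "%_SUSPECT%" >nul && (
    echo Word "%%W" matches:
    findstr /i /l /n /c:"%%W" "%_SUSPECT%"
  )
)
endlocal
```

With this sample data, "darn" is reported against the Darnell record and "heck" against the Checkley record. Both are embedded matches that are easy to miss by eye.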

To avoid matching on embedded strings, add a space before or after the search word in the search.txt file (or search1.txt, just noticed I changed the name somewhere along the line).
{space}profanity
profanity{space}
Can't say I can think of any "proper" words that have an embedded profanity though.

The problem with this is that the first won't match if it's the first word on a line, and the second won't match if it's the last, so some words might be missed. Might be able to use regular expressions with findstr to work around that though.
I've not played with findstr and its regular expression features much, as I haven't found much documentation on it yet. My quick test with a 51MB file shows that just adding the /R switch makes the search take about 1.5 times as long. Adding checks for beginning and end of line, or beginning and end of word, ups that to about 5 times as long: 3.6 seconds with a normal search, 5.1 seconds with /R, and 18 seconds with checking beginning of line. If you have multiple searches specifying beginning or end positions, I suspect it will take even longer.
{space}profanity
profanity{space}
^profanity (to catch at start of line)
profanity$ (to catch at end of line)

If you need to do something like that, might be better to search the Not_Profanity.txt file for just the beginning/ending matches. Would only be a help if the Not_profanity file is much smaller than the original though.
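
A minimal sketch of those anchored searches, plus findstr's word-boundary markers \< and \> as a possible way to cover all positions in one pattern. The file name and sample records are hypothetical, "darn" is a stand-in search word, and the end-of-line test assumes the file has Windows (CRLF) line endings.

```batch
@echo off
setlocal
rem Hypothetical demo file with the word at the start, at the end, and embedded.
set "_SF=%TEMP%\anchor_demo.txt"
>"%_SF%" (
  echo darn it all
  echo well darn
  echo Darnell Jones
  echo nothing here
)
echo Start of line:
findstr /i /r /c:"^darn " "%_SF%"
echo End of line:
findstr /i /r /c:" darn$" "%_SF%"
echo Whole word anywhere:
findstr /i /r /c:"\<darn\>" "%_SF%"
endlocal
```

The /c: switch keeps the space inside the quotes as part of the pattern rather than as a separator between search strings. None of these patterns should pick up the Darnell record, which is the point when embedded matches are not wanted.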

Jerry
Squashman's Avatar
07-Jun-2008, 06:05 PM #24
512MB of RAM on a 2.4GHz Pentium 4.

We kind of need to match on embedded strings. Some of our clients' data is really bad because it comes from a web sign-up form. So I need to make sure I match things like
C
U
N
T
L
Y
Peniston.

Had to get around the forum filters to post that.
TheOutcaste's Avatar
07-Jun-2008, 06:38 PM #25
Figured it was something like that.

Strangely though, the filters still caught it on the email notice, it came through as ****ly. Unless you edited it quick enough; edits made in the first minute or two don't always flag the post as having been edited.

It's obviously matching on something, just a matter of spotting it to see if it's valid, or something you need to work around.

If you can zip your search file and a few of the records it's found that don't appear obvious I'd be happy to look at them. Fresh pair of eyes might help.

Assuming of course that the records aren't confidential. I can PM an email addy if it's too large to post, or you don't want it public.

Jerry
TheOutcaste's Avatar
07-Jun-2008, 06:45 PM #26
And seems findstr hanging may be an issue with just my PC. Have to try one with a file just barely over the 2GB size of my RAM.

Jerry
Squashman's Avatar
07-Jun-2008, 10:31 PM #27
I was thinking of just pulling a few of those records out and running them by themselves or manually searching those records with find/replace in notepad or something. The last couple of records in the file are really suspicious. I can just tail the last 10 records from the file and rerun it. I unfortunately can't send the data to you. But thanks for the offer to help.
ghostdog74's Avatar
Member with 83 posts.
 
Join Date: Dec 2005
09-Jun-2008, 11:16 AM #28
Quote:
Originally Posted by Squashman View Post

I may have found a work around using SED though. Will test it to see if it is faster than yours.

download GNU grep for windows instead. Then on the command prompt
Code:
c:\test> grep -f profanity file
something like that. read the docs for GNU grep for more info.
Squashman's Avatar
09-Jun-2008, 12:33 PM #29
Quote:
Originally Posted by ghostdog74 View Post
download GNU grep for windows instead. Then on the command prompt
Code:
c:\test> grep -f profanity file
something like that. read the docs for GNU grep for more info.
I am glad you read this whole thread.