Extracting and filtering text from a big list

skytech
Member with 60 posts.
 
Join Date: Nov 2009
Experience: Intermediate
17-Nov-2009, 11:15 AM #1
Hello,

I have a problem extracting specific phrases from a list of 500+ rows. All the rows follow a similar pattern, like the 6 rows below:

row1> We sell big and small blue widgets at http://www.bluewidgetsdomain.com/
row2> Our website is http://www.bluewidgetsdomain.com/
row3> We sell many kinds of widgets. Go to this site for green widgets at http://www.green-widgets-domain.net/
row4> Our website is http://www.green-widgets-domain.net/
row5> We sell widgets. Check out red widgets at http://www.red-widgets-domain.org/
row6> Our website is http://www.red-widgets-domain.org/

Qn 1) How can I extract the words bluewidgetsdomain, green-widgets-domain and red-widgets-domain from each row and delete the rest of the words?

Qn 2) For the rows that contain the phrase [widgets at], I want to extract everything after [widgets at] so that I get a list of the domain names. How can I do it?

Qn 3) I want to extract only the domains ending with .com. (For example, in the rows above http://www.bluewidgetsdomain.com/ would be extracted.)

Qn 4) I want to extract the words between [We sell] and [at]. (For example, for row 1 the extracted words will be [big and small blue widgets], for row 3 they will be [many kinds of widgets. Go to this site for green widgets], and for row 5 they will be [widgets. Check out red widgets].)

Qn 5) If the domain has dashes, I want to remove them. (For example, http://www.green-widgets-domain.net/ will become http://www.greenwidgetsdomain.net/)

Qn 6) I want to remove the slash at the end of each domain. (For example, http://www.green-widgets-domain.net/ will become http://www.green-widgets-domain.net)

Qn 7) How do I delete all rows that start with [Our website]?

I appreciate any help. Thanks in advance!
midders
Account Closed with 654 posts.
 
Join Date: Dec 1969
17-Nov-2009, 11:54 AM #2
The (originally Unix) tool sed will make quick work of your text file, but its syntax is not exactly user-friendly. You can download sed for Windows from here. The link also includes links to various tutorials.

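Sample answer to Q2 (the -n option and the p flag make sed print only the rows that contain "widgets at"; without them the other rows are copied to the output unchanged):
type inputfile.txt | sed -n "s/.* widgets at \(.*\)$/\1/p" >output.txt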
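
The same substitute-and-print idea appears to stretch to Q1, Q3 and Q4. What follows is only a rough sketch under some assumptions: it is written for a POSIX shell with sed on the PATH (on the Windows command line the single quotes would likely need to become double quotes), the six sample rows are held in a made-up variable called rows with the "rowN> " prefixes left out, and the function names q1, q3 and q4 are hypothetical.

```shell
# Sample rows from the first post, kept in a variable so the sketch is self-contained.
rows='We sell big and small blue widgets at http://www.bluewidgetsdomain.com/
Our website is http://www.bluewidgetsdomain.com/
We sell many kinds of widgets. Go to this site for green widgets at http://www.green-widgets-domain.net/
Our website is http://www.green-widgets-domain.net/
We sell widgets. Check out red widgets at http://www.red-widgets-domain.org/
Our website is http://www.red-widgets-domain.org/'

# Q1: keep only the name between "http://www." and the next dot.
# "|" is used as the s-command delimiter so the slashes need no escaping.
q1() { sed -n 's|.*http://www\.\([^.]*\)\..*|\1|p'; }

# Q3: print only the URLs that end in ".com/"; other rows print nothing.
q3() { sed -n 's|.*\(http://[^ ]*\.com/\)$|\1|p'; }

# Q4: keep the words between "We sell " and the last " at " on the row.
q4() { sed -n 's/^We sell \(.*\) at .*/\1/p'; }

printf '%s\n' "$rows" | q1
printf '%s\n' "$rows" | q3
printf '%s\n' "$rows" | q4
```

Because every row carries a URL, q1 prints one name per row, so each name shows up twice for the sample data; piping the result through sort -u would collapse the duplicates. For a real file, each function can read it directly, for example q4 <inputfile.txt >output.txt.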

Good luck

midders
stantley
Distinguished Member with 6,738 posts.
 
Join Date: May 2005
Location: Pittsburgh PA USA
Experience: ,The Jimi Hendrix
17-Nov-2009, 12:17 PM #3
You could do that with Notepad++, a free Notepad replacement.

It has a more powerful Search and Replace utility which should handle most of the flavors of replacements you need to do.
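
The thread does not spell out the exact Search and Replace steps, so the following is not a Notepad++ recipe. It is a rough sketch of the replacements themselves for Q5 to Q7, written as sed commands so they can be tried from a shell. Assumptions: a POSIX shell with sed, the sample rows stored in a made-up variable called rows without the "rowN> " prefixes, hypothetical function names q5 to q7, and dashes that occur only inside the URLs (true for the sample rows).

```shell
# Sample rows from the first post, kept in a variable so the sketch is self-contained.
rows='We sell big and small blue widgets at http://www.bluewidgetsdomain.com/
Our website is http://www.bluewidgetsdomain.com/
We sell many kinds of widgets. Go to this site for green widgets at http://www.green-widgets-domain.net/
Our website is http://www.green-widgets-domain.net/
We sell widgets. Check out red widgets at http://www.red-widgets-domain.org/
Our website is http://www.red-widgets-domain.org/'

# Q7: delete every row that starts with "Our website".
q7() { sed '/^Our website/d'; }

# Q6: remove a single slash at the very end of each row.
q6() { sed 's|/$||'; }

# Q5: remove every dash on the row.
# Assumption: dashes appear only inside the URLs, as in the sample rows.
q5() { sed 's/-//g'; }

# The three steps can be chained in one pipeline.
printf '%s\n' "$rows" | q7 | q6 | q5
```

The same three patterns (a row starting with "Our website", a slash at the end of a row, a dash) are ordinary regular expressions, so a Search and Replace tool with a regular expression mode should be able to apply them as well.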