Extracting hyperlinks from HTML files without Word/Excel

Status
This thread has been Locked and is not open to further replies. Please start a New Thread if you're having a similar issue. View our Welcome Guide to learn how to use this site.

JuniperGreen

Thread Starter
Joined
Jan 3, 2013
Messages
25
To pull out hyperlinks I usually rely on VBScript driving Word or Excel, or even OpenOffice using its own Basic. However, I need to extract and list the full hyperlink address and the accompanying "Display Text" of all the hyperlinks in quite a number of HTML files, in a situation where recourse to the aforementioned programs will not be possible. As the full link address is not visible on the page (only the Display Text is), it would be a long, laborious job for someone to go through each file, as they would have to examine the source code, which calls for some knowledge of HTML. I doubt the "Edit Hyperlink" route via an editor would even be available to them.

To cut a long story short, does anyone know of a VBScript which will do the job? I have scoured the net and the forum and I can't see anything which addresses this. I would appreciate anyone's help.

Thank you
 

foxidrive

Banned
Joined
Oct 20, 2012
Messages
793
Maybe a batch file can help. Would you be able to post a sample HTML page in a zip attachment to your reply?
 

JuniperGreen

Thread Starter
Joined
Jan 3, 2013
Messages
25
Zipped (7-Zip) sample file attached. Not very straightforward, I'm afraid: many of the references are file:/// links, but the ones I am really interested in are those with the full address, i.e. starting with http://. While I was looking for something in VBS, if you have a batch file which can do this I would be more than happy!

Appreciate your assistance!

Thank you
 

JuniperGreen

Thread Starter
Joined
Jan 3, 2013
Messages
25
It would be all of them really, so it could be decided which might prove useful. This looks very promising so far, but as well as the full URL I would also need the accompanying "Display Text". That might be the tricky bit: does your tool also extract this info? For example:

Capture the spirit of brave http://www.skye.co.uk/top-tips.php?id=37

Fingers crossed!

Thank you
 

foxidrive

Banned
Joined
Oct 20, 2012
Messages
793
Not really any good I'm afraid.

This batch file, which uses the free GNU sed (a separate download), produces the output shown below it.
It's heavily dependent on the HTML format so maybe VBS would be a better tool to use - but again the format of each HTML file is the killer.

Code:
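@echo off
rem Requires GNU sed (sed.exe) on the PATH or in this folder.
rem Prints "display text - URL" for each line of skye.htm that has an http:// link
rem followed by a target attribute and an "adverttitle" div holding the text.
sed -n "s/.*\(http:\/\/.*\)\x22 target.*adverttitle..\(.*\)<\/div>.*/\2 - \1/p" skye.htm
pause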
Code:
GOT HOLIDAYS TO TAKE BY APRIL? - http://www.skye.co.uk/special-offers.php
CAPTURE THE SPIRIT OF BRAVE - http://www.skye.co.uk/top-tips.php?id=37
BECAUSE WE KNOW MONEY DOESN'T GROW ON TREES - http://www.skye.co.uk/special-offers.php
SPRINGWATCH TOP TIPS  - http://www.skye.co.uk/top-tips.php?id=17
YEAR OF NATURAL SCOTLAND 2013 - http://www.skye.co.uk/great-outdoors.php
TAKE SKYE.CO.UK WITH YOU 24/7, ON YOUR SMARTPHONE - http://www.skye.co.uk/whats-on.php
GET ALL AT SEA ON SKYE IN 2013 - http://www.skye.co.uk/top-tips.php?id=30
 

JuniperGreen

Thread Starter
Joined
Jan 3, 2013
Messages
25
You certainly got the output spot on, but I can appreciate that you had to know the HTML source coding to write your batch file. I didn't appreciate how complicated this could be, or that the source coding might well be different for each, and even within each, HTML file. I'm more or less reconciled now to this being impossible without the use of Word/Excel or OpenOffice. C'est la vie!

However, out of interest, I have just looked at the actual file in Word using its "Edit Hyperlink" dialogue. It identifies the URL address correctly, but the Display Text field merely offers "<<shown in document>>". LibreOffice (the latest incarnation of OpenOffice), on the other hand, gives the correct URL address and also the correct text, "Capture the spirit of brave". This shows that it can be done, although how is the big question!

Anyway thank you for the interest you have shown in my problem - I appreciate that very much!
 
Joined
Jan 15, 2013
Messages
87
You'll have to play with a regular expression engine (not sure whether VBScript has one) or an HTML parser.
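
For what it's worth, here is a rough sketch of the HTML parser route. It is in Python 3 rather than VBScript (an assumption: that Python can be installed on the machine where Word/Excel cannot), and it uses only the standard library's html.parser module. The names LinkCollector and extract_links are made up for illustration, and the sample markup is a guess at what such a page might contain, not the actual attached file. The idea is to remember the href when an <a> tag opens, gather any text until the matching </a>, and keep only the links whose address starts with http.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (display text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. a <div> within the link) lands here too.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace and newlines into single spaces.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def extract_links(html, prefix="http"):
    """Return (text, href) pairs whose href starts with the given prefix."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [(t, h) for t, h in parser.links if h.lower().startswith(prefix)]


# Hypothetical sample markup; a real page may be structured differently.
sample = '''
<a href="file:///C:/local/page.htm">Local page</a>
<a href="http://www.skye.co.uk/top-tips.php?id=37" target="_blank">
  <div class="adverttitle">Capture the spirit of brave</div></a>
'''

for text, href in extract_links(sample):
    print(f"{text} - {href}")
```

Because the parser follows the tags rather than matching one particular line layout, it should cope with pages whose source is formatted differently from one another, which was the sticking point with the sed approach. To run it over real files you would read each file's contents and pass them to extract_links; pass prefix="" if the file:/// links are wanted as well.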
 