Extracting hyperlinks from HTML files without Word/Excel

Discussion in 'Software Development' started by JuniperGreen, Jan 24, 2013.

Thread Status:
Not open for further replies.
  1. JuniperGreen

    JuniperGreen Thread Starter

    Joined:
    Jan 3, 2013
    Messages:
    25
    To pull out hyperlinks I usually rely on VBScript driving Word or Excel, or even OpenOffice using its own Basic. However, I need to extract and list the full hyperlink address and the accompanying "Display Text" of every hyperlink in quite a number of HTML files, in a situation where the aforementioned programs will not be available. As the full link address is not visible on the page (only the Display Text is), it would be a long, laborious job for someone to go through each file, and it would call for some knowledge of HTML, given that they would have to examine the source code. I doubt that the "Edit Hyperlink" route via an editor would even be available to them.

    To cut a long story short, does anyone know of a VBScript that will do the job? I have scoured the net and the forum and can't see anything that addresses this. I would appreciate anyone's help.

    Thank you
     
  2. foxidrive

    foxidrive Banned

    Joined:
    Oct 20, 2012
    Messages:
    793
    Maybe a batch file can help. Would you be able to post a sample HTML page in a zip attachment to your reply?
     
  3. JuniperGreen

    JuniperGreen Thread Starter

    Joined:
    Jan 3, 2013
    Messages:
    25
    Zipped (7-Zip) sample file attached. Not very straightforward, I'm afraid: many of the references are file:/// links, but the ones I am really interested in are those with the full address, i.e. http:// and so on. While I was looking for something in VBS, if you have a batch file which can do this I would be more than happy!

    Appreciate your assistance!

    Thank you
     

    Attached Files:

    • Skye.7z
      File size:
      6.5 KB
      Views:
      16
  4. foxidrive

    foxidrive Banned

    Joined:
    Oct 20, 2012
    Messages:
    793
  5. JuniperGreen

    JuniperGreen Thread Starter

    Joined:
    Jan 3, 2013
    Messages:
    25
    It would be all of them really, so that it can be decided later which might prove useful. This looks very promising so far, but as well as the full URL I would also need the accompanying "Display Text". That might be the tricky bit. Does your tool also extract this information? For example:

    Capture the spirit of brave http://www.skye.co.uk/top-tips.php?id=37

    Fingers crossed!

    Thank you
     
  6. foxidrive

    foxidrive Banned

    Joined:
    Oct 20, 2012
    Messages:
    793
    Not really any good I'm afraid.

    This batch file uses the free GNU sed (a separate download) and produces the output shown below it.
    It's heavily dependent on the HTML format so maybe VBS would be a better tool to use - but again the format of each HTML file is the killer.

    Code:
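    @echo off
    rem Requires GNU sed on the PATH or in the current folder. Prints each advert link as: display text - URL
    rem Group 1 is the http URL before the closing quote and target; group 2 is the text after adverttitle.
    sed -n "s/.*\(http:\/\/.*\)\x22 target.*adverttitle..\(.*\)<\/div>.*/\2 - \1/p" skye.htm
    pause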
    Code:
    GOT HOLIDAYS TO TAKE BY APRIL? - http://www.skye.co.uk/special-offers.php
    CAPTURE THE SPIRIT OF BRAVE - http://www.skye.co.uk/top-tips.php?id=37
    BECAUSE WE KNOW MONEY DOESN'T GROW ON TREES - http://www.skye.co.uk/special-offers.php
    SPRINGWATCH TOP TIPS  - http://www.skye.co.uk/top-tips.php?id=17
    YEAR OF NATURAL SCOTLAND 2013 - http://www.skye.co.uk/great-outdoors.php
    TAKE SKYE.CO.UK WITH YOU 24/7, ON YOUR SMARTPHONE - http://www.skye.co.uk/whats-on.php
    GET ALL AT SEA ON SKYE IN 2013 - http://www.skye.co.uk/top-tips.php?id=30
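
For readers without GNU sed, the same substitution can be sketched in Python, which is a swapped-in language here and not something used in the thread. The sample line is hypothetical: the attachment's markup is not shown, so the `target="_blank"` attribute and the `<div class="adverttitle">` wrapper are assumptions inferred from the sed pattern, and `advert_links` is an illustrative name. Like the sed version, this only works while the HTML keeps that exact shape.

```python
import re

# Same idea as the sed expression: capture the URL that sits before `" target`,
# skip the two characters after "adverttitle" (assumed to be `">`),
# then capture the text up to the closing </div>.
ADVERT_LINK = re.compile(r'(http://[^"]*)" target.*?adverttitle..(.*?)</div>')


def advert_links(html_text):
    """Return (display_text, url) pairs, at most one per line of input."""
    pairs = []
    for line in html_text.splitlines():
        match = ADVERT_LINK.search(line)
        if match:
            pairs.append((match.group(2).strip(), match.group(1)))
    return pairs


# Hypothetical markup, assumed from the sed pattern (not taken from Skye.7z).
SAMPLE = (
    '<a href="http://www.skye.co.uk/top-tips.php?id=37" target="_blank">'
    '<div class="adverttitle">CAPTURE THE SPIRIT OF BRAVE</div></a>'
)

if __name__ == "__main__":
    for text, url in advert_links(SAMPLE):
        print(f"{text} - {url}")
```

Any link that is not followed by a `target` attribute and an `adverttitle` block on the same line is silently skipped, which is the format dependence described above.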
     
  7. JuniperGreen

    JuniperGreen Thread Starter

    Joined:
    Jan 3, 2013
    Messages:
    25
    You certainly got the output right, but I can see that you had to know how the HTML source was coded to be able to write your batch file. I hadn't appreciated how complicated this could be, or that the source coding might well be different for each HTML file, and even within each file. I'm more or less reconciled now to this being impossible without Word/Excel or OpenOffice. C'est la vie!

    However, out of interest, I have just looked at the actual file in Word using its "Edit Hyperlink" dialogue. It identifies the URL address correctly, but the Display Text field merely offers "<<shown in document>>". On the other hand, LibreOffice (the latest incarnation of OpenOffice) gives the correct URL address and also the correct text, "Capture the spirit of brave". This shows that it can be done, although how is the big question!

    Anyway thank you for the interest you have shown in my problem - I appreciate that very much!
     
  8. nikomaster

    nikomaster

    Joined:
    Jan 15, 2013
    Messages:
    87
    You will need a regular expression engine (I'm not sure whether VBScript has one) or an HTML parser.
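
The HTML parser route can be sketched as follows. This is a minimal illustration in Python using only its standard library, not the VBScript that was asked for, and it assumes Python is available on the machine, which the thread does not confirm. The names `LinkCollector` and `extract_links` and the sample markup are hypothetical. Because it walks the tags and does not match a fixed text pattern, it should not depend on how each file lays out its links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (display_text, href) for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def extract_links(html_text, scheme_prefix="http"):
    """Return links whose address starts with scheme_prefix (skips file:/// by default)."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return [(t, h) for t, h in parser.links if h.lower().startswith(scheme_prefix)]


# Hypothetical markup for illustration (not taken from Skye.7z).
SAMPLE = """
<a href="file:///C:/site/index.htm">Home</a>
<a href="http://www.skye.co.uk/top-tips.php?id=37" target="_blank">
  <div class="adverttitle">Capture the spirit of <b>brave</b></div></a>
"""

if __name__ == "__main__":
    for text, href in extract_links(SAMPLE):
        print(f"{text}\t{href}")
```

To cover a folder of files, the same function could be called on the text of each file in turn. Nested or unclosed `<a>` tags are not handled in this sketch.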
     

Short URL to this thread: https://techguy.org/1086646
