Does anyone know a simple way to crawl a website and extract text data, where the URL changes by a fixed step for each page checked, as in a numeric range? For example: website.com/1000 through website.com/1999. Thanks in advance.
The crawlers that I know of can crawl a site and duplicate it on your hard drive. But I don't know of any that can extract the text, or crawl by "intervals" as you call them. (What programmer is going to anticipate that you have a site with pages named 1000-1999?)
This would involve web scripting.
At the scripting level, the requirements would include the following:
A script that parses each loaded web page.
A search routine that scans the URL's text for the subpage name with a numerical component in it.
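As a rough illustration of that second requirement, here is a minimal Python sketch that finds the numeric part of a URL and generates the rest of the range from it. The helper names (split_numbered_url, url_range) are made up for this example, and website.com is just the placeholder from the question. It assumes the page number is the last run of digits in the URL and is not zero-padded.

```python
import re


def split_numbered_url(url):
    """Split a URL into (prefix, number, suffix) around its last run of digits."""
    # (\d+) is a run of digits; (\D*)$ requires that only non-digits follow it,
    # so the match is always the final numeric component of the URL.
    match = re.search(r"(\d+)(\D*)$", url)
    if match is None:
        raise ValueError(f"no numeric component in {url!r}")
    prefix = url[:match.start(1)]
    # Note: int() drops leading zeros, so zero-padded page names need extra care.
    return prefix, int(match.group(1)), match.group(2)


def url_range(first_url, last_number):
    """Build every URL from the number in first_url up to last_number, inclusive."""
    prefix, first_number, suffix = split_numbered_url(first_url)
    return [f"{prefix}{n}{suffix}" for n in range(first_number, last_number + 1)]


urls = url_range("http://website.com/1000", 1999)
print(len(urls), urls[0], urls[-1])
```

If the pages are known in advance to be plain numbers, the regular expression is not strictly needed and a simple range() loop would do the same job.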
I think hidden frames would be required to load the website without displaying it on the screen or in the browser, too. The process is not all that simple. You may want to consider using a web scripting language for this. JavaScript, PHP and Python are some examples of web scripting languages. It is better to keep web scraping scripts separate from a website's own scripts.
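For what it is worth, a standalone script can often request the pages directly over HTTP, in which case nothing is displayed and no browser is involved. The following is only a sketch in Python using the standard library, not a finished crawler. The names TextExtractor, html_to_text, fetch and crawl_range are invented for this example, the URL layout website.com/<number> is taken from the question, and it assumes the pages are static HTML (text generated by JavaScript in the browser would not be captured this way).

```python
import time
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch(url):
    """Download one page and decode it to a string."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl_range(base_url, start, stop, fetch_page=fetch, delay=1.0):
    """Fetch base_url/start .. base_url/stop (inclusive) and map URL -> text.

    Pages that fail to load are recorded as None instead of stopping the run.
    """
    results = {}
    for number in range(start, stop + 1):
        url = f"{base_url}/{number}"
        try:
            results[url] = html_to_text(fetch_page(url))
        except OSError:  # urllib's URLError and HTTPError are subclasses of OSError
            results[url] = None
        if delay > 0:
            time.sleep(delay)  # pause between requests to go easy on the server
    return results


# Example call (placeholder address from the question, not a real target):
# pages = crawl_range("http://website.com", 1000, 1999)
```

The fetch_page parameter is there so the loop can be tried against canned HTML before pointing it at a real site. It is also worth checking the site's terms of use and robots.txt before running something like this over a thousand pages.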