Extracting from a simple directory

Hello,

Let’s start learning data scraping with an easy example.

As my hubby is an ortho surgeon, I decided to extract a list of orthopaedic surgeons. Here is the directory page that we will extract in this example: Trauma Directory.

Below you can see what the Visual Web Ripper screen looks like. I am using the latest version, i.e. 2.126.12.

Visual Web Ripper

In the address bar, enter the URL of the trauma directory given above. The page will load, as you can see in the picture above.

On this page, you will see that the categories of trauma surgeons are listed. We need to open all these categories one by one. For this looping, we use the “Templates” feature.

As you see in the following pic:

Trauma categories

Click on the category “Acute medicine”, then click the “templates” tab. In it, click “new”, select “link area” from the drop-down, and check the “create list” option. Click OK and save it. You will then see the screen above.

Next, click “open” next to the template you have just created. This will open the following window, see picture below:

Doctors list
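If it helps to think of it in code, a link-area template with the “create list” option behaves roughly like a loop over every category link on the page. Here is a minimal Python sketch of that idea. It is not how Visual Web Ripper works internally; the markup, the link paths, the placeholder category names and the example.com address are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified markup standing in for the category page.
# Only "Acute medicine" comes from the tutorial; the rest are placeholders.
CATEGORY_PAGE = """
<ul>
  <li><a href="/trauma/acute-medicine">Acute medicine</a></li>
  <li><a href="/trauma/category-b">Category B</a></li>
  <li><a href="/trauma/category-c">Category C</a></li>
</ul>
"""

BASE_URL = "http://example.com/"  # placeholder, not the real directory address


def category_links(page_markup, base_url):
    """Return (label, absolute URL) pairs for every link in the category list."""
    root = ET.fromstring(page_markup)
    return [
        (a.text.strip(), urljoin(base_url, a.get("href")))
        for a in root.iter("a")
    ]


links = category_links(CATEGORY_PAGE, BASE_URL)
for label, url in links:
    # A real scraper would open each URL here; this sketch only prints it.
    print(label, "->", url)
```

Selecting one link and ticking “create list” is the point-and-click version of the `for` loop: you show the tool one example and it repeats the step for all the similar links.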

 

Here we need to save all these doctors, so once again we’ll use the templates feature, but this time select “page area” from the drop-down menu, because these listings are not clickable links or URLs; they are just text. As above, click on the first doctor’s details, then in the templates tab click “new” and create this list.

Next, click the “open” button next to the template you have just created.

You will come to the next window, which looks as follows:

Saving doctor details

Here, just click on the data one by one for the first set of data, i.e. the first doctor only. As we have created a list template, it will save all the doctors’ info by itself.

So click on the first doctor’s name and it will get selected. In the “content” tab, click “new” and save the name as text. Save all the other info the same way, creating an element for each item one by one.
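The same idea written as code may make it clearer why you only need to define the elements for the first doctor. The page-area list template is like a loop over every repeated block, and each content element is one field pulled out of that block. This is only a rough Python sketch; the markup, the class names and the sample values are invented for illustration and will not match the real directory page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each doctor sits in one repeated block of plain text.
DOCTOR_PAGE = """
<div>
  <div class="doctor">
    <p class="title">Doctor One, Speciality A</p>
    <p class="phone">000 0001</p>
  </div>
  <div class="doctor">
    <p class="title">Doctor Two, Speciality B</p>
    <p class="phone">000 0002</p>
  </div>
</div>
"""


def extract_records(page_markup):
    """Apply the same field extraction to every repeated doctor block."""
    root = ET.fromstring(page_markup)
    records = []
    for block in root.iter("div"):
        if block.get("class") != "doctor":
            continue  # skip the outer wrapper, keep only the repeated blocks
        # One dictionary entry per "content element" defined on the first record.
        record = {p.get("class"): (p.text or "").strip() for p in block.iter("p")}
        records.append(record)
    return records


records = extract_records(DOCTOR_PAGE)
for record in records:
    print(record)
```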

Here’s the tricky bit. The doctor’s name and speciality are displayed together, so you need to pick out the name and the speciality one at a time using the “content transformation” feature. Don’t worry about using regex; even if you are new to it, you will get the hang of it, as it’s very simple to learn.

To get the name only, you will use the following regex code. See picture below:

Content transformation
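If you would like to try the idea outside the tool, here is a small Python sketch of splitting a combined name and speciality with a regex. It assumes the two parts are separated by a comma, which may not be how the real directory formats them, and it uses Python’s `re` module rather than Visual Web Ripper’s own content transformation screen, so treat the pattern as an illustration only.

```python
import re

# Assumed format: "Name, Speciality". The real directory may use a
# different separator, in which case the pattern needs adjusting.
NAME_AND_SPECIALITY = re.compile(
    r"^\s*(?P<name>[^,]+?)\s*,\s*(?P<speciality>.+?)\s*$"
)


def split_title(text):
    """Return (name, speciality); speciality is empty if no comma is found."""
    match = NAME_AND_SPECIALITY.match(text)
    if match is None:
        return text.strip(), ""
    return match.group("name"), match.group("speciality")


name, speciality = split_title("Doctor One, Speciality A")
print(name)
print(speciality)
```

To save the name and the speciality as two separate fields, you would keep the first group for one element and the second group for the other.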

That’s it. Enjoy extracting. Do not hesitate to write to me if something doesn’t work.
