Thursday, 27 November 2014

Webscraping using readLines and RCurl

There is a massive amount of data available on the web. Some of it is in the form of precompiled, downloadable datasets which are easy to access. But the majority of online data exists as web content such as blogs, news stories and cooking recipes. With precompiled files, accessing the data is fairly straightforward; just download the file, unzip if necessary, and import into R. For “wild” data, however, getting it into an analyzable format is more difficult. Accessing online data of this sort is sometimes referred to as “webscraping”. Two R facilities, readLines() from the base package and getURL() from the RCurl package, make this task possible.

readLines

For basic webscraping tasks the readLines() function will usually suffice. readLines() allows simple access to webpage source data on non-secure servers. In its simplest form, readLines() takes a single argument – the URL of the web page to be read:

web_page <- readLines("http://www.interestingwebsite.com")

As an example of a (somewhat) practical use of webscraping, imagine a scenario in which we wanted to know the 10 most frequent posters to the R-help listserve for January 2009. Because the listserve is on a secure site (i.e. it has https:// rather than http:// in the URL) we can't easily access the live version with readLines(). So for this example, I've posted a local copy of the list archives on this site.

One note: by itself, readLines() can only acquire the data. You'll need to use grep(), gsub() or equivalents to parse the data and keep what you need.

# Get the page's source
web_page <- readLines("http://www.programmingr.com/jan09rlist.html")
# Pull out the appropriate lines
author_lines <- web_page[grep("<I>", web_page)]
# Delete unwanted characters in the lines we pulled out
authors <- gsub("<I>", "", author_lines, fixed = TRUE)
# Present only the ten most frequent posters
author_counts <- sort(table(authors), decreasing = TRUE)
author_counts[1:10]
[webscrape results]


We can see that Gabor Grothendieck was the most frequent poster to R-help in January 2009.

The RCurl package

To get more advanced HTTP features such as POST capabilities and https access, you'll need to use the RCurl package. To do webscraping tasks with the RCurl package, use the getURL() function. After the data has been acquired via getURL(), it needs to be restructured and parsed. The htmlTreeParse() function from the XML package is tailored for just this task. Because getURL() can access secure sites, we can use the live archive as our example this time.

# Install the RCurl package if necessary
install.packages("RCurl", dependencies = TRUE)
library("RCurl")
# Install the XML package if necessary
install.packages("XML", dependencies = TRUE)
library("XML")
# Get first quarter archives
jan09 <- getURL("https://stat.ethz.ch/pipermail/r-help/2009-January/date.html", ssl.verifypeer = FALSE)
jan09_parsed <- htmlTreeParse(jan09)
# Continue on similar to above
...
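Assuming the archive page marks poster names with <i> tags, as the earlier readLines example suggests, here is a minimal sketch of how the parsing might continue, re-parsing with internal nodes so the tree can be queried with XPath:

# Parse into an internal document so it can be queried with XPath
jan09_parsed <- htmlTreeParse(jan09, useInternalNodes = TRUE)
# Pull the text of every <i> element (assumed to hold the poster names,
# mirroring the <I> tags grepped for in the readLines example above)
authors <- xpathSApply(jan09_parsed, "//i", xmlValue)
# Tabulate and show the ten most frequent posters
author_counts <- sort(table(authors), decreasing = TRUE)
author_counts[1:10]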

For basic webscraping tasks readLines() will be enough and avoids overcomplicating the task. For more difficult procedures, or for tasks requiring other HTTP features, getURL() or other functions from the RCurl package may be required. For more information on cURL visit the project page here.

Source: http://www.r-bloggers.com/webscraping-using-readlines-and-rcurl-2/

Wednesday, 26 November 2014

Web Data Extraction: driving data your way

Most businesses rely on the web to gather data such as product specifications, pricing information, market trends, competitor information, and regulatory details. More often than not, companies collect this data manually—a process that not only takes a significant amount of time, but also has the potential to introduce costly errors.

By automating data extraction, you're able to free yourself (and your pointer finger) from hours of copy/pasting, eliminate human errors, and focus on the parts of your job that make you feel great.

Web data extraction: What it is, why it's used, and how to get it right on an ongoing basis

Web data extraction, screen scraping, web harvesting—while these terms may have different connotations they all essentially point to the same thing: plucking data from the web and populating it in an organized way in another place for further analysis or more focused use. In an era where “big data” has become a commonplace concept, the appeal of web data extraction has grown; it’s an extremely efficient alternative to web browsing, and culls very specific data for a focused purpose.

How it's used

While each company’s needs vary, data extraction is often used for:

    Competitive intelligence, including web popularity, social perception, other sites linking to them, and placement of competitor advertisements

    Gathering financial data including stock market movement, product pricing, and more

    Creating continuity between price sheets and online websites, catalogs, or inventory databases

    Capturing product specifications like dimensions, color, and materials

    Pulling tabular data from multiple sources for in-depth analysis

Interestingly, some people even find that web data extraction can aid them in their leisure time as well, pulling data from blogs and websites that pertain to their hobbies or interests, and creating their own library of organized information on a topic. Perhaps, for instance, you want a list of all the designers that George Clooney wears (hey- we won’t question what you do in your free time). By using web scraping tools, you could automatically extract this type of data from, say, a fashion blogger who follows celebrity style, and create your own up-to-date shopping list of items.

How it's done

When you think of gathering data from the web, you should mentally juxtapose two different images: one of gathering a bucket of sand one grain at a time, and one of filling a bucket with a shovel that has the perfect scoop size to fill it completely in one sitting. While clearly the second method makes the most sense, the majority of web data extraction happens much like the first process--manually, and slowly.

Let’s take a look at a few different ways organizations extract data today.

The least productive way: manually

While this method is the least efficient, it’s also the most widespread. On the plus side, you need to learn absolutely nothing except “Ctrl+C/V” to use this method, which explains why it is the generally preferred method, despite the hours of time it can take. Imagine, for instance, managing a sales spreadsheet that keeps inventory up to date so that the information can be properly disseminated to a global sales team. Not only does it take a significant amount of time to update the spreadsheet with information from, say, your internal database and manufacturer’s website, but information may change rapidly, leaving sales reps with inaccurate information regardless.

Finding someone in the organization with a talent for programming languages like Python

Generally, automating a task without dedicated automation software requires programming, and therefore an internal resource with a solid familiarity with programming languages to create the task and corresponding script. While most organizations do, in fact, have a resource in IT or engineering with this type of ability, it often doesn’t seem like a worthy time investment for that person to derail the initiatives he or she is working on to automate web data extraction. Additionally, if companies do choose to automate using in-house resources, that person will find himself beholden to a continuing obligation, since he or she will need to adjust scripting if web objects and attributes change, disabling the task.

Outsourcing via Elance or oDesk

Unless there is a dedicated resource ready to automate and maintain data extraction processes (and most organizations wouldn’t necessarily choose to use their in-house employee time this way), companies might turn to outsourcing companies such as Elance or oDesk to hire contract help. While this is an effective way to automate a task using a resource that has a level of acumen in automation, it represents an additional cost--be it one time or on a regular basis as data extraction requirements change or increase.

Using Excel web queries

Since data extracted from the web is more often than not populated into an Excel spreadsheet, it’s no wonder that Excel includes web query tools expressly for that purpose. These tools are particularly useful in pulling tabular data from a website (such as product specifications, legal codes, stock prices, and a host of other information) and automatically pushing the data into a spreadsheet. Excel queries do have limitations and a learning curve, however, particularly when creating dynamic web queries. And clearly, if you’re hoping to populate the information in other sources, such as external databases, there is yet another level of difficulty to navigate.

How automation simplifies web data extraction

Culling web data quickly

Using automation is the simplest way to extract web data. As you execute the steps necessary to perform the task one time, a macro recorder captures each action, automatically generates an easily-editable script, and lets you specify how often you would like to repeat the task, and at what speed.

Maintaining the highest level of accuracy

With humans copy/pasting data, or comparing between multiple screens and entering data manually into a spreadsheet, you’re likely to run into accuracy issues (sometimes directly proportionate to the amount of time spent on the task and the amount of coffee in the office!). Automation software ensures that “what you see is what you get,” and that data is picked up from the web and put back down where you want it without a hitch.

Storing web data in your preferred format

Not only can you accurately transfer data with automation software, you can also ensure that it’s populated into spreadsheets or databases in the format you prefer. Rather than simply dumping the data into a spreadsheet, you can ensure that the right information is put into the proper column, row, field, and style (think, for instance, of the difference between writing a birth date as “03/13/1912” and “12/3/13”).

Simplifying data analysis

Automation software allows you to aggregate data from disparate sources or enormous stockpiles of structured or unstructured data in a way that makes sense for your business analysis needs. This way, the majority of employees in an organization can perform some level of analysis on their own, making it easier to surface information that informs business decisions.

Reacting to changes without a hitch

Because automation software is built to recognize icons, images, symbols, and other objects regardless of their position on a screen, it can automate processes in a self-perpetuating manner. For example, let’s say you automate data retrieval from a certain chart on a retailer’s website without automation software. If the retailer decides to move that object to another area of the screen, your task would no longer produce accurate results (or work at all), leaving you to make changes to the script (or find someone who can), or re-record the task altogether. With image recognition capabilities, however, the system “memorizes” the object itself, not merely its coordinates, so that the task can continue to run irrespective of changes.

The wide sweeping appeal of automation software

Companies often pick a comprehensive automation solution not only because of its ability to effectively automate any web data extraction task, but also because it goes beyond data extraction. Automation software can permeate into other areas of the business as well, making tasks such as application integration, data migration, IT processes, Excel automation, testing, and routine tasks such as launching applications or formatting files faster and more accurate. Because it requires no programming experience to use, adoption rates are higher and businesses get more “bang for their buck.”

Almost any organization can benefit from using automation software, particularly as they grow and scale. If you are looking to quit “moving grains of sand” and start claiming back time in your day, there are a few steps you can take:

 Watch a short video that shows how web data extraction is done with automation software

 Download a free trial and start reaping the benefits of downloading even just a couple of tasks today.

 See how tasks are automated with our short, step-by-step how-to-sheets (and then give it a try yourself!)

Source: https://www.automationanywhere.com/web-data-extraction

Wednesday, 19 November 2014

Web Scraping for Fun & Profit

There are a number of ways to retrieve data from a backend system within mobile projects. In an ideal world, everything would have a RESTful JSON API – but often, this isn’t the case. Sometimes, SOAP is the language of the backend. Sometimes, it’s some proprietary protocol which might not even be HTTP-based. Then, there’s scraping.

Retrieving information from web sites as a human is easy. The page communicates information using stylistic elements like headings, tables and lists – this is the communication protocol of the web. Machines retrieve information with a focus on structure rather than style, typically using formats like XML or JSON. Web scraping attempts to bridge this human protocol into a machine-readable format like JSON.

As a means of getting to data, it doesn’t get much worse than web scraping. Scrapers were often built with regular expressions to retrieve the data from the page. Difficult to craft and impossible to maintain, this means of retrieval was far from ideal. The risks are many – even the slightest layout change on a web page can upset scraper code and break the entire integration. It’s a fragile means for building integrations, but sometimes it’s the only way.

Having built a scraper service recently, the most interesting observation for me is how far we’ve come from these “dark days”. Node.js and its massive ecosystem of community-built modules have done much to change how these scraper services are built.

Effectively Scraping Information

Websites are built on the Document Object Model, or DOM. This is a tree structure which represents the information on a page. By interpreting the source of a website as a DOM, we can retrieve information much more reliably than by using methods like regular expression matching. The most popular method of querying the DOM is jQuery, which enables us to build powerful and maintainable queries for information. The JSDom Node module allows us to use a DOM-like structure in server-side code.

For the purpose of illustration, we’re going to scrape the blog page of FeedHenry’s website. I’ve built a small code snippet that retrieves the contents of the blog and translates it into a JSON API. To find the queries I need to run, first I need to look at the HTML of the page. To do this, in Chrome, I right-click the element I’m looking to inspect on the page, and click “Inspect Element”.

Articles on the FeedHenry blog are a series of ‘div’ elements with the ‘.itemContainer’ class

Searching for a pattern in the HTML to query all blog post elements, we construct the `div.itemContainer` query. In jQuery, we can iterate over these using the .each method:

var posts = [];

$('div.itemContainer').each(function(index, item){
  // Make JSON objects of every post in here, pushing to the posts[] array
});

From there, we pick off the heading, author and post summary using a child selector on the original post, querying the relevant semantic elements:

    Post Title, using jQuery:

    $(item).find('h3').text().trim() // trim, because titles have white space either side

    Post Author, using jQuery:

    $(item).find('.catItemAuthor a').text()

    Post Body, using jQuery:

    $(item).find('p').text()

Adding some JSDom magic to our snippet, and pulling together the above two concepts (iterating through posts, and picking off info from each post), we get this snippet:

var request = require('request'),
    jsdom = require('jsdom');

jsdom.env(
  "http://www.feedhenry.com/category/blog",
  ["http://code.jquery.com/jquery.js"],
  function (errors, window) {
    var $ = window.$, // Alias jQuery
        posts = [];

    $('div.itemContainer').each(function(index, item){
      item = $(item); // make queryable in jQuery
      posts.push({
        heading : item.find('h3').text().trim(),
        author : item.find('.catItemAuthor a').text(),
        teaser : item.find('p').text()
      });
    });

    console.log(posts);
  }
);

A note on building CSS Queries

As with styling web sites with CSS, building effective CSS queries is equally important when building a scraper. It’s important to build queries that are not so specific that they break when the structure of the page changes. Equally important is to pick a query that is not too general, which is likely to select extra data from the page that you don’t want to retrieve.

A neat trick for generating the relevant selector statement is to use Chrome’s “CSS Path” feature in the inspector. After finding the element in the inspector panel, right-click it and select “Copy CSS Path”. This method is good for individual items, but it doesn’t work for picking repeating patterns (like blog posts). Often, the path it gives is much too specific, making for a fragile binding: any changes to the page’s structure will break the query.

Making a Re-usable Scraping Service

Now that we’ve retrieved information from a web page, and made some JSON, let’s build a reusable API from this. We’re going to make a FeedHenry Blog Scraper service in FeedHenry3. For those of you not familiar with service creation, see this video walkthrough.

We’re going to start by creating a “new mBaaS Service”, rather than selecting one of the off-the-shelf services. To do this, we modify the application.js file of our service to include one route, /blog, which includes our code snippet from earlier:

// just boilerplate scraper setup
var mbaasApi = require('fh-mbaas-api'),
    express = require('express'),
    mbaasExpress = mbaasApi.mbaasExpress(),
    cors = require('cors'),
    request = require('request'),
    jsdom = require('jsdom');

var app = express();

app.use(cors());
app.use('/sys', mbaasExpress.sys([]));
app.use('/mbaas', mbaasExpress.mbaas);
app.use(mbaasExpress.fhmiddleware());

// Our /blog scraper route
app.get('/blog', function(req, res, next){
  jsdom.env(
    "http://www.feedhenry.com/category/blog",
    ["http://code.jquery.com/jquery.js"],
    function (errors, window) {
      var $ = window.$, // Alias jQuery
          posts = [];

      $('div.itemContainer').each(function(index, item){
        item = $(item); // make queryable in jQuery
        posts.push({
          heading : item.find('h3').text().trim(),
          author : item.find('.catItemAuthor a').text(),
          teaser : item.find('p').text()
        });
      });

      return res.json(posts);
    }
  );
});

app.use(mbaasExpress.errorHandler());

var port = process.env.FH_PORT || process.env.VCAP_APP_PORT || 8001;
var server = app.listen(port, function() {});

We’re also going to write some documentation for our service, so we (and other developers) can interact with it using the FeedHenry discovery console. We’re going to modify the README.md file to document what we’ve just done using API Blueprint documentation format:

# FeedHenry Blog Web Scraper

This is a feedhenry blog scraper service. It uses the `JSDom` and `request` modules to retrieve the contents of the FeedHenry developer blog, and parse the content using jQuery.

# Group Scraper API Group

# blog [/blog]

Blog Endpoint

## blog [GET]

Get blog posts endpoint, returns JSON data.

+ Response 200 (application/json)

    + Body

            [{ blog post}, { blog post}, { blog post}]

We can now try out the scraper service in the studio and see the response.

Scraping – The Ultimate in API Creation?

Now that I’ve described some modern techniques for effectively scraping data from web sites, it’s time for some major caveats. First, WordPress blogs like ours already have feeds and APIs available to developers - there’s no need to ever scrape any of this content. Web scraping is not a replacement for an API. It should be used only as a last resort, after every endeavour to discover an API has already been made. Using a web scraper in a commercial setting requires much time set aside to maintain the queries, and an agreement with the source whose data is being scraped to alert developers in the event the page changes structure.

With all this in mind, it can be a useful tool to iterate quickly on an integration when waiting for an API, or as a fun hack project.

Source: http://www.feedhenry.com/web-scraping-fun-profit/

Monday, 17 November 2014

Data Scraping Guide for SEO & Analytics

Data scraping can help you a lot in competitive analysis as well as pulling out data from your client’s website like extracting the titles, keywords and content categories.

You can quickly get an idea of which keywords are driving traffic to a website, which content categories are attracting links and user engagement, what kind of resources it will take to rank your site... and the list goes on.

 Scraping Organic Search Results

By scraping organic search results you can quickly find out your SEO competitors for a particular search term. You can determine the title tags and the keywords they are targeting.

    The easiest way to scrape organic search results is by using the SERPs Redux bookmarklet.

For example, if you scrape organic listings for the search term ‘seo tools’ using this bookmarklet, you may see results like the following.

You can copy and paste the websites’ URLs and title tags easily into your spreadsheet from the text boxes.

    Pro Tip by Tahir Fayyaz:

    Just wanted to add a tip for people using the SERPs Redux bookmarklet.

    If you have data separated over multiple pages that you want to scrape, you can use AutoPager for Firefox or Chrome to load X pages all on one page and then scrape it all using the bookmarklet.

Scraping on page elements from a web document

Through this Excel Plugin by Niels Bosma you can fetch several on-page elements from a URL or list of URLs like:

    Title tag
    Meta description tag
    Meta keywords tag
    Meta robots tag
    H1 tag
    H2 tag
    HTTP Header
    Backlinks
    Facebook likes etc.

Scraping data through Google Docs

Google Docs provides a function known as importXML through which you can import data from web documents directly into a Google Docs spreadsheet. However, to use this function you must be familiar with XPath expressions.

    Syntax: =importXML(URL,X-path-query)

    url=> URL of the web page from which you want to import the data.

    x-path-query => the XPath query used to select and extract the data from the web page.

You need to understand the following things about XPath in order to use the importXML function:

1. XPath terminology – what nodes are and the kinds of nodes, like element nodes, attribute nodes, etc.

2. Relationships between nodes – how different nodes are related to each other, like parent nodes, child nodes, siblings, etc.

3. Selecting nodes – a node is selected by following a path known as the path expression.

4. Predicates – they are used to find a specific node or a node that contains a specific value. They are always embedded in square brackets.

If you follow the XPath tutorial then it should not take you more than an hour to understand how XPath expressions work.

Understanding path expressions is easy, but building them is not. That is why I use a Firebug extension named ‘X-Pather‘ to quickly generate path expressions while browsing HTML and XML documents.

Since X-Pather is a Firebug extension, you first need to install Firebug in order to use it.

 How to scrape data using importXML()

Step-1: Install Firebug – through this add-on you can edit & monitor CSS, HTML, and JavaScript while you browse.

Step-2: Install X-Pather – through this tool you can generate path expressions while browsing a web document. You can also evaluate path expressions.

Step-3: Go to the web page whose data you want to scrape. Select the type of element you want to scrape. For example, if you want to scrape anchor text, then select one anchor text.

Step-4: Right-click on the selected text and then select ‘show in XPather’ from the drop-down menu.

Then you will see the Xpather browser from where you can copy the X-path.

Here I have selected the text ‘Google Analytics’, which is why the XPather browser is showing ‘Google Analytics’ in the content section. This is my XPath:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[1]/a

Pretty scary, huh? It can be even scarier if you try to build it manually. I want to scrape the names of all the analytics tools from this page: killer seo tools. For this I need to modify the aforesaid path expression into a formula.

This is possible only if I can determine the static and variable nodes between two or more path expressions. So I determined the path expression of another element, ‘Google Analytics Help center’ (second in the list), through X-Pather:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]/li[2]/a

Now we can see that the node which has changed between the original and new path expressions is the final ‘li’ element: li[1] to li[2]. So I can come up with the following final path expression:

    /html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]//li/a

Now all I have to do is copy-paste this final path expression as an argument to the importXML function in a Google Docs spreadsheet. The function will then extract the names of all the Google Analytics tools from my killer SEO tools page.
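For illustration, the finished formula would look something like this (the URL below is just a placeholder – substitute the actual address of the killer SEO tools page):

    =importXML("http://example.com/killer-seo-tools/", "/html/body/div[@id='page']/div[@id='page-ext']/div[@id='main']/div[@id='main-ext']/div[@id='mask-3']/div[@id='mask-2']/div[@id='mask-1']/div[@id='primary-content']/div/div/div[@id='post-58']/div/ol[2]//li/a")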

This is how you can scrape data using importXML.

    Pro Tip by Niels Bosma: “Anything you can do with importXML in Google docs you can do with XPathOnUrl directly in Excel.”

    To use the XPathOnUrl function you first need to install Niels Bosma’s Excel plugin; it is not a built-in function in Excel.

Note: You can also use a free tool named Scrapy for data scraping. It is an open source web scraping framework and is used to extract structured data from web pages & APIs. You need to know Python (a programming language) in order to use Scrapy.

Scraping on-page elements of an entire website

There are two awesome tools which can help you in scraping on-page elements (title tags, meta descriptions, meta keywords etc) of an entire website. One is the evergreen and free Xenu Link Sleuth and the other is the mighty Screaming Frog SEO Spider.

What makes these tools amazing is that you can scrape the data of an entire website and download it into Excel. So if you want to know the keywords used in the title tags on all the web pages of your competitor’s website, then you know what you need to do.

Note: Save the Xenu data as a tab separated text file and then open the file in Excel.

 Scraping organic and paid keywords of an entire website

The tool that I use for scraping keywords is SEMrush. Through this awesome tool I can determine which organic and paid keywords are driving traffic to my competitor’s website and then download the whole list into Excel for keyword research. You can get more details about this tool through this post: Scaling Keyword Research & Competitive Analysis to new heights

Scraping keywords from a webpage

Through this Excel macro spreadsheet from seogadget you can fetch keywords from the text of a URL or list of URLs. However, you need an Alchemy API key to use this macro.

You can get the Alchemy API key from here.

Scraping keywords data from Google Adwords API

If you have access to Google Adwords API then you can install this plugin from seogadget website. This plugin creates a series of functions designed to fetch keywords data from the Google Adwords API like:

getAdWordAvg() – returns average search volume from the AdWords API.

getAdWordStats() – returns local search volume and the previous 12 months, separated by commas.

getAdWordIdeas() – returns keyword suggestions based on the API suggest service.

Check out this video to see how this plug-in works.

Scraping Google Adwords Ad copies of any website

I use the tool SEMrush to scrape and download the Google AdWords ad copies of my competitors into Excel and then mine keywords or just get ad copy ideas. Go to SEMrush, type in the competitor’s website URL and then click on the ‘Adwords Ad texts’ link on the left hand side menu. Once you see the report you can download it into Excel.

Scraping back links of an entire website

The tool that you can use to scrape and download the backlinks of an entire website is Open Site Explorer.

Scraping Outbound links from web pages

Garrett French of Citation Labs has shared an excellent tool: OBL Scraper + Contact Finder, which can scrape outbound links and contact details from a URL or URL list. This tool can help you a lot in link building. Check out this video to know more about this awesome tool:

Scraper – Google chrome extension

This Chrome extension can scrape data from web pages and export it to Google Docs. This tool is simple to use. Select the web page element/node you want to scrape, then right-click on the selected element and select ‘scrape similar’.

Any element/node that’s similar to what you have selected will be scraped by the tool, which you can later export to Google Docs. One big advantage of this tool is that it reduces our dependency on building XPath expressions and makes scraping easier.

See how easy it is to scrape name and URLs of all the Analytics tools without using Xpath expressions.

Source: http://www.optimizesmart.com/data-scraping-guide-for-seo/

Thursday, 13 November 2014

Interactive Crawls for Scraping AJAX Pages on the Web

Crawling pages on the web has become an everyday affair for most enterprises. We also often come across offline businesses who would like data gathered from the web for internal analyses – all of this, ultimately, to serve customers faster and better. At times, when the crawl job is both high-end and high-scale, businesses also consider DaaS providers to supplement their efforts.

However, the web landscape too has evolved with newer technologies that provide fancy experiences to web users. AJAX elements are one such common aid that leave even the DaaS providers perplexed. From a user’s point of view they come in various forms:

1. Load more results on the same page

2. Filter results based on various selection criteria

3. Submit forms, etc.

When crawling a non-AJAX page, simple GET requests do the job. However, AJAX pages work with POST requests that are not easy to trace for a normal bot.

[Illustration: difference between a GET request and a POST request]
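To make the distinction concrete, here is a minimal sketch in R using the RCurl package introduced earlier in this collection; the endpoint and form-field names below are hypothetical, purely for illustration:

library(RCurl)

# A non-AJAX page: a single GET request returns the complete HTML
page_html <- getURL("http://example.com/products")

# An AJAX-backed listing: the data sits behind a POST endpoint that expects
# form fields (the endpoint and field names here are made up)
ajax_response <- postForm("http://example.com/ajax/load-results",
                          category = "tickets",
                          page = "2",
                          style = "POST")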

At PromptCloud, from our experience with a number of AJAX sites on the web, we’ve crossed the tech barrier. Below is a quick review of the challenges that come with AJAX crawling and their indicative solutions:

1. JavaScript emulation – A bot essentially emulates human browsing to fetch pages. When this needs to be done for JavaScript components on a page, it gets tricky. A headless browser, which emulates human interaction with a web page without an interface, is the current approach: these browsers click on the various elements/dropdown lists that are embedded within JavaScript code and capture the responses to be transferred to programs. Which headless browser to pick depends on what fits well into your current stack (see the sketch after this list).

2. Fetch bandwidths – Unlike GET requests, which complete pretty quickly, POST requests take quite a bit of time due to the number of events involved per fetch. Hence a good amount of bandwidth needs to be allocated in order to receive the response. For the same reason, wait times need to be taken care of too, else you might end up with incomplete responses.

3. .NET architectures – This is a more complex scenario related to maintaining the View State. Most of the postbacks come with an event and its validation. The bot needs to track the view state and pass validations for the event to occur, so that the code can be executed and results captured. This is achieved by adopting a mechanism to restore states if things break midway.

4. Page encoding – Request and response headers need to be taken care of on AJAX pages. The request needs to be sent in the exact format expected by the server (content-type or media type, accept fields, etc.) and, similarly, responses need to be parsed based on the content-type.
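As a rough sketch of the browser-driven approach mentioned in point 1 (using the RSelenium package; the URL, CSS selector and wait time below are hypothetical and would need to be adapted to the target site):

library(RSelenium)

# Connect to a Selenium server assumed to be running locally on port 4444
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L,
                      browserName = "phantomjs")
remDr$open()
remDr$navigate("http://example.com/ajax-listing")

# Click the "load more" control that is wired up in the page's JavaScript
# (the CSS selector is a made-up example)
load_more <- remDr$findElement(using = "css selector", "a.load-more")
load_more$clickElement()

# Allow the POST responses to complete before reading the page (see point 2)
Sys.sleep(2)

# Capture the rendered source after the AJAX content has loaded
rendered_html <- remDr$getPageSource()[[1]]
remDr$close()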

A Use Case

One of our clients, who is in the business of selling event tickets at discounted rates, had us crawl one of the ticketing sites on the web weekly; it was one of the most complex AJAX crawls we’ve dealt with so far. For the data that was to be extracted, multiple AJAX fetches were needed depending on the selections made. Requests had to be made for a combination of items from the dropdown box. These came with cookies and session IDs. To add to the challenge, the site was extremely dynamic and changed its structure every week, making it difficult for us to follow what data was where on the page.

We developed an AJAX crawler specific to this site to take care of all the dynamics. Response times were taken care of so that we didn’t miss any relevant information. We included an ML component to improve the crawler which is now pretty stable irrespective of changes on the site.

Overall, AJAX crawling requires more compute power in addition to the tech expertise. And because there’s no uniformity on the web, there’s always a new challenge to overcome in this landscape. It wouldn’t be an overstatement to say we’ve done a good job at that so far and have developed the knack :)

Reach out to us for any kind of web scraping/ crawling- either AJAX or not. We’ll take care of the complexities.

Source: https://www.promptcloud.com/blog/web-scraping-interactive-ajax-crawls/

Monday, 10 November 2014

My Experience in Choosing a Web Scraping Service

Recently I decided to outsource a web scraping project to another company. I typed “web scraping service” in Google, chose six services from the first two search result pages and sent the project specifications to all of them to get quotes. Eventually I decided to go another way and did not order the services, but my experience may be useful for others who want to entrust web scraping jobs to third party services.

If you are interested in price comparisons only and not ready to read the whole story just scroll down.

A list of web scraping services I sent my project to:

    www.datahen.com - Canadian web scraping service with nice web design
    webdata-scraping.com - Indian service by Keval Kothari
    www.iwebscraping.com - India based web scraping company (same as www.3idatascraping.com)
    scrapinghub.com - A scraping service founded by creators of Scrapy
    web-scraper.com - Yet another web scraping service
    grepsr.com - A scraping service that we already reviewed two years ago

Sending the request

All the services except scrapinghub.com have quite simple forms for the description of the project requirements. Basically, you just need to give your contact details and a project description in any form. Some of them are pretty (like datahen.com), some of them are more ascetic (like web-scraper.com), but all of them allow you to send your requirements to developers.

Scrapinghub.com has a quite long form, but most of the fields are optional and all the questions are quite natural. If you really know what you need, then it won’t be hard to answer all of them; moreover they rather help you to describe your need in detail.

Note that, in the context of the project, I didn’t request a scraper itself. I asked to receive data on a weekly basis only.

Getting responses

Since I sent my request on Sunday it would have been ok not to receive responses the same day, but I got the first response in 3 hrs! It was from web-scraper.com and stated that this project will cost me $250 monthly. Simple and clear. Thank you, Thang!

Right after that, I received the second response. This time it was Keval from webdata-scraping.com. He had some questions regarding the project. Then after two days he wrote me that it would be hard to scrape some of my data with the software he uses, and that he will try to use a custom scraper. After that he disappeared… ((

Then on Monday I received Cost & ETA details from datahen.com. It looked quite professional and contained not only the price but also a time estimate. They were ready to create such a scraper in 3-4 days for $249 and then maintain it for just $65/month.

On the same day I received a quote from iwebscraping.com. It was $60 per week. Everything is fine, but I’d like to mention that it wasn’t the last letter from them. After I replied to them (right after receiving the quote), I received a reminder letter from them every other day for about a week. So be ready for aggressive marketing if you ask them for a quote )).

Finally, two days after requesting a quote, I got a response from scrapinghub.com. Paul Tremberth wrote me that they were ready to build a scraper for $1200 and then maintain it for $300/month.

It is interesting that I have never received an answer from grepsr.com! Two years ago it was the first web scraping service we faced on the web, but now they simply ignored my request! Or perhaps they didn’t receive it somehow? Anyway I had no time for investigation.

So what?

Let us put everything together. Out of six web scraping services I received four quotes with the following prices:

Service             Setup fee    Monthly fee

web-scraper.com     -            $250
datahen.com         $249         $65
iwebscraping.com    -            $240
scrapinghub.com     $1200        $300


From this table you can see that scrapinghub.com appears to be the most expensive service among those compared.

EDIT: That $300/month gives you as much support and development as needed to fix a 5M multi-site web crawler, for example. If you need a cheaper solution you can use their Autoscraping tool, which is free, and would have cost around $2/month to crawl at my requested rates.

The average cost of monthly scraping is about $250, but from a long term perspective datahen.com may save you money due to their low monthly fee.

That’s it! If I had enough money available it would be interesting to compare all these services in operation and provide you a more complete report, but this is all I have for now.

If you have anything to share about your experience in using similar services, please contribute to this post by commenting on it below. Cheers!

Source: http://scraping.pro/choosing-web-scraping-service/

Friday, 7 November 2014

Web Scraping: Business Intelligence

Web scraping is simply the extraction of information, both hidden and visible, from the internet. It is one of the latest technologies used in harvesting data from web pages. It has been used to extract useful information for practical and beneficial applications, and its interpretation has been tested in decision making. Web scraping is a new term that overshadows the traditional data harvesting technique that was used before. It has been regarded as knowledge discovery in databases for research and even marketing monitoring.

This article explores the various ways in which web scraping can be put to use for business intelligence.

Web scraping services have been used by many companies that have a strong customer focus. These companies come from sectors like retail, financial services, and marketing and communication organizations. It is quite important to realize that web scraping has great significance and impact in varied commercial applications for the better understanding and prediction of critical data. The data may range from stocks to consumer behaviors. Consumer behaviors are better shown in trends like customer profiles, purchasing and industry analysis, among others.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-business-intelligence/