Last post May 22, 2019 01:33 PM by Maximili
Apr 12, 2019 11:44 PM|WilliamSnell|LINK
I've built a web scraper that logs into a particular site, then uses a collection of IDs to make requests for specific users and scrapes some of the resulting HTML. I used ScrapySharp after I couldn't get Iron Web Scraper to authenticate correctly,
which worked out well for me, since ScrapySharp is free to use and Iron Web Scraper costs $399 at a minimum. I'd still like to know why the HttpIdentity property of Iron Web Scraper didn't authenticate to the site I'm querying, but that's for another day.
ScrapySharp is working well for me right now. However, I'm processing over 60,000 IDs, and it takes FOREVER. So I needed to implement asynchronous methods for better performance.
This code loads the login page and submits the login form:
// Load the login page and locate the login form by its name attribute.
var homepage = _browser.NavigateToPage(_loginUri);
var form = homepage.FindForm("login");
form.Method = HttpVerb.Post;

// Credentials come from configuration rather than being hard-coded.
form["loginusername"] = _configuration.GetValue<string>("LoginCreds:username");
form["loginpassword"] = _configuration.GetValue<string>("LoginCreds:password");
var resultPage = form.Submit();
This was pretty easy. ScrapySharp has methods for finding elements by name (which is what I used) or by id. In my case, the form wasn't given an id. After logging in, there's no need to pass any tokens around. The ScrapingBrowser object holds the state.
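For comparison, here is a minimal sketch of the same login flow using the id-based lookup instead. This is not runnable as-is (it needs the ScrapySharp package and a real site); the URL, the "loginForm" id, and the field names are all placeholders, not taken from the original post:

```csharp
using System;
using ScrapySharp.Network;

class LoginSketch
{
    static void Main()
    {
        var browser = new ScrapingBrowser { AllowAutoRedirect = true };

        // Placeholder URL - substitute the real login page.
        var page = browser.NavigateToPage(new Uri("https://example.com/login"));

        // FindFormById is the id-based counterpart of FindForm by name;
        // "loginForm" and the field names below are hypothetical.
        var form = page.FindFormById("loginForm");
        form.Method = HttpVerb.Post;
        form["username"] = "user";
        form["password"] = "pass";

        // The ScrapingBrowser keeps the session cookies, so subsequent
        // NavigateToPage calls remain authenticated - no tokens to pass around.
        var result = form.Submit();
    }
}
```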
The following code sets up the requests. I tried several approaches, including the example on the Microsoft site - no dice. Task.WaitAll finally gave me what I needed: async processing that I can work on after completion.

public void ScrapeAsync()
{
    // Create a collection of query tasks - one per URI.
    IEnumerable<Task> taskList =
        from uri in _uriRequests select RequestTempData(uri);

    // ToArray() materializes (and starts) the tasks; WaitAll blocks until all finish.
    Task.WaitAll(taskList.ToArray());

    // JToken.Parse formats JSON so it's not simply output on a single line.
    var json = JsonConvert.SerializeObject(_concurrentResults);
}
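The fan-out pattern above can be shown without any scraping dependency. This is a minimal self-contained sketch of the Task.WaitAll approach; FetchAsync is a stand-in for the real page request, and the names here are illustrative, not from the original post:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class FanOutSketch
{
    // Thread-safe collection, since many tasks add results concurrently.
    static readonly ConcurrentBag<string> Results = new ConcurrentBag<string>();

    // Stand-in for the real page request; Task.Delay simulates network latency.
    static async Task FetchAsync(int id)
    {
        await Task.Delay(10);
        Results.Add($"user-{id}");
    }

    static void Main()
    {
        var ids = Enumerable.Range(1, 100);

        // Start all requests up front, then block until every one completes.
        var tasks = ids.Select(FetchAsync).ToArray();
        Task.WaitAll(tasks);

        Console.WriteLine(Results.Count); // 100
    }
}
```

Inside an async method, `await Task.WhenAll(tasks)` does the same job without blocking a thread; Task.WaitAll is fine in a synchronous entry point like this one.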
That's my setup. Here's the code that performs the actual scraping:
private async Task RequestTempData(string uri)
{
    try
    {
        var loggedInPage = await _browser.NavigateToPageAsync(new Uri(uri));
        var rows = loggedInPage.Html.CssSelect("#tableId tr").ToList();

        if (rows.Count == 1) return; // only the add-new-row input row came back

        // Skip last row - input fields for adding new row.
        for (var r = 0; r < rows.Count - 1; r++)
        {
            var tableDataCells = rows[r].CssSelect("td").ToList();
            var result = new MyResultObj
            {
                Id = RemoveSpaceChars(tableDataCells[0].InnerText),          // cell order
                Description = RemoveSpaceChars(tableDataCells[1].InnerText), // assumed from
                Percent = RemoveSpaceChars(tableDataCells[2].InnerText)      // property names
            };
            _concurrentResults.Add(result);
        }
    }
    catch (Exception ex)
    {
        // Log ex and record the skipped id; swallowing here keeps one bad id
        // from failing the whole batch.
    }
}
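One caution with 60,000 IDs: starting every request at once can exhaust sockets or trip the server's rate limits. A common refinement (not in the original post) is to cap concurrency with a SemaphoreSlim. The sketch below throttles simulated work to at most 10 in flight and tracks the high-water mark to show the cap holds:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottleSketch
{
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(10); // at most 10 in flight
    static int _inFlight, _peak;

    static async Task FetchAsync(int id)
    {
        await Gate.WaitAsync(); // blocks (asynchronously) when 10 are already running
        try
        {
            // Track the high-water mark of concurrent work.
            var now = Interlocked.Increment(ref _inFlight);
            InterlockedMax(ref _peak, now);
            await Task.Delay(5); // stand-in for the real page request
            Interlocked.Decrement(ref _inFlight);
        }
        finally
        {
            Gate.Release();
        }
    }

    // Lock-free "store max" via compare-and-swap.
    static void InterlockedMax(ref int target, int value)
    {
        int current;
        while (value > (current = Volatile.Read(ref target)) &&
               Interlocked.CompareExchange(ref target, value, current) != current) { }
    }

    static void Main()
    {
        Task.WaitAll(Enumerable.Range(1, 200).Select(FetchAsync).ToArray());
        Console.WriteLine(_peak <= 10); // True
    }
}
```

The same gate pattern drops straight into RequestTempData: WaitAsync before NavigateToPageAsync, Release in a finally block.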
Synchronous code performing this was VERY slow: it took 10 minutes to process about 1,200-1,400 records. With the asynchronous code, I can process that many in about 40 seconds. I save the skipped IDs to one file and the formatted JSON to another. This
was a royal pain to get set up correctly, so I wanted to leave this here for others who might find it useful.
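The JToken.Parse formatting trick mentioned above can be shown in isolation. Assuming Newtonsoft.Json is referenced, this round-trips a serialized object into indented JSON before writing it out (the anonymous objects and the file path are illustrative only):

```csharp
using System;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

class JsonFormatSketch
{
    static void Main()
    {
        var results = new[]
        {
            new { Id = "1001", Description = "alpha", Percent = "25" },
            new { Id = "1002", Description = "beta",  Percent = "75" }
        };

        // SerializeObject emits everything on one long line by default.
        var flat = JsonConvert.SerializeObject(results);

        // Parsing and re-emitting with Formatting.Indented spreads it
        // across multiple lines, which is much easier to eyeball in a file.
        var pretty = JToken.Parse(flat).ToString(Formatting.Indented);

        Console.WriteLine(pretty.Contains(Environment.NewLine)); // True
        // File.WriteAllText("results.json", pretty); // then persist it
    }
}
```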
Apr 13, 2019 08:44 AM|yogyogi|LINK
You are using Iron Web Scraper, which costs $399; you can simply create your own web scraper that does the
asynchronous scraping. For asynchronous scraping it uses the jQuery AJAX method. The link for this is given below:
Apr 13, 2019 01:35 PM|mgebhard|LINK
I had a situation where a company did exactly what you're doing: they programmatically logged in to an application that I supported and scraped the screen. I did not find out about it until we started moving from classic ASP to .NET. The updates caused URL
and markup changes. Well, this caused the screen scraper app to fail, and a department of around 20 folks could no longer work. The developer(s) who wrote the application were no longer around. So this company submitted a ticket, and it took me a while
to figure out the problem because obviously their scraped screens were different from the original pages.
IMHO, if you own the application being scraped, then consider building a web service. If you are scraping a 3rd party, let them know. In my situation we had services, so there was no reason for the screen scraping.
May 22, 2019 01:33 PM|Maximili|LINK
Hi. As far as I know, the Scrapy library provides asynchronous methods, and it has its own community. You can find it at
https://scrapy.org/, where you will also find some tools and opinions about what is better to do. I hope that will be helpful for you. Nice day!