
Tuesday 28 April 2015

Scraping a website from a Windows service

Question

Hi there. I have a Windows Forms application that scrapes a website to retrieve some data. I would like to implement the same functionality as a Windows service, so the program can run 24/7 without having a user signed in.

To that end, the current version of the program uses a web browser control (System.Windows.Forms.WebBrowser) to navigate the pages, click the buttons, allow scripts to do their thing, etc. I cannot figure out a way to do the same without the web browser control, but the web browser control cannot be instantiated in a Windows service (because a Windows service has no user interface).

Does anyone have any brilliant ideas on how to get around this?

Thank you very much!


All replies

You are not telling us whether you are using a .NET Express edition or not.

You are not telling us which Framework version.

You are not really saying what data you are getting from the web site.

So

I made an example of a service that works in any Visual Studio edition (including the Express editions).

To install it, I assumed that you have at least Framework 2.0, so you will use something similar to:

    %SystemRoot%\Microsoft.NET\Framework\v2.0.50727\installutil /i C:\Test\MyWindowService\MyWindowService\bin\Release\MyWindowService.exe

In the example, I assumed that you are downloading some file from the site.

The example uses a System.Timers.Timer rather than a Windows.Forms timer: a Forms timer needs a message loop to fire, and a service doesn't have one.


Imports System.ServiceProcess
Imports System.Configuration.Install
Imports System.ComponentModel
Imports System.Timers

Public Class WindowsService : Inherits ServiceBase

  Private Const Minute As Integer = 60000

  ' A System.Timers.Timer fires on a thread-pool thread, so it works in a
  ' service, which has no message loop to drive a Windows.Forms timer.
  Private WithEvents Timer As New Timers.Timer With {.Interval = 30 * Minute, .Enabled = True}

  Public Sub New()
    Me.ServiceName = "MyService"
    Me.EventLog.Log = "Application"
    Me.CanHandlePowerEvent = True
    Me.CanHandleSessionChangeEvent = True
    Me.CanPauseAndContinue = True
    Me.CanShutdown = True
    Me.CanStop = True
  End Sub

  Private Sub Timer_Elapsed(ByVal sender As Object, ByVal e As ElapsedEventArgs) Handles Timer.Elapsed
    ' Replace any earlier download, then fetch the file again.
    If IO.File.Exists("C:\MyPath.Data") Then IO.File.Delete("C:\MyPath.Data")
    My.Computer.Network.DownloadFile("http://MyURL.com", "C:\MyPath.Data", "MyUserName", "MyPassword")
    'Do something with the data downloaded
  End Sub

End Class

<Microsoft.VisualBasic.HideModuleName()> _
Module MainModule

  Public Sub main()
    ServiceBase.Run(New WindowsService)
  End Sub

End Module

<RunInstaller(True)> _
Public Class WindowsServiceInstaller : Inherits Installer

  Public Sub New()
    Dim serviceProcessInstaller As New ServiceProcessInstaller()
    Dim serviceInstaller As New ServiceInstaller()

    serviceProcessInstaller.Account = ServiceAccount.LocalSystem
    serviceProcessInstaller.Username = Nothing
    serviceProcessInstaller.Password = Nothing

    serviceInstaller.DisplayName = "My Windows Service"
    serviceInstaller.StartType = ServiceStartMode.Automatic
    ' Must match the ServiceName set in WindowsService.New. InstallUtil runs
    ' the installer in its own process without calling Main, so the name has
    ' to be hard-coded here rather than read from a variable set at run time.
    serviceInstaller.ServiceName = "MyService"

    Me.Installers.Add(serviceProcessInstaller)
    Me.Installers.Add(serviceInstaller)
  End Sub

End Class
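Once the service is compiled and installed with installutil as above, you can start it from an elevated command prompt (using the service name hard-coded in the example):

    net start MyService

and stop it again later with net stop MyService.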



Hello Andy,

Thanks for your post.

What do you want to scrape from the page? The HttpWebRequest class and the WebClient class may be what you need. For more information, please check:

The HttpWebRequest class provides support for the properties and methods defined in WebRequest and for additional properties and methods that enable the user to interact directly with servers using HTTP.

http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx

The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI.

http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx

If you have any concerns, please feel free to follow up.

Best regards
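As a rough sketch of how those two classes are used (the URL is a placeholder, not something from the thread):

Imports System.IO
Imports System.Net

Module FetchExample

  Sub Main()
    ' WebClient: the one-line way to pull down a page's HTML.
    Dim client As New WebClient()
    Dim html As String = client.DownloadString("http://example.com")

    ' HttpWebRequest: lower level, with access to headers, cookies, timeouts.
    Dim request As HttpWebRequest = CType(WebRequest.Create("http://example.com"), HttpWebRequest)
    Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
      Using reader As New StreamReader(response.GetResponseStream())
        html = reader.ReadToEnd()
      End Using
    End Using
  End Sub

End Module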



Hi Andy,

What is the status of this problem on your side now? If you have any concerns, please feel free to follow up.

Have a nice day.

Best regards



Hi Andy,

When you come back, if you need further assistance with this issue, please feel free to let us know. We will continue to work with you on this issue.

Have a nice day.

Best regards



Thank you for the reply. Sorry it has taken me so long to respond.  I did not receive any notification that someone had replied!

I am using Visual Studio 2010 Ultimate Edition and the .NET Framework 4.0. Actually, I am upgrading some old code written in VB 6.0, but I can use the latest and greatest that's available.

The application uses a browser control to go to the page, fill in values, click on UI elements, read the HTML that comes back, etc. The purpose of the application is to collect useful information regularly and automatically.

I know how to create a Windows service, but using the web browser control in such a service is problematic because the control was meant to be placed on a Windows form. I am not able to create a new instance of it in a project designated as a Windows service.



Andy

Thank you for the reply. Sorry it has taken me so long to respond.  I did not receive any notification that someone had replied!

I thought a web request was for web services (retrieving information from them). I am trying to retrieve useful information from a website designed for interaction by a human, such as selecting items from lists and clicking buttons. I currently use a web browser control to programmatically do what a person would do, and the pages that come back are then parsed.

Andy
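The interaction described above, filling in values and clicking buttons, often reduces to plain HTTP form POSTs at the wire level, which WebClient can send directly. A minimal sketch, assuming a hypothetical search form whose fields are named searchTerm and category:

Imports System.Collections.Specialized
Imports System.Net
Imports System.Text

Module FormPostExample

  Sub Main()
    ' The values a human would type into the form; the field names are hypothetical.
    Dim fields As New NameValueCollection()
    fields.Add("searchTerm", "widgets")
    fields.Add("category", "all")

    ' UploadValues sends the fields as a POST, the HTTP equivalent of clicking the button.
    Dim client As New WebClient()
    Dim responseBytes() As Byte = client.UploadValues("http://example.com/search", fields)
    Dim html As String = Encoding.UTF8.GetString(responseBytes)
    ' Parse html just as you would the page the browser control returns.
  End Sub

End Module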



Hi Andy,

There is a tool which lets you manipulate anything you want on a website: the Html Agility Pack. This agile HTML parser builds a read/write DOM and supports plain XPath or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams). For more information, please check:

http://htmlagilitypack.codeplex.com/

Have a nice day.

Best regards
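A minimal Html Agility Pack sketch in VB.NET, assuming a placeholder URL and an XPath that simply collects every link on the page:

Imports HtmlAgilityPack

Module HapExample

  Sub Main()
    ' Load the page over HTTP and build a queryable DOM from the (possibly malformed) HTML.
    Dim doc As HtmlDocument = New HtmlWeb().Load("http://example.com")

    ' Plain XPath over the document, much like System.Xml but tolerant of real-world HTML.
    ' SelectNodes returns Nothing when no node matches, so guard before looping.
    Dim links = doc.DocumentNode.SelectNodes("//a[@href]")
    If links IsNot Nothing Then
      For Each link As HtmlNode In links
        Console.WriteLine(link.GetAttributeValue("href", ""))
      Next
    End If
  End Sub

End Module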



Thanks for the suggestion. I will go to that link and see if it will work. I will update this post with what I find.



I am writing to check the status of the issue on your side. Would you mind letting us know the result of the suggestions? If you have any concerns, please feel free to follow up.

Have a nice day.

Best regards

Hi Liliane,

Thanks for the follow-up reply. I don't have an answer as of yet. Implementing this is going to take time, and I haven't been given the go-ahead by my boss to spend the time to pursue it.



Hi Andy,

Never mind. You can give it a try whenever you are free. If you have any further questions about this issue, please feel free to let us know. We will continue to work with you on this issue.

Have a nice day.

Best regards

Source: https://social.msdn.microsoft.com/Forums/vstudio/en-US/f5d565b1-236b-43c2-90c7-f5cc3b2c341b/scraping-a-website-from-a-windows-service

Monday 27 April 2015

Data Mining and Market Research

Online market research contributes to the success and growth of many businesses. In simple terms, it is the study of current market conditions, carried out through surveys and through web and data mining modules. Research over the internet matters because it draws on data gathered from internet services, and it is this data that shows how market research keeps a business successful.

Many managers in small businesses believe that online market research is obligatory only for big companies. In truth, businesses of every size, whether small, medium, or large, need online market research, because the process identifies targeted and potential clients. Data mining is then employed to work out exactly what those targeted and potential clients need. Areas where data mining is used:

Preferences. For any given product or service, you learn what a customer is looking for and how your product or service differs from the competition. Data mining lets you determine customer preferences, so you can adjust your products and services to match what customers want.

Buying patterns. Different customers show different purchasing patterns; for example, customers may spend a lot on certain products and little on others. Data mining makes such purchasing patterns easy to understand, so you can plan the appropriate marketing techniques around them.

Prices. Price is a key factor in whether a company sells its products to clients at all, so one has to understand the right selling price for the products. Web scraping makes it easier to find suitable pricing.
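A minimal sketch of that last idea, using the Html Agility Pack discussed in the post above; the URL and the price element's class attribute are invented for illustration:

Imports HtmlAgilityPack

Module PriceScraper

  Sub Main()
    ' Hypothetical product page; adjust the URL and XPath to the real site.
    Dim doc As HtmlDocument = New HtmlWeb().Load("http://example.com/product/123")

    ' Assume the price sits in a <span class="price"> element on the page.
    Dim priceNode As HtmlNode = doc.DocumentNode.SelectSingleNode("//span[@class='price']")
    If priceNode IsNot Nothing Then
      Console.WriteLine("Competitor price: " & priceNode.InnerText.Trim())
    End If
  End Sub

End Module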

Source: http://www.loginworks.com/blogs/web-scraping-blogs/data-mining-market-research/

Wednesday 22 April 2015

SEO No No! Scraping & Splogging – Content Theft!

Until now, you may or may not have known how to do the above. However, this next part is the really cool part.

Now go back to ScrapeBox Add-Ons and download the ScrapeBox Blog Analyzer add-on. Open it up, import the .txt file you just saved, and hit Start.

ScrapeBox goes through every backlink you just scraped and checks each one to determine whether it is a blog that ScrapeBox currently supports commenting on. If it is, it turns green. If it isn't, it turns red. After it has finished, you can "clean" the list by having it remove the unsupported sites.

What you're left with is ALL of the sites your competitor has backlinks from, and most importantly, every one of them can be commented on using ScrapeBox!

Save that "clean" list to a file, import it as the list of blogs you want to comment on, and then follow the same steps you would normally follow to comment on blogs. Within ten minutes you'll have all of your competitor's blog backlinks (which can be filtered by PR if you'd like), and you'll be able to comment on all of them within twenty minutes (because the list most likely won't be huge).

Want to push this even further? Of course you do, you're on BHW.

Every step is the same as above except for one tiny thing and the addition of an extra step.

Instead of using a single footprint in your initial harvest (both in ScrapeBox's regular GUI and then also in the backlink checker add-on), you're going to be using a TON of them. Here is what you do to take this to a whole new level.

First, you're going to harvest all of the URLs from AOL, Yahoo, Ask & Google using this footprint:

site:domainyourcompetingwith.org

That will return ALL of the currently indexed pages on the domain. Remove duplicate URLs and save the list to a .txt file.

Now, you're going to add the following right in front of each of those URLs:

link:
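If the saved list is long, that prepend step can be scripted rather than done by hand; a small sketch, assuming the harvested URLs sit one per line in C:\urls.txt:

Imports System.IO
Imports System.Linq

Module PrependLink

  Sub Main()
    ' Read the harvested URLs and write them back out with "link:" in front of each.
    Dim urls() As String = File.ReadAllLines("C:\urls.txt")
    File.WriteAllLines("C:\urls-link.txt", urls.Select(Function(u) "link:" & u).ToArray())
  End Sub

End Module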

Now follow all of the steps outlined above. What this does is fetch the backlinks to every single page of the competitor's site.

Because the Yahoo backlink checker is only capable of getting the first 1k URLs from Yahoo (as that's all Yahoo allows you to view), you could have missed a decent number of backlinks if they were past the first 1k results. Doing the extra steps above means that every new page of the site you check for backlinks yields a fresh, different list of backlinks that is potentially 1k links long.

Now that you know how to find, filter, and take your competitors' backlinks, stop reading and go take action!

Source: https://freescrapeboxlist19.wordpress.com/

Monday 20 April 2015

What are the ethics of web scraping?

Someone recently asked: "Is web scraping an ethical concept?" I believe that web scraping is absolutely an ethical concept. Web scraping (or screen scraping) is a mechanism to have a computer read a website. There is absolutely no technical difference between an automated computer viewing a website and a human-driven computer viewing a website. Furthermore, if done correctly, scraping can provide many benefits to all involved.

There are a bunch of great uses for web scraping. First, services like Instapaper, which allow saving content for reading on the go, use screen scraping to save a copy of the website to your phone. Second, Mint.com, an app which tells you where and how you are spending your money, uses screen scraping to access your bank's website (all with your permission). This is useful because banks do not provide many ways for programmers to access your financial data, even if you want them to. By getting access to your data, programmers can provide really interesting visualizations and insight into your spending habits, which can help you save money.

That said, web scraping can veer into unethical territory. This can take the form of reading websites much more quickly than a human could, which servers may struggle to handle and which can degrade the website's performance. Malicious hackers use this tactic in what's known as a "Denial of Service" attack.

Another aspect of unethical web scraping comes in what you do with that data. Some people will scrape the contents of a website and post it as their own, in effect stealing this content. This is a big no-no for the same reasons that taking someone else's book and putting your name on it is a bad idea. Intellectual property, copyright and trademark laws still apply on the internet and your legal recourse is much the same. People engaging in web scraping should make every effort to comply with the stated terms of service for a website. Even when in compliance with those terms, you should take special care in ensuring your activity doesn't affect other users of a website.
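In practice, that "special care" can be as simple as pausing between requests and identifying your bot. A minimal sketch (the URLs, user-agent string, and one-second delay are arbitrary illustrations):

Imports System.Net
Imports System.Threading

Module PoliteScraper

  Sub Main()
    Dim urls() As String = {"http://example.com/page1", "http://example.com/page2"}

    Dim client As New WebClient()
    ' Identify the scraper so the site's owner can contact or block it.
    client.Headers(HttpRequestHeader.UserAgent) = "MyScraper/1.0 (+mailto:me@example.com)"

    For Each url As String In urls
      Dim html As String = client.DownloadString(url)
      ' ... parse and store html ...
      Thread.Sleep(1000) ' pause so the scraper reads no faster than a human would
    Next
  End Sub

End Module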

One of the downsides to screen scraping is it can be a brittle process. Minor changes to the backing website can often leave a scraper completely broken. Herein lies the mechanism for prevention: making changes to the structure of the code of your website can wreak havoc on a screen scraper's ability to extract information. Periodically making changes that are invisible to the user but affect the content of the code being returned is the most effective mechanism to thwart screen scrapers. That said, this is only a set-back. Authors of screen scrapers can always update them and, as there is no technical difference between a computer-backed browser and a human-backed browser, there's no way to 100% prevent access.

Going forward, I expect screen scraping to increase. One of the main reasons for screen scraping is that the underlying website doesn't have a way for programmers to get access to the data they want. As the number of programmers (and the need for programmers) increases over time, so too will the need for data sources. It is unreasonable to expect every company to dedicate the resources to build a programmer-friendly access point. Screen scraping puts the onus of data extraction on the programmer, not the company with the data, which can work out well for all involved.

Source: https://quickleft.com/blog/is-web-scraping-ethical/