June 25, 2018

Data is everywhere. Newspapers, TV, and the Internet are full of text and images, but most of that data is hard to get at. The Web is the exception, because its content is already digital. This information can be reused and transformed.

Scraping website text: a big digital newspaper


Let's imagine there is a very popular online newspaper, and our manager asks us to find the contacts published in its last 200 issues. How would you prefer to search for this information? Manually? Okay. How many days would you need to finish the task? I would guess 2-3 weeks. But the boss won't be happy to wait that long for the result.
The other way is to download all 200 editions automatically using special bots, and then look for the data in the archive on your PC. Not manually, of course, but with special software.
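
The second step, searching the downloaded archive for contacts, can be sketched in Python. This is only a minimal illustration under stated assumptions: the issues are already saved as plain text, "contacts" means email addresses, and the names EMAIL_RE, find_emails, find_emails_in_issues and the sample issue texts are all hypothetical. The pattern is a simple matcher, not a full email validator.

```python
import re

# Assumption: contacts are email addresses. This is a deliberately simple
# pattern for illustration, not a complete email validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return the unique email addresses in text, in order of first appearance."""
    found = []
    for address in EMAIL_RE.findall(text):
        if address not in found:
            found.append(address)
    return found


def find_emails_in_issues(issues):
    """Map each issue name to the list of email addresses found in its text."""
    return {name: find_emails(text) for name, text in issues.items()}


# Hypothetical stand-in for the archive of downloaded issues.
sample_issues = {
    "issue_001.txt": "Write to editor@example.com or ads@example.com.",
    "issue_002.txt": "Contact: editor@example.com. No other contacts.",
}

contacts = find_emails_in_issues(sample_issues)
for name, emails in contacts.items():
    print(name, emails)
```

With a real archive, the dictionary would be filled by reading the saved files from disk instead of using the sample strings.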

Open or Closed door?

I am sure you have heard a lot about open data. But not everybody understands what it is and how we can use it.
The global Web monster has become so huge because we keep feeding it. In the last 5 years we have created 10 billion gigabytes of new information. A large share of that data sits in databases. Of course, many people keep their private life on the Internet to themselves. But at the same time, tons of public datasets can be found for our business.


Public datasets we can find now:

  • Government data
  • Election statistics
  • Weather
  • OpenStreetMap
  • Wikipedia

Why is closed content not free to use?

The law recognizes that anyone who wants to protect their rights can go to court to defend them. And the judge may decide that you have to pay a fine.

So if I want to be a law-abiding person, I need to follow 3 rules:

  • Ask the owner for permission to use the article
  • Write to the author that you have published the material and can delete it if the author objects
  • Publish the data with the author's copyright notice

The last item is the best. But be careful with the copyright document. There is usually a link to it at the bottom of the main page. Read it before you do anything with the information.

A lot of people do not know whether it is okay to copy and paste data. The common rule is that information can be used for free only for personal or educational purposes. But what if we just get the data and do not share it? What if we just save an Excel file with all the database records?
Everything depends on trade secrets. If you break into a website, take its database, and sell it, that is bad for you. Wait for the police. But if you scrape data that is open to all visitors of the website, you do not break any regulation.

What information can bots scrape?


We all understand that the Internet has changed a lot in the last 7 years. Now there is different content for every person. Whether you are mad or happy, crying or celebrating an anniversary, the Web will not let you get bored. We can find tons of YouTube videos and cute GIFs. Instagram images can save us from a bad mood.

Python bots can scrape the following information:

  • Online book texts
  • Instagram images
  • GIFs from Giphy
  • YouTube videos
  • Facebook or Twitter posts
  • Amazon goods
  • Wikipedia articles
  • Yellow Pages contacts
  • Sports match scores
  • etc.

I could continue the list up to 500 items. The main problem is that every website is like a unique car. It can be protected from scraping, and the developer may not find the right keys. Still, a good, experienced engineer can get the data from almost any webpage.
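
To give a rough idea of what such a bot does with a page once it has it, here is a minimal sketch that pulls the visible text out of HTML using only Python's standard library. The class TextExtractor, the function extract_text, and the sample page are hypothetical names made up for this illustration. Real websites are messier and often need a dedicated parsing library and per-site rules.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page, standing in for HTML that a bot has already downloaded.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Match scores</h1><p>Home 2, Away 1</p>"
    "<script>track()</script></body></html>"
)

page_text = extract_text(sample_html)
print(page_text)
```

The same idea applies to the other items in the list: download the page, then keep only the parts you need.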

