Introduction to Data Science

Introduction 

Data science has generated a lot of buzz over the past few years. The term raises many questions: Why does this field create so much buzz? What kind of data does data science need? What are its important aspects? What are its applications? What techniques are available to solve data science problems? How can someone get into data science? Let’s go through these questions one by one.

What is Data Science?

Let’s go back to 1990, when the World Wide Web was just emerging. Over the past 25 years, people have gradually adopted this powerful invention and kept making it better.

Data volumes are exploding: more data has been created in the past two years than in the entire previous history of the human race [1].

Nowadays the World Wide Web is a major source of data. People all over the world use the web every day, and this usage generates lots and lots of data. According to an EMC report [2], the digital universe held 4.4 zettabytes (ZB) of data in 2013. This includes historical data as well as real-time data. Every day we interact with data, whether through social media, web search, news, blogs, videos, images, or documents.

We now have more than enough data to extract knowledge from. By analysing the data with proper scientific techniques, we can find hidden patterns or facts that help us answer questions that previously could not be answered.

“This scientific way of analysing data, or extracting knowledge from data, is called data science.”

OR

“Data science is all about making sense of data, or extracting knowledge from data using data science techniques.”

What kind of Data is needed for Data science?

There are three kinds of data available on the web:

Structured data

  • This kind of data is highly organised and is stored in tables.
  • A data model explicitly determines the structure of the data.
  • This kind of data has relational keys and is stored in relational databases.
  • Examples: student information databases, employee information databases, etc.
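To make the idea of tables and relational keys concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table names, column names, and sample rows are made up for illustration; they are not taken from any real student database.

```python
import sqlite3


def build_student_db():
    """Create an in-memory relational database with two linked tables.

    The schema and rows are hypothetical, chosen only for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    # grades.student_id is the relational key that points back to students.id
    conn.execute(
        "CREATE TABLE grades ("
        "student_id INTEGER REFERENCES students(id), "
        "course TEXT NOT NULL, "
        "score REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO students VALUES (?, ?)",
        [(1, "Asha"), (2, "Ravi")],
    )
    conn.executemany(
        "INSERT INTO grades VALUES (?, ?, ?)",
        [(1, "Statistics", 82.0), (1, "Programming", 91.0), (2, "Statistics", 74.0)],
    )
    conn.commit()
    return conn


def average_scores(conn):
    """Join the two tables on the relational key and average each student's scores."""
    cursor = conn.execute(
        "SELECT s.name, AVG(g.score) FROM students s "
        "JOIN grades g ON g.student_id = s.id "
        "GROUP BY s.id ORDER BY s.name"
    )
    return dict(cursor.fetchall())


if __name__ == "__main__":
    print(average_scores(build_student_db()))
```

Because the structure is fixed up front, a single JOIN is enough to combine the two tables. That is the main convenience of structured data.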

Semi-structured data

  • Semi-structured data is a form of structured data, but it does not follow the strict tabular structure of relational databases.
  • It contains tags, other markers, or key-value pairs to separate semantic elements and enforce hierarchies of records and fields within the data. It is therefore also known as a self-describing structure. [3]
  • Examples: XML, JSON, CSV
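As a rough illustration of what “self-describing” means, the sketch below reads the same made-up record from JSON, XML, and CSV text using only Python’s standard library. The record, the field names, and the choice of “;” to separate skills in the CSV cell are assumptions made for this example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in three semi-structured formats.
JSON_TEXT = '{"name": "Asha", "skills": ["Python", "R"]}'
XML_TEXT = (
    "<person><name>Asha</name>"
    "<skills><skill>Python</skill><skill>R</skill></skills></person>"
)
CSV_TEXT = "name,skills\nAsha,Python;R\n"


def from_json(text):
    """Keys in the JSON object name each field."""
    record = json.loads(text)
    return record["name"], record["skills"]


def from_xml(text):
    """Tags mark where each field starts and ends, and nest to form a hierarchy."""
    root = ET.fromstring(text)
    return root.findtext("name"), [skill.text for skill in root.iter("skill")]


def from_csv(text):
    """The header row names the columns; the list is packed into one cell."""
    row = next(csv.DictReader(io.StringIO(text)))
    return row["name"], row["skills"].split(";")


if __name__ == "__main__":
    for parse, text in ((from_json, JSON_TEXT), (from_xml, XML_TEXT), (from_csv, CSV_TEXT)):
        print(parse.__name__, parse(text))
```

Note that JSON and XML can express the nested list of skills directly, while the flat CSV row needs an extra convention to hold it.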

Unstructured data

  • Most of the data we find on the web does not follow any structure.
  • This kind of data does not fit neatly into traditional relational databases.
  • Examples: satellite images, scientific data, photos, videos, radar data, mobile data, text, web content, social media, etc.
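Unstructured data usually has to be given some structure before it can be analysed. As a small sketch of that idea, the snippet below turns a made-up piece of free text into word counts. The sample sentence and the simple “letters and apostrophes” tokenising rule are assumptions for illustration, not a recommended text-processing pipeline.

```python
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample text standing in for a blog post or a tweet.
sample = "Data science is fun. Data is everywhere, and data keeps growing!"
top = word_counts(sample).most_common(2)

if __name__ == "__main__":
    print(top)  # [('data', 3), ('is', 2)]
```

The result is a small table of words and frequencies, which is something statistical and machine learning techniques can work with.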

Data science problems mostly involve semi-structured and unstructured data sets. Only a small set of applications rely on structured data.

Why does the data science field create a lot of buzz?

In the current era we have lots of data, cheap but efficient hardware, and tools and techniques that have emerged in the last few years to answer questions that previously could not be answered. These are the factors creating the buzz around data science.

Aspects of Data Science

Data science is an umbrella term: the field contains many other fields.

Data science includes statistics, programming, machine learning, natural language processing (NLP), text mining, visualisation, big data, data ingestion, data munging, and tools for data science.

Figure: Data Science Clock [C.1]

Data science techniques

Data science techniques mainly include statistics, machine learning, and deep learning, which are used to solve problems such as speech recognition, image recognition, and various NLP applications.
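To give a feel for the statistics side, here is a minimal sketch of one of the simplest techniques: fitting a straight line to data with ordinary least squares and using it to predict a new value. The “hours studied” and “exam score” numbers are invented for illustration, and the helper names fit_line and predict are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one variable: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(x, slope, intercept):
    """Use the fitted line to estimate y for a new x."""
    return slope * x + intercept


# Made-up data: hours studied vs. exam score (exactly score = 2 * hours + 50).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]
slope, intercept = fit_line(hours, scores)

if __name__ == "__main__":
    print(slope, intercept, predict(6, slope, intercept))
```

Machine learning and deep learning follow the same basic pattern of learning parameters from examples and then predicting on new inputs, only with far more flexible models.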

Data science tool kit

Those coming from a technical background can use the following tools:

  • A scripting language for rapid prototyping (Scala or Python)
  • R – statistical programming language
  • Hadoop framework
  • Spark
  • Deep learning libraries such as TensorFlow, Torch, and Deeplearning4j
  • Node.js
  • Social media libraries
  • Basic machine learning libraries

Those coming from a non-technical background can use the following tools [4]:

  • RapidMiner
  • DataRobot
  • BigML
  • Google Cloud Prediction API
  • H2O
  • Weka

Applications

  • Internet search – ranking algorithms
  • Digital advertising – statistical techniques are heavily used
  • Recommender systems – machine learning techniques are mostly used
  • Image recognition – deep neural networks / deep and wide neural networks
  • Speech recognition – deep neural networks / deep and wide neural networks / linguistic techniques
  • Gaming – machine learning / deep neural networks / deep and wide neural networks
  • Credit risk modelling – statistics and machine learning
  • Fraud detection – statistics, machine learning, and graph theory
  • Social media intelligence – NLP, sentiment analysis, influence detection, etc.
  • Intelligent chatbots – statistics, machine learning, NLP, and deep learning
  • Self-driving cars – rule-based systems
  • Robots – under research

Starting with the next post, I am going to begin a tutorial series for data science beginners.

This tutorial series includes

  • Ubuntu for beginners
  • Tool kit list and installation guide
  • Regular expression guide
  • Data scraping
  • Data cleaning / pre-processing
  • Basics of statistics
  • Basics of machine learning techniques
  • Applying machine learning techniques to pre-processed data
  • Basics of Deep learning

References:

[1] http://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#4460ae776c1d

[2] http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm

[3] https://en.wikipedia.org/wiki/Semi-structured_data

[4] https://www.analyticsvidhya.com/blog/2016/05/19-data-science-tools-for-people-dont-understand-coding/

Copyrights:

[C.1] The Data Science Clock by Jamie Whitehorn is licensed under a Creative Commons Attribution 4.0 International License. See the Data Science Clock.
Permissions beyond the scope of this license may be available at http://www.exploringdatascience.com/about/copyright/.
