A Long Rant About Data Science
The definition of data science is fairly broad, and there is no real consensus in the academic community; it is still in the process of being defined as an academic field.
The central tenets, concepts, knowledge, skills, and ethics powering this emerging discipline remain points of active discussion and continue to evolve.
However, data science has emerged from the multidisciplinary principles of mathematics, statistics, information technology, operations management, business analytics, computer science, and economics. It relates to the
- collection,
- organization,
- storage,
- analysis,
- inference,
- presentation,
- communication, and
- ethics appropriate for this new data-driven era,
using computational tools, mathematical methods and algorithms to produce meaningful results, predictions, recommendations, descriptions and explanations about the data being examined.
With billions of autonomous sensors and devices continually delivering data to cloud-based databases, recording the states and activities of vehicles, buildings, customers, patients, and citizens, there is more data in the world today than ever before.
What does a data scientist do?
Some current areas of focus for data scientists include the following:
Computing hardware and software for data science.
Data scientists who manage the platforms on which data science models are created focus on understanding and maintaining a computing environment that meets the demand for big data, fast (sometimes real-time or near real-time) model generation, and data interrogation - up to and including the demands of real-time data collection (i.e. streaming) and complex data visualizations. A significant challenge of this job is remaining current on the latest computing hardware and software.
These data scientists create environments for data science modelers and analysts that can be used across a range of computing platforms. This requires that they understand the changing programming languages used for data science, the supporting libraries, and many types of data storage systems, as well as how to keep all these components operational and secure.
This role requires training in database maintenance, security, programming, hardware, and operating systems.
Data storage and access.
Data scientists who focus on managing data storage solutions, as well as extracting, transforming, and loading data for modelling, should have the ability to manage large datasets from a variety of heterogeneous data sources, in batch or streaming form, and to assess the predictive value of these data sources. Strong knowledge of both databases and streamed analytical processing is key to this role. These data scientists need to understand the data science workflow, document data quality problems, and select appropriate methods of interpolation - even, in some cases, creating models to clean the data and reduce errors that would hurt downstream model performance. Some domain knowledge is likely needed.
Statistical modeling and machine learning
Experts in statistical modelling and machine learning interface with stakeholders to capture requirements and develop the scope of work for data science projects, undertake the data science analysis cycle, and typically bridge the gaps among more narrowly focused data science roles. Written and oral communication skills are essential for this position, as is coordinating teams. Often these data scientists require considerable domain expertise in the field for which the data science models are being developed.
For example, an individual developing a model for clinical trial analysis for drug development would need to have significant understanding of pharmacology and clinical data collection. It is a broad and complex position and requires significant training.
Data Visualization
Ideally, data visualization experts combine development and design skills with the ability to understand the meaning of the analyses. These data scientists are adept at visual storytelling with data. They can examine large datasets and create clear, efficient, compelling online layouts, images, dashboards, and interactive features that can stand on their own or complement narrative text. At their core, they are effective translators between technical and statistical specialists and superior communicators with multiple non-technical audiences. They are well versed in the key elements of graphical displays as well as the pitfalls of misrepresenting data and results.
These data scientists combine knowledge of statistical analysis tools, libraries, and frameworks to complement a foundation in computational, statistical, and data management methods. They understand APIs: how to parse them and, ideally, how to build them. They are closely aligned with the data management functions performed by others on a team.
Business Analysts
These data scientists are involved in making sense of and communicating about data without necessarily relying on programming skills. These jobs are built around assembling and presenting data to inform a decision-making process, and they are common in many business areas.
There are many types of data scientists today, and their roles will continue to change and expand in the future. Beyond the differences among them, there is considerable variance in the level of knowledge and skills that some data science jobs require. There are also commonalities among the varied types of data scientists. All data scientists need to learn how to tackle questions with real data.
An effective data science workflow involves formulating good questions, considering whether the available data are appropriate for addressing a problem, choosing from a set of different tools, undertaking analyses in a reproducible manner, assessing analytic methods, drawing appropriate conclusions, and communicating results.
How I ended up in data science.
I cannot point to a specific time when I began practising data science. However, I have loved solving problems and puzzles since early childhood. The point at which I would say I began data science was in high school, learning basic statistics, and then more so after I graduated from high school and taught myself how to program in Python.
After picking up some basic skills, I found myself interested in hacking. With further self-education, I realized that hacking generally involves gathering information about a target, performing some analysis, and then using the knowledge gathered to prepare an attack strategy.
This led me to Linux, an operating system that was very handy for gathering information. I was now exposing myself to networked environments and dealing with small amounts of network data.
College
I realized that this required a more in-depth understanding of networks, so I enrolled for a bachelor's degree in mathematics and computer science at the Jomo Kenyatta University of Agriculture and Technology.
It was here that I was formally introduced to mathematical and computational thinking which opened my eyes to a whole new world.
Computer Science is generally about problem solving using computational tools. I gained an understanding and appreciation of computational complexity in algorithms, data structures, data mining, data storage and retrieval techniques and their trade-offs.
A rigorous schooling in mathematics, especially statistics, linear algebra, calculus, and analysis, further widened my horizons and my perception of data, computation, and analytics.
I was now able to handle large, data-intensive, complex problems, and to develop programs and software that perform advanced scientific computing and apply machine learning algorithms to a wide variety of problems.
Learning on My Own
With a formal introduction behind me, I was able to take my skills a notch higher on my own. I would collect real-world data, apply the skills and tools I had learned, and develop something useful from it: predicting sports outcomes, stock prices, and house prices, or categorizing flowers. I would also generate visualizations as charts, maps, graphs, and simulations from the datasets.
To this day I am still learning and sharpening my skills with emerging technologies and ideas from academic research papers, and the process never ends, as there is no limit.
So do you need a math or computer science degree to become a data scientist?
You can become a data analyst without a college degree. In my experience of over five years, I have witnessed many people with a master's degree fail to make a successful career in Data Science, whereas some people without a specialised degree in data science have succeeded.
In this modern era, if you have the right skills, determination, and passion, the lack of a degree will not be a barrier to your dream.
That being said, a degree would not hurt, just to be safe.
Skills
I would say that if you have the right skills, or the will to develop them, and a passion for Data Science, you can become a Data Scientist.
Here are a few of the skills that are essential to become a Data Scientist.
Technical Skills
- Math (e.g. linear algebra, calculus and probability)
- Statistics (e.g. hypothesis testing and summary statistics)
- Machine learning tools and techniques (e.g. k-nearest neighbors, random forests, ensemble methods, etc.; see the sketch after this list)
- Software engineering skills (e.g. distributed computing, algorithms and data structures)
- Data mining
- Data cleaning and munging
- Data visualization (e.g. ggplot and d3.js) and reporting techniques
- Unstructured data techniques
- R and/or SAS languages
- SQL databases and database querying languages
- Python (most common), C/C++, Java, Perl
- Big data platforms like Hadoop, Hive & Pig
- Cloud tools like Amazon S3
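To make one of the listed techniques concrete, here is a minimal sketch (assuming Python with scikit-learn installed) of training a k-nearest neighbors classifier on the bundled iris dataset; it is purely illustrative, not a recipe.

```python
# Minimal sketch: a k-nearest neighbors classifier on the iris dataset,
# assuming scikit-learn is installed. Purely illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = KNeighborsClassifier(n_neighbors=5)  # classify by the 5 closest points
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```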
Business Skills
- Analytic Problem-Solving: Approaching high-level challenges with a clear eye on what is important; employing the right approach/methods to make the maximum use of time and human resources.
- Effective Communication: Detailing your techniques and discoveries to technical and non-technical audiences in a language they can understand.
- Intellectual Curiosity: Exploring new territories and finding creative and unusual ways to solve problems.
- Industry Knowledge: Understanding the way your chosen industry functions and how data are collected, analyzed and utilized.
Actions to Take
If you are considering getting into data science and becoming a data scientist, here are a few recommendations based on industry knowledge and personal experience:
Acquire Basic Skills
One can start from the basics, i.e. the normal distribution, the central limit theorem, and hypothesis testing, and then move on to more advanced techniques such as linear regression, logistic regression, decision trees, cluster analysis, generalized additive models, etc. A recommended book for this would be The Elements of Statistical Learning (by Hastie, Tibshirani and Friedman).
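As a concrete illustration of one of those basics, here is a minimal sketch of a two-sample t-test in Python (assuming NumPy and SciPy are installed; the numbers are simulated purely for illustration):

```python
# Minimal sketch: a two-sample t-test on simulated data (made-up numbers,
# purely to illustrate the hypothesis-testing workflow).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100.0, scale=15.0, size=50)  # e.g. a control group
group_b = rng.normal(loc=108.0, scale=15.0, size=50)  # e.g. a treatment group

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the group means appear to differ.")
else:
    print("Fail to reject the null hypothesis.")
```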
An aspiring data scientist is expected to have some familiarity with the various statistical and machine learning methodologies used in the industry.
Apart from the standard textbooks, an alternative but effective way of learning is to take MOOCs. There are many free statistics and data mining courses available via Coursera, edX, MIT OpenCourseWare, Stanford Online, NPTEL, etc.
Learn the Tools of the Trade
As far as tools in the analytics industry are concerned, SAS and SPSS used to be popular before the open-source revolution took the industry by storm. Open-source tools like R and Python are the next big thing, and it makes sense to invest time in them.
There are enough freely available resources on the web to learn both R and Python. People with coding skills in object-oriented languages like Java will find Python intuitive. But R is the best tool (personal opinion) when it comes to statistical modeling, and it is also the preferred tool in academia.
For an absolute beginner, the introductory course in R at Learn R, Python & Data Science Online | DataCamp can be a starting point. But the best way to learn these tools is by doing, so I would suggest replicating the code that is available and testing it on some dummy datasets to understand what's going on. Also, a working knowledge of SQL along with advanced MS Excel/VBA skills can act as a differentiator in interviews.
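As a small sketch of that SQL working knowledge from Python (assuming pandas is installed; the table and its contents are dummy data I made up):

```python
# Minimal sketch: build a tiny in-memory SQLite table with dummy data and
# query it into a pandas DataFrame.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("East", 120.0), ("East", 80.0), ("West", 200.0), ("North", 50.0)],
)

df = pd.read_sql_query(
    "SELECT region, SUM(amount) AS total_sales "
    "FROM orders GROUP BY region ORDER BY total_sales DESC",
    conn,
)
print(df)  # aggregated results land in a DataFrame for further analysis
conn.close()
```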
Since data science is not only about technical mumbo jumbo, it really helps to understand its business applications and to be aware of various successful use cases.
This will help one see the bigger picture and become well equipped to judge what kind of methodology fits a particular business problem.
For example, how market basket analysis is used for product bundling by retailers, how cluster analysis can be used for customer segmentation ahead of a new product launch, and how logistic regression can be used for fraud detection in the banking and insurance sectors.
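As a rough sketch of the customer segmentation use case (assuming scikit-learn is installed; the customer features and numbers are made up for illustration), cluster analysis with k-means might look like this:

```python
# Rough sketch: customer segmentation with k-means on made-up features
# (annual spend and visit frequency). Purely illustrative numbers.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
spend = np.concatenate([rng.normal(200, 30, 40), rng.normal(800, 100, 40)])
visits = np.concatenate([rng.normal(2, 0.5, 40), rng.normal(10, 2, 40)])
customers = np.column_stack([spend, visits])

scaled = StandardScaler().fit_transform(customers)  # put features on one scale
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print("customers per segment:", np.bincount(segments))
```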
Practice, Go out there
Last but not least: practice, practice, and practice. One way to do this is by participating in the data science competitions hosted on sites like kaggle.com. Even Analytics Vidhya hosts data science competitions.
I would suggest going through some of the past competitions on Kaggle and replicating some of the scripts to understand the modus operandi. The level of competition on Kaggle is high, and one can learn how to handle challenging datasets and come up with solutions.
Also, the discussion on the forums with like-minded data science enthusiasts can be helpful.
Put your work out there and see what others are doing on GitHub, Bitbucket, GitLab, and the like. Fear not.