Data Science lifecycle
43.8. Data Science lifecycle#
43.8.1. 1. What is the Data Science lifecycle#
The Data Science lifecycle is a process that can be broken down into five stages:
Capturing
Processing
Analysis
Communication
Maintenance
1. What is data science? (n.d.). UCB-UMT. Retrieved 19 February 2023, from https://ischoolonline.berkeley.edu/data-science/what-is-data-science/
43.8.2. 2. Capturing#
The capturing stage combines two activities: acquiring the data and defining the purpose and problems that need to be addressed.
Defining the goals of the project requires deeper context about the problem or question.
First, we need to identify and engage the people who need their problem solved. These may be stakeholders in a business or sponsors of the project, who can help identify who or what will benefit from the project, as well as what they need and why they need it. A well-defined goal should be measurable and quantifiable, so that an acceptable result can be defined.
Questions a data scientist may ask:
Has this problem been approached before? What was discovered?
Are the purpose and goal understood by all involved?
etc.
Questions a data scientist may ask about the data:
What data is already available to me?
Who owns this data?
etc.
43.8.3. 3. Processing#
The processing stage of the lifecycle focuses on discovering patterns in the data as well as modeling.
Classification: organizing data into categories for more efficient use.
Clustering: grouping data into similar groups.
Regression: determining the relationships between variables to predict or forecast values.
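As an illustration of the regression item above, the following is a minimal sketch of fitting a straight line by ordinary least squares in plain Python. The function name `fit_line` and the hours/scores data are made up for this example; a real project would more likely use a dedicated library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x with y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # forecast the score for 6 hours
print(slope, intercept, predicted)
```

The fitted slope describes the relationship between the two variables, and the last line uses it to forecast a value that was not in the data.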
43.8.4. 4. Maintaining#
In the diagram of the lifecycle, you may have noticed that maintenance sits between capturing and processing.
Maintenance is an ongoing process of managing, storing, and securing the data throughout the process of a project and should be taken into consideration throughout the entirety of the project.
43.8.4.1. Storing data#
How and where the data is stored influences both the cost of storage and how quickly the data can be accessed. Things to consider include:
On premise vs off premise vs public or private cloud
Cold vs hot data
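The hot versus cold distinction can be sketched as a simple rule on how recently data was accessed: hot data is used frequently, cold data rarely. The 90-day threshold, the function name `storage_tier`, and the dataset names below are assumptions made for illustration, not a standard.

```python
from datetime import date, timedelta

# Hypothetical threshold: data untouched for 90+ days is treated as cold.
COLD_AFTER = timedelta(days=90)


def storage_tier(last_accessed, today):
    """Return "hot" for recently used data and "cold" otherwise."""
    return "cold" if today - last_accessed >= COLD_AFTER else "hot"


today = date(2023, 2, 19)
# Made-up datasets with the date each was last accessed.
datasets = {
    "daily_sales": date(2023, 2, 18),
    "2019_survey_archive": date(2021, 6, 1),
}
tiers = {name: storage_tier(last, today) for name, last in datasets.items()}
print(tiers)
```

A rule like this could guide which datasets stay on fast, more expensive storage and which move to cheaper, slower storage.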
43.8.4.2. Managing data#
As you work with data, you may discover that some of it needs to be cleaned, using the techniques covered in the section on data preparation, before accurate models can be built.
When new data arrives, the same cleaning steps need to be applied to it to keep the quality consistent.
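One way to keep quality consistent when new data arrives is to route every batch through a single cleaning function. The sketch below assumes records with hypothetical `name` and `age` fields; the cleaning rules themselves are examples only.

```python
def clean_record(record):
    """Apply the same cleaning rules to any record, old or new."""
    # Normalize whitespace and capitalization in the name.
    name = (record.get("name") or "").strip().title()
    try:
        age = int(record.get("age"))
    except (TypeError, ValueError):
        age = None  # keep the gap visible instead of guessing a value
    return {"name": name, "age": age}


# Made-up records: one batch from the start of the project, one arriving later.
initial_batch = [{"name": "  ada lovelace ", "age": "36"}]
new_batch = [{"name": "GRACE HOPPER", "age": "unknown"}]

cleaned = [clean_record(r) for r in initial_batch + new_batch]
print(cleaned)
```

Because both batches pass through `clean_record`, later data ends up in the same shape as the data the project started with.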
43.8.4.3. Securing the data#
One of the main goals of securing data is ensuring that those working on it are in control of what is collected and in what context it is being used.
Confirm that all data is encrypted.
Provide customers with information on how their data is used.
Remove data access from those who have left the project.
Let only certain project members alter the data.
etc.
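The access-related points in the list above can be sketched as a small role check. The role names, member names, and helper functions here are hypothetical; a real project would rely on the access controls of its storage platform rather than on application code like this.

```python
# Hypothetical roles and members, for illustration only.
PERMISSIONS = {
    "editor": {"read", "write"},
    "viewer": {"read"},
}

project_members = {"maria": "editor", "sam": "viewer", "lee": "viewer"}


def can(member, action):
    """Check whether a current project member may perform an action."""
    role = project_members.get(member)
    return role is not None and action in PERMISSIONS.get(role, set())


def remove_member(member):
    """Revoke all access when someone leaves the project."""
    project_members.pop(member, None)


remove_member("lee")  # lee has left the project
print(can("maria", "write"), can("sam", "write"), can("lee", "read"))
```

Only members with the editor role can alter the data, and anyone removed from `project_members` loses read access as well.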
43.8.5. 5. Analyzing#
The analyzing stage of the lifecycle confirms that the data can answer the proposed questions or solve a particular problem. It can involve:
Exploratory data analysis
Data profiling, descriptive statistics, and Pandas
Sampling and querying
Exploring with visualizations
Exploring to identify inconsistencies
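A first pass over these activities can be sketched with Python's standard library alone. The temperature readings and the cutoff of 10 units are made up for this example; in a pandas workflow, `DataFrame.describe()` gives a comparable summary.

```python
import random
import statistics

# Made-up measurements; None marks a missing value.
temperatures = [21.5, 22.0, None, 21.8, 95.0, 22.3, 21.9]

# Profile the data: how much is present, and basic descriptive statistics.
present = [t for t in temperatures if t is not None]
profile = {
    "count": len(present),
    "missing": len(temperatures) - len(present),
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "min": min(present),
    "max": max(present),
}

# Flag values far from the median as possible inconsistencies.
# The cutoff of 10 units is an arbitrary assumption for this sketch.
suspicious = [t for t in present if abs(t - profile["median"]) > 10]

# Draw a reproducible random sample for closer inspection.
rng = random.Random(0)
sample = rng.sample(present, k=3)

print(profile)
print(suspicious)
```

The profile shows one missing reading, and the outlier check flags a value that is probably a data-entry or sensor error and worth investigating before modeling.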
43.8.6. 6. Communication#
To communicate is to convey or exchange information.
Information can be ideas, thoughts, feelings, messages, covert signals, data – anything that a sender (someone sending information) wants a receiver (someone receiving information) to understand.
Data communication & storytelling
Types of communication
One-way communication
Two-way communication
43.8.6.1. Effective communication#
Your responsibilities as a communicator
Understand your audience, your channel & your communication method
Begin with the end in mind
Approach it like an actual story
Use meaningful words & phrases
Use emotion