0:08

A typical data science project will be structured in a few different phases

that I'll talk about separately in this lecture.

So there's roughly five different phases that we can think about

in a data science project.

The first phase is the most important phase,

and that's the phase where you ask the question and

you specify what is it that you're interested in learning from data.

Now, specifying the question and kind of refining it over time is really

important because it will ultimately guide the data that you obtain and

the type of analysis that you do.

Part of specifying the question is also determining the type of question that

you are gonna be asking.

There are roughly six types of questions that you can

ask, going from descriptive, to exploratory, to inferential,

to causal, to predictive, and mechanistic.

And so figuring out what type of question you're asking and

what exactly the question is, is really influential.

And so you should spend a lot of time thinking about this.

Once you've kind of figured out what your question is,

typically you'll get some data.

Now, either you'll have the data or you'll have to go out and get it somewhere or

maybe someone will provide it to you, but the data will come to you.

And then the next phase will be exploratory data analysis.

So this is the second part, there are two main goals to exploratory data analysis.

The first is you wanna know if the data that you have is suitable for

answering the question that you have.

This will depend on a variety of factors, ranging from very basic things

like is there enough data or are there too many missing values,

to more fundamental ones, like are you missing certain variables or

do you need to collect more data to get those variables.
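
As a rough illustration of those suitability checks, here is a minimal Python sketch. The function name `check_suitability`, the thresholds `min_rows` and `max_missing_frac`, and the toy records are all assumptions made up for this example; the lecture does not prescribe specific cutoffs.

```python
from typing import Any


def check_suitability(
    rows: list[dict[str, Any]],
    required: list[str],
    min_rows: int,
    max_missing_frac: float,
) -> dict[str, Any]:
    """Summarize whether a data set looks usable for a given question.

    Hypothetical helper: the thresholds are illustrative, not standard values.
    """
    n = len(rows)
    # Every variable name that appears in at least one record.
    present: set[str] = set()
    for r in rows:
        present.update(r.keys())

    # Variables the question needs but the data set does not contain at all.
    missing_vars = [c for c in required if c not in present]

    # Fraction of records where each available required variable is missing (None).
    missing_frac = {}
    for col in required:
        if col in present:
            n_missing = sum(1 for r in rows if r.get(col) is None)
            missing_frac[col] = n_missing / n

    too_sparse = [c for c, f in missing_frac.items() if f > max_missing_frac]
    return {
        "enough_rows": n >= min_rows,
        "missing_variables": missing_vars,
        "missing_fraction": missing_frac,
        "too_many_missing": too_sparse,
    }


# Toy records, invented for illustration only.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 41, "income": 48000},
]
report = check_suitability(
    rows, required=["age", "income", "region"], min_rows=3, max_missing_frac=0.2
)
print(report)
```

Here the report flags `region` as a variable that would need to be collected, which is the more fundamental kind of problem, while the missing-value fractions cover the more basic one.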

The second goal of exploratory data analysis is

to start to develop a sketch of the solution.

And so if the data are appropriate for

answering your question, you can start using them to sketch

out what the answer might be, to get a sense of what it'll look like.

This can be done without any formal modeling or any kind of statistical

testing or things like that, just to get a good picture of what it might be.

The next stage, the third stage, is formal modeling.

So if your sketch kind of works out,

you've got the right data and it seems appropriate to move on,

the formal modeling phase is the way to kind of specifically write

down what questions you're asking, what parameters you're trying to estimate.

And it also provides a framework for challenging your results.

So just because you've come up with an answer in the exploratory data analysis

phase doesn't mean that it's necessarily going to be the right answer and

you need to be able to challenge your results through a variety of

approaches, whether through sensitivity analysis or other types of analysis.

So challenging your model and just developing a formal framework is really

important to making sure that you can develop robust evidence for

answering your question.
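
To make the idea of challenging a result a bit more concrete, here is a minimal Python sketch of one simple kind of sensitivity analysis: refitting a straight-line model with each observation left out in turn and seeing how much the estimated slope moves. The leave-one-out choice, the function names, and the toy numbers are assumptions for illustration; the lecture does not name a specific method.

```python
def ols_slope(xs: list[float], ys: list[float]) -> float:
    """Least-squares slope of y on x for a simple linear model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def leave_one_out_slopes(xs: list[float], ys: list[float]) -> list[float]:
    """Refit the slope once per observation, each time dropping that observation."""
    return [
        ols_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]


# Toy data, invented for illustration; the last point is a deliberate outlier.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 30.0]

full_slope = ols_slope(xs, ys)
loo = leave_one_out_slopes(xs, ys)
spread = max(loo) - min(loo)

print(f"slope on all data: {full_slope:.2f}")
print(f"leave-one-out slopes: {[round(s, 2) for s in loo]}")
print(f"spread across refits: {spread:.2f}")
```

If the slope changes a lot when a single observation is removed, as it does here for the outlier, that is a sign the answer from the first fit is not yet robust evidence.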

The next phase is interpretation. Once you've done your analysis and your

formal modeling, you wanna think about how to interpret your results, and

there are a variety of things to think about in the interpretation phase

of a data science project.

The first is to think about how your results jibe with what

you expected to find when you were first asking the question.

You also wanna think about the totality of the evidence

that you've developed.

At this point, you've probably done many different analyses,

you've probably fit many different models.

And so you have many different bits of information to think about and

part of the interpretation phase is to

assemble all that information and weigh the different pieces of evidence,

so that you know which

are more reliable and which are more important than others, and to get a sense

of the totality of evidence with respect to answering the question.

3:45

The last phase is the communication phase.

Any data science project that is successful will wanna

communicate its findings to some sort of audience.

Now that audience may be internal to your organization, it may be external,

it may be to a large audience or even just a few people.

But communicating your findings is an essential part of data science

because it informs the data analysis process and

it translates your findings into action.

So there's one last part, which is not a formal part of a data science project

necessarily, but often there will be some decision that needs to be made or

some action that needs to be taken.

And the data science project will have been conducted in support

of making a decision or taking an action.

So that last phase will depend on more than just the results of

the data analysis, and may require

inputs from many different parts of an organization or from other stakeholders.

So ultimately, if a decision is made,

the data analysis that was done will inform that decision, and

the evidence that was collected will support it.

So these are roughly the five phases of a data science project.

There's the question, there's exploratory data analysis, there's formal modeling,

there's interpretation, and there's communication.

4:59

Now, there is another approach that can be taken,

and it's very often taken in data science projects.

And that is to really start with the data and

to start with an exploratory data analysis.

So often there will be a data set available, but

it won't be immediately clear what the data set will be useful for.

So it can be useful to kind of do some exploratory data analysis, to look at

the data, to summarize it a little bit, make some plots, and see what's there.

And to generate some interesting questions based on the data.

So this is sometimes called hypothesis generating because it produces

questions that weren't already there.

Once you've produced the questions that you wanna ask,

based on your initial kind of exploratory data analysis,

it may be useful to kind of get more data or other data

to kind of do an exploratory data analysis that's specific to your question now.

And then continue with the formal modeling, interpretation and

communication.

One thing that you have to be wary of is doing the exploratory data analysis in one

data set, developing the question, and then going back to the same data set,

pretending that you hadn't done the exploratory data analysis before, and

coming at it with, say, a fresh question

that goes on to the rest of the stages.

This can often be a recipe for bias in your analysis,

because the question and the results were derived from the same data set.

So it's important to be careful about doing that and to try to obtain other data

when you're using the data to generate the questions in the first place.
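
When collecting new data isn't possible, one common workaround is to set part of the data aside before doing any exploration, and to use that untouched part later for the formal analysis. The lecture itself only says to try to obtain other data, so the holdout split below is my own illustration of the same idea, and the function name `split_explore_confirm` and its parameters are hypothetical.

```python
import random


def split_explore_confirm(records, explore_frac=0.5, seed=0):
    """Randomly split records into an exploration set and a confirmation set.

    Hypothetical helper: explore freely in the first set to generate
    questions, then run the formal analysis only on the second.
    """
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(records)    # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * explore_frac)
    return shuffled[:cut], shuffled[cut:]


# Stand-in records, invented for illustration (just integer IDs here).
records = list(range(100))
explore, confirm = split_explore_confirm(records, explore_frac=0.5, seed=42)

print(len(explore), len(confirm))
```

The split has to happen before looking at the data, otherwise the confirmation set is no longer independent of the questions it is being used to answer.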

So this is the second approach to data science, which can be very useful and can

often result in many interesting questions that are generated from the data.

Data science projects have a variety of phases and it's important to kind of

understand which phase you're in so that you know kind of how to progress and

how to move forward with any data science project.