ITEC 621 Predictive Analytics

Predictive Analytics Project

Prof. Espinosa – Last updated 7/17/2021

Background

The main goal of this project is to help you prepare for your practicum projects by giving you an

opportunity to put into practice what you have learned in class. The predictive analytics project

will be done in teams of at most 4 students. It is expected that all team members will contribute

equally and that everyone will take the opportunity to learn from each other. Business analytics is

not just about analyzing data. It requires teamwork and a compelling upfront articulation of the

specific business problem or analytics question being addressed; and a clear and concise report of

the findings and conclusion. We will follow the Cross Industry Standard Process for Data Mining

(CRISP-DM) framework for this project, which maps closely to INFORMS’ Job Task Analysis (JTA)

(http://info.informs.org/analytics-body-of-knowledge; Amazon), and it is a popular framework for

analytics projects.

CRISP-DM Overview

In essence, the CRISP-DM framework (see lecture slides) includes the following activities, which

we will adopt for the project:

• Business Understanding (CRISP-DM 1)

o (JTA Domain I) Formulate the business question to be answered or problem to be solved.

All business analytics projects must be driven by business needs or business value

propositions. This requires the articulation of the respective business case, leading to the

articulation of the business question or problem.

o (JTA Domain II) Translate business question into the respective analytics question. Not all

business questions or problems are amenable to analytics solutions. The project report must

specify how or why analytics is the appropriate approach to address the business question or

problem.

• Data Understanding (CRISP-DM 2)

o Data acquisition and pre-processing can take as much as 80% of the analytics project effort.

o (JTA Domain III) Acquire and identify relationships in the data. This step involves acquiring

the data (e.g., ETL or Extract-Transform-Load) and then doing a substantial amount of

descriptive analytics, including things like (as appropriate): descriptive statistics, correlation

analysis, ANOVA, distribution curves, visual plots and other graphs, and other related

analysis (e.g., cluster analysis). Predictive analytic modeling should not start until you have

developed a thorough understanding of the data. In fact, this phase may uncover issues and

relationships in the data that you did not anticipate, thus leading to reformulation of the

analytics question.

• Data Preparation (CRISP-DM 3)

o (JTA Domain III) Harmonize, re-scale and clean data, as needed. Data sets often need to be

split, merged, sub-sampled (for large data sets), and cleansed. This step involves all data

pre-processing activities, such as: re-structuring the data (e.g., normalizing scales, centering,

aggregating, etc.); addressing issues of missing data; and acquiring and merging other

related data.

• Modeling (CRISP-DM 4)

o Select the appropriate analysis methodology and tools, exploring various model

specifications, and then building the respective models. In this course we use R as the

primary analytical tool.

o (JTA Domain IV) Methodology Selection. The vast majority of the course is focused on

method selection (e.g., OLS regression, Logistic regression, Ridge or LASSO, trees, etc.).

Candidate models should be identified based on the analytics goals (interpretation, inference

and/or prediction). For this project, students need to focus on models that are relatively

interpretable and then select the model that has the best predictive accuracy, based on

cross-validation test error or deviance.

o (JTA Domain V) Model Building. Another area of focus in this course is on model

specification (e.g., linear, polynomial, interactions, variable selection, etc.). The initial set of

predictors to be used in the model must be driven by business domain knowledge. But then

this set should be narrowed down or refined using statistical methods like cross-validation

testing.

• Evaluation (CRISP-DM 5)

o This phase is not about evaluating the models; that happens in the Modeling phase above.

This phase is about evaluating the extent to which the analysis has answered the business

and analytics questions framed in phase 1. For this project, we will focus on the following:

o Interpretation of Results: the final project reports must provide very focused interpretation

of results, in terms of effects observed, fit statistics, and predictive power of the final

model.

o An important part of this interpretation is providing a well-documented answer to the

business and analytics question.

o It is also important that you tell a compelling story in your report. Storytelling is one of the

most important skills in business analytics. Remember, this is not a statistics class, but a

business class. You must tell a compelling story for your audience. The story must be backed

up by your findings.

• Deployment (CRISP-DM 6)

o (JTA Domain VI) For this project, deployment will focus on turning in your written report,

with the necessary interpretation and stories articulated in step 5 above.

Important note: not all projects lead to amazing findings. A model that shows no effects can offer

very interesting insights. It all depends on how you rationalize the lack of effects from a business

point of view. Along the same lines, this project is not so much about what you analyzed and

found, but about how effectively you described to your readers the motivation for your study,

your method evaluation and selection process, and the implications of your findings from a

business perspective.

Data

Any dataset not used in class for lectures, exercises or homework can be used for this project.

Students are expected to identify an interesting external data set to work with. In the past, many

students have used Kaggle data sets used in competitions, but there are many sources of public

data. Proprietary data sets can only be used with permission of the owner of the data set. It is OK

to use data from your practicums, if you have it, and use this project as an opportunity to work

with your client’s data. Unless the data is proprietary, teams must submit the actual datasets with

their final projects so that the professor can replicate some of your work when grading.

Requirements

All projects must evaluate 3 different modeling methods (e.g., OLS, Ridge, Logistic, LDA, trees,

etc.) with 2 different model specifications for each, (e.g., different predictor subsets; polynomial,

log or other transformations; interactions, etc.).

IMPORTANT: the 2 model specifications selected above should be used in each of the 3 modeling

methods above. The best approach is to fit the first model using OLS or Logistic regression, using

both model specifications. Then, depending on your results and assumption testing, fit the same 2

specifications using two other modeling methods.

IMPORTANT: all team members must contribute their fair share of the analysis. I expect each

member to take the lead on one particular modeling method or transformations. I will be

surveying the team during the semester to evaluate how each member contributed to the project.

IMPORTANT: while you will be evaluating and testing 6 different models (3 model methods x 2

specifications), you should only report on the final modeling method and specification selected, but

you must close the loop and re-fit your final model with the full dataset. There is no need to

report on all alternative models. You only need to discuss your model selection process, including

any fit statistics and cross-validation test results that guided your final selection. However, if you

wish to include output from alternative models and specifications, you can do that in an appendix.
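
The selection process described above (fit every method with every specification, compare cross-validation test error, then re-fit the winner on the full dataset) can be sketched as follows. The course uses R; this sketch is written in plain Python only so that it is self-contained, and the same loop carries over to R. Everything in it is an illustrative assumption: the simulated data, the helper names (`solve`, `fit_linear`, `cv_mse`), the ridge penalty of 1.0, and the 5 folds. Only two methods (OLS and ridge) are shown to keep it short; a third method would need its own fit and predict functions, evaluated on the same folds.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_linear(X, y, lam=0.0):
    """Least squares via the normal equations; lam > 0 adds a ridge penalty
    to every coefficient except the intercept (lam = 0 is plain OLS)."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    for j in range(1, p):
        XtX[j][j] += lam
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

def predict(beta, X):
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X]

def cv_mse(X, y, lam, k=5, seed=1):
    """k-fold cross-validation test MSE; the same seed gives every model the same folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        beta = fit_linear([X[i] for i in train], [y[i] for i in train], lam)
        preds = predict(beta, [X[i] for i in fold])
        sq_err += sum((y[i] - p) ** 2 for i, p in zip(fold, preds))
    return sq_err / len(y)

# Simulated data (hypothetical): the outcome depends on x quadratically.
rng = random.Random(42)
x = [rng.uniform(-3, 3) for _ in range(200)]
y = [1.0 + 2.0 * xi + 1.5 * xi ** 2 + rng.gauss(0, 1) for xi in x]

# Two specifications, each used with every method.
specs = {
    "linear": [[xi] for xi in x],
    "quadratic": [[xi, xi ** 2] for xi in x],
}
# Methods differ only in the penalty here; predictors are left unscaled
# for brevity, although ridge is normally fit on standardized predictors.
methods = {"OLS": 0.0, "Ridge": 1.0}

results = {(m, s): cv_mse(X, y, lam)
           for m, lam in methods.items() for s, X in specs.items()}
best = min(results, key=results.get)

# Close the loop: re-fit the selected method and specification on the full dataset.
final_beta = fit_linear(specs[best[1]], y, methods[best[0]])
```

Printing `results` gives one cross-validation MSE per method and specification pair, which is the kind of evidence the report should cite when explaining the final selection.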

Project Deliverables

This project has 5 deliverables:

Deliverable 1 (5 pts): Project Proposal (1 page, single-spaced)

A project proposal is due around the mid-semester point, per the class schedule. The goal in this

deliverable is to get you started on your project early and provide me with an idea of the direction

you are planning to take in your project. It is also an opportunity for me to give you feedback on

your project ideas. The proposal should contain the following sections:

(1) The business case – a brief rationale about the importance of this question/problem from a

business perspective. What is the value proposition of your project? The business case is the

motivation for your study. Why is this study important? And why should your client or

company devote resources to carry out the study? A business case should provide a convincing

statement articulating things like: how/why is the study important to your client? What are the

benefits that your study will provide? Or, what are the opportunity costs if you don’t carry out

the study? Business cases are most effective when your narrative is: specific, based on facts or

data, concise and to the point. By specific, we mean that it should be specific to your project.

(2) The business question – the business case should lead to one or many interesting business

questions to pursue in your project (e.g., how can we control the spread of an epidemic most

effectively?). In a real project, there will probably be more than one business question to

address, but for this project we encourage you to focus on a single business question.

(3) The analytics question – Not all business questions can be answered with analytics.

Translating a business question into an analytics question is simply providing a more detailed

formulation of the business question, such that the question is answerable through analytics. If

answering the business question requires that you analyze data, then your question is

answerable through analytics. Otherwise, it is not. Think of the analytics question as the verbal

translation of your predictive model. For example, if your analytics model is likely to be

something like Y ~ Focal Predictors (of interest) + Other Predictors (controls), and your focal

predictors are X1 and X2, then your analytics question would read somewhat like this: “In this

study we are interested in understanding the effect that X1 and X2 have on Y.” That’s it. This will

guide your model specification. Notice that we don’t need to discuss all predictors, just the

focal predictors of interest to the study.

The analytics question should be more specifically tailored to the outcome variable you will be

using in your models (e.g., how do population density, sanitation conditions and general

population health affect the spread of an epidemic?). The analytics question will lead to either

a quantitative or classification method. Although you can change this later, at this point, you

should discuss whether your analytics question is about a quantitative or classification outcome.

This will lead you into the correct modeling approach in the next deliverable. The effective

articulation of the analytics question should set you in the right direction to start building your

model; and

(4) Dataset(s) – Identify one or more possible datasets for the project. The more specific the

datasets you are contemplating the better.

Deliverable 2 (10 pts): Preliminary Data Analysis Report (2 pages of text,

single-spaced, plus appendices with R output as needed)

This deliverable is intended to get you started early on your project model method and

specification exploration. It is also meant to get you familiarized with the project data. You should

think of this deliverable as an early draft of your final report. It is also one last opportunity to get

feedback on the direction of your project.

Because all model explorations begin with either an OLS regression (for quantitative predictions)

or a Logistic regression (for classification predictions), this preliminary data analysis report will

include the following:

(1) IMPORTANT: your main text should only contain narratives. Place all statistical output and

plots in appendices. All appendices must be appropriately referenced in the main text.

(2) Revise and refine your project proposal as needed. More specifically, refine your business

case, business question and analytics question, as needed. Your deliverable 2 report must

include these revised items.

(3) Brief description of your dataset. In Deliverable 1 you discussed possible datasets to use. For

this deliverable, you must have settled on the specific dataset you will use in your project. You

don’t need to provide a full description of the dataset yet, but you need to provide enough

information for your professor to understand what you are analyzing. No need to provide

extensive descriptions, just the data source and the main variables you included in your

preliminary analysis. For each variable, please describe its respective variable type, unit of

measurement, and a brief description of the variable.

(4) Descriptive analytics. You must provide a brief discussion of the respective descriptive

statistics, correlation analysis, ANOVA and/or any plots you may have rendered to understand

the data and how variables relate to each other. The text in this section should be limited to a

brief analysis of the most salient aspects of this analysis. Provide a brief narrative of what you

learned from your descriptive analytics.

(5) Define an initial set of predictors for your model. These predictors must be variables in your

dataset and must be selected using business domain rationale. The initial set of predictors

should NOT be selected statistically, but you must articulate your rationale for why you chose

your initial set of predictors.

(6) If your analytics question is quantitative, run an OLS regression. If your analytics question is a

classification, run a Logistic regression. In either case you must include the predictors

identified above. Later in the project you will refine this initial set of predictors through

variable selection, best subsets, or other methods.

(7) Inspect residual and other regression plots, as appropriate, and conduct the necessary tests to

evaluate adherence to the OLS or Logit regression assumptions (e.g., multicollinearity, serial

correlation if there is time data, heteroscedasticity, linearity, etc.).

(8) Provide a brief statement of your conclusion.
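
As one illustration of the assumption checks in item (7), multicollinearity is often screened with variance inflation factors (VIFs). The sketch below relies on the identity that the VIF of predictor j equals the j-th diagonal element of the inverse of the predictors' correlation matrix. It is written in plain Python so that it is self-contained (the course itself uses R), and the simulated data and the helper names (`pearson`, `invert`, `vif`) are assumptions made for this example only.

```python
import random

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def invert(A):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        d = M[col][col]
        M[col] = [v / d for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def vif(columns):
    """Return {predictor name: VIF} from the inverse correlation matrix."""
    names = list(columns)
    R = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    R_inv = invert(R)
    return {name: R_inv[i][i] for i, name in enumerate(names)}

# Simulated predictors (hypothetical): x2 is nearly a copy of x1, x3 is unrelated.
rng = random.Random(7)
x1 = [rng.gauss(0, 1) for _ in range(300)]
x2 = [a + rng.gauss(0, 0.3) for a in x1]
x3 = [rng.gauss(0, 1) for _ in range(300)]

v = vif({"x1": x1, "x2": x2, "x3": x3})
```

Here `x1` and `x2` receive large VIFs while `x3` stays close to 1. A common rule of thumb treats values above 5 or 10 as a sign that a predictor is largely explained by the others, although the threshold is a judgment call.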

Deliverable 3 (0 pts): Meet with Professor. This deliverable does not have any points

assigned but it is mandatory for the ENTIRE team. All teams must schedule a meeting with the

professor shortly after submitting Deliverable 2. This is an important step for the professor to ask

you questions about your project and for you to get additional feedback and guidance on your

project.

Deliverable 4 (65 pts): Final Report (4 to 5 pages of text, single-spaced, plus appendices

with R output as needed)

IMPORTANT: as it should be clear by now, one important learning objective in the MS Analytics

program is being able to interpret analytics results and articulate them clearly to a business

audience. The market calls this “storytelling” and it boils down to writing concisely, to the point

and clearly, what your results mean for the business of your client. This involves things like

interpretations of statistical output and telling a good story. Avoid grandiose statements and fluff.

Get to the point right away because the space is limited and business people like succinct but

informational writing.

The final project report will be submitted as an analytics report prepared in MS Word or knitted

with R Markdown as a Word or PDF document. Most of these sections should be an extension of

your Preliminary Data Analysis Report above. The final project report will contain the following

sections:

(1) (10 pts.) A brief but compelling business case. Combine (1), (2) and (3) from your

proposal into one coherent statement (or in subsections) discussing:

a) What is the business problem and/or business question, which your study seeks to solve or

answer?

b) The rationale about the importance of this question/problem from a business perspective.

Why is the problem you are analyzing important? That is, what is the value proposition of

your project to your client or managers?

c) Articulate the analytics question in very specific terms. Note that the business question

and the analytics question are related, but they are not the same. The business question

does not need to discuss variables in detail, but should articulate the general area of business

inquiry. In contrast, the analytics question should be very specific and needs to clearly

state: (1) the type of problem you are addressing (i.e., quantitative or classification); (2) the

outcome (variable) you are predicting; and (3) the focal predictors of interest (not all of

them).

(2) (5 pts.) A description of the dataset utilized for the analysis (if the data set is not available in

an R package or public web site, the data set must be attached). Your data description should

be sufficient for your reading audience to understand your data set, variables and the

interpretations you provide in your report, including variable types and units of measurement.

The data description should be accompanied by any necessary descriptive analytics artifacts

necessary for your predictive modeling (e.g., descriptive statistics, correlation matrix,

correlation plots, other plots, etc.).

(3) (10 pts.) Descriptive Analytics: Brief analysis of the study variables, from both business and

statistical perspectives.

a) First, clearly identify and describe your outcome variable(s).

b) Then specify and briefly describe your main predictors. You don’t need to discuss all

predictors in this section, just the ones that are most central to your analytics question and

business problem. You will be selecting the final predictors later, but before you do that, it

is important to have a business rationale for including them.

c) Briefly discuss any important aspects uncovered by your descriptive analytics of the data

(i.e., visual plots, descriptive statistics, correlations, etc.)

d) Finally, provide a brief discussion of any pre-processing (e.g., grouping, combining

variables, etc.) and transformations (e.g., logs, standardization, non-linear terms, etc.)

you applied to some of the variables, if any, along with the rationale for the

appropriateness of each transformation (e.g., normality, non-linearity,

non-continuous, etc.). Again, you will be selecting your model specifications later, but you

want to do some descriptive analytics early to spot any issues with the data that may

require transformations.

Please include all the necessary plots, descriptive statistics, correlation matrices, etc. in an

appendix. Do not include R output in the main text.

(4) (10 pts.) A discussion of the (a) analytics methods and (b) model specifications you evaluated

and selected. All methods used must be appropriate and relevant to the problem and you

need to provide a justification for the selected methods based on:

(a) Conformance with or departure from OLS and/or Logistic regression assumptions, based on

visual inspections and assumption tests.

(b) Predictive accuracy based on cross-validation test statistics. Similarly, the particular model

specifications utilized must have a rationale. For example, if you chose a quadratic

regression specification, you must have some rationale for the respective non-linear

relationship. All projects must be analyzed with a variety of appropriate models with

different model specifications. Please consult with me if in doubt, but these are the

minimum requirements.

(5) (10 pts.) Analysis and presentation of results. Your analysis and results need to contain some

narrative to allow your audience to understand what you did. A simple output and diagram

dump with no explanation will receive very little credit. Every procedure, output and diagram

needs to be briefly but appropriately introduced beforehand and briefly interpreted

afterward. Don’t leave it up to the reader to interpret what you did. Also, vague and general

discussions of results will receive little credit. Your narrative of results should be factual and

specific, so it needs to be backed up by fit statistics, coefficient values and significance, etc.

(6) (10 pts.) A short section with final thoughts, conclusions and lessons learned. Business

analytics is about gaining insights from business data for decision making. This is the section

for you to articulate what insights you gained from your analysis. These conclusions must

contain a discussion of:

a) The main conclusions of your analysis. These conclusions must answer/solve your analytics

question/problem stated in (1) above. Please be concise and discuss the main

insights you obtained from your analysis.

b) A brief statement of the main issues and challenges you faced in this project and what you

learned from it, including things like: data issues, methodological challenges, do’s and

don’ts, what you learned from this experience. You don’t need to address all of this. But

please be thoughtful and make it interesting.

(7) (10 pts.) Writing Quality, Formatting and Presentation. Analytics projects, no matter how good

they are, are not useful unless the analytics report is well written and clearly articulated.

Nobody wants to see a bunch of statistical output without sound commentary about the

results and their implications for business. Consequently, heavy weight will be placed on the

attractiveness, presentation and writing clarity of the report, which should be free of grammatical errors and typos.

More importantly, the entire report needs to flow and be understandable to your audience.

Deliverable 5 (10 pts): Brief Presentation to the Class (5 to 6 slides of content)

Each team will have about 10 to 12 minutes to share with the class its business

question/problem, model selection, and conclusions. All presentations must follow this format

(approximately one slide per bullet):

• Title slide with project name and team members’ names

• Business problem or analytics question addressed in the study with a short statement of the

business case

• Brief description of the dataset (describe any relevant aspects of descriptive statistics,

correlations, visual plot inspections, and pre-processing or transformations, as appropriate)

• Brief explanation of your model selection process and alternatives, along with the respective

model specifications.

• Discussion of the most relevant results. No need to discuss all results, just important ones.

• Final conclusions about implications of your findings

• Brief articulation of the challenges you encountered in your project.

Teamwork (10 pts): The instructor will make an assessment of how well the team worked

together. This project is not only about carrying out an analytics exercise, but also about getting

some experience working as a team, as you will do in your professional work. Some of this grading will

be based on the team as a whole (i.e., how well the team collaborated, distributed assignments

and worked together professionally); and some of it will be individual, based on team evaluations

and the professor’s observations about the fair share and quality contributions of each member.
