In this series of articles, we will solve a Data Science business case by going through the phases that make such a project successful. The business case we will focus on is as follows:
Business Case: A financial company wishes to enter the student loan market in the United States. It has several datasets containing information on the loans originated in a given year, the universities at which those loans were taken out (number of students, location, student demographics, etc.) and the default rates associated with those loans.
As data scientists, our goal is to leverage the available data to help the company enter the market with a competitive advantage (e.g. location or the interest rates offered). In this article, we will see how to structure such a project and how to complete each phase in a way that benefits the company and increases its chances of success.
Why is it crucial to structure your project?
Before diving into our business case, it is worth recalling why a clear project structure matters, both for the success of the project and for your own standing as a Data Scientist in your company.
Companies can be reluctant to invest time, effort and money in a data science project before knowing how they will benefit from it and exactly how you, as a Data Scientist, are going to get them there. Of course, it is unrealistic to lay out a guaranteed path and outcome, as any project comes with its share of surprises. However, you still need to be able to present a project structure that gets everyone on the same page. Structuring a project comes with several advantages (which are not limited to data science projects!).
Here are my favourite ones:
1. Allows you to structure your work and be more efficient
2. Lets you manage expectations: what is a realistic outcome and in what timeframe?
3. Breaks down the project into phases where each phase comes with a deliverable.
I would like to put some emphasis on this last point, as it is often underrated even though it is, in my opinion, one of the most important. As you might know, in data science the outcome of a project is not guaranteed. Sometimes, the predictive algorithm you have been working hard on for the last two months still gives unreliable results even though you tried everything and followed every step correctly. That’s part of the job; not everything can be modelled. However, your boss won’t accept that as an answer. After all, they didn’t pay you for two months just to hear that your model cannot be used because the data is bad or insufficient, right? That is why you break down your project into phases with deliverables. The key word here is deliverables. Splitting a project into phases is already natural to many; the question you should ask yourself is:
What can I present at the end of this phase that will be of value to the company?
So, let’s go over an example of project structuring by using the business case described above. We will see how each phase of the project can lead to insights that would be valuable to the company in some way, shape or form.
Phase 1: What can you do with that data?
Ultimately, the question you are trying to answer is: what can I do with that data and how can it benefit the company?
The first part of this question calls on your technical knowledge. In other words, given the data at hand, what analyses are possible? Which models are feasible? Which techniques are appropriate? The second part calls on your business acumen. That is, given what is technically possible with the data at hand, what knowledge can we derive that will benefit the company either directly or indirectly? Can it increase profits? Can it lead to a new service that clients need? Can it optimise internal processes to increase productivity? Answering the first part of the question is what will give structure to your project. Answering the second part is what will curate that structure and help you produce the deliverables mentioned previously. However, you shouldn’t see this as a sequential process but as an iterative one that runs throughout the project.
So what would that look like in the context of our business case? Well, we are dealing with a financial company, more precisely a lending company. Therefore, the core of its business and profits will revolve around the loans it originates and the interest rates it can offer. Let’s have a quick look at the available data to see what use cases we can come up with that would allow the company to improve its lending services.
Naturally, the first idea that comes to mind is to use the historical data to build a model that predicts default rates. Default rates are a key driver of the company’s profitability. Predicting them lets the lending company set a reasonable interest rate, which in turn ensures the business is profitable while remaining competitive in the lending market. With a default rate prediction, the company can also make revenue projections and check that it is on track with its estimates.
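As a sketch of what such a model could look like, here is a minimal example on synthetic data; the feature names (student_count, median_debt, completion_rate) and the data-generating process are entirely hypothetical and will differ from the real datasets’ columns:

```python
# Minimal default-rate prediction sketch on fabricated data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
n = 500

# Hypothetical university-level features.
student_count = rng.integers(500, 40_000, n)
median_debt = rng.uniform(5_000, 30_000, n)
completion_rate = rng.uniform(0.2, 0.95, n)

# Fabricated target: default rate loosely driven by debt and completion.
default_rate = (
    0.25 - 0.2 * completion_rate + 0.000004 * median_debt
    + rng.normal(0, 0.02, n)
).clip(0, 1)

X = np.column_stack([student_count, median_debt, completion_rate])
X_train, X_test, y_train, y_test = train_test_split(
    X, default_rate, random_state=0
)

# Fit a simple regressor and report held-out error.
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Test MAE: {mae:.4f}")
```

On real data, the held-out error would be compared against the margin the company can tolerate when pricing a loan.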
As the company grows in the space, it will acquire clients of its own and the datasets will receive a stream of new data. Eventually, this streaming data can be used to continually retrain the algorithm, so that the interest rates and risk profiles adapt to offer the best products at all times. The lending market is very competitive, and lenders need to make use of quickly progressing technologies to remain profitable. Finance companies increasingly need to reinvent themselves as software companies that can keep up with the fast pace of the fintech world.
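Such continual retraining could be sketched with scikit-learn’s SGDRegressor, whose partial_fit method updates a model incrementally as each new batch of loans arrives; the batches and coefficients below are synthetic stand-ins:

```python
# Incremental retraining sketch: update the model batch by batch.
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
scaler = StandardScaler()
model = SGDRegressor(random_state=0)

for batch in range(5):  # e.g. one batch per reporting period
    X = rng.normal(size=(200, 3))  # new loan features (fabricated)
    y = X @ np.array([0.1, -0.05, 0.02]) + 0.15 + rng.normal(0, 0.01, 200)
    Xs = scaler.partial_fit(X).transform(X)  # scaling stats update too
    model.partial_fit(Xs, y)                 # incremental model update

print(model.coef_)
```

In production this loop would be driven by the data pipeline rather than a simple for-loop, but the idea is the same: the model never has to be retrained from scratch.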
Another idea is to generate university profiles. University profiling can also be interpreted as defining a risk profile for each university. Since the company is not yet an active lender in the market, it has both the burden and the freedom of finding the right market. Depending on the strategy the company is willing to adopt, it might prefer to spread risk across several small universities or serve a smaller group of large universities. The size of the university is only one factor among many others, such as the default rate, the university’s age, its location (which impacts tax payments), whether the company can open a branch in that location, etc.
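One simple way to sketch such risk profiles is to cluster universities on a few characteristics, for example with KMeans; the data and the two features below (size, default rate) are fabricated purely for illustration:

```python
# University risk-profiling sketch via clustering on fabricated data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two fabricated groups: small/high-default and large/low-default.
size = np.concatenate(
    [rng.normal(2_000, 300, 50), rng.normal(25_000, 3_000, 50)]
)
default_rate = np.concatenate(
    [rng.normal(0.12, 0.02, 50), rng.normal(0.05, 0.01, 50)]
)

# Standardise so both features weigh equally in the distance metric.
X = StandardScaler().fit_transform(np.column_stack([size, default_rate]))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Each cluster can then be summarised into a risk profile.
for k in (0, 1):
    print(k, round(size[labels == k].mean()),
          round(default_rate[labels == k].mean(), 3))
```

On the real datasets, more features (location, age, demographics) would go into the clustering, and the number of profiles would be chosen to match the company’s strategy.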
The two ideas above can now be reviewed and assessed together with fellow data scientists in the company or with management. This is a simple yet important step: everybody gets to agree on which idea should be prioritised before a plan is set up.
Phase 2: Having a closer look at the data
Something might make a lot of sense on paper and seem like the perfect idea from a business perspective, but once you look at the data you realise there is no chance it is technically achievable. This might start as a gut feeling (built on past experience), but you still have to justify such a conclusion with the data, and doing so will help you defend your case with management. The same holds true when something is technically feasible!
So what is the next step after checking what can be done with the data? Validating the technical feasibility of the idea. There are several steps here: exploratory analysis, statistical analysis, data pre-processing, etc. Let’s go over them in a bit more detail.
Exploratory & Statistical Analysis
In short, this phase fulfils two purposes: validating current hypotheses and uncovering new insights that can be used to form new hypotheses.
Exploratory analysis is a broad term describing the phase in which you take a deeper look at the features in your dataset. You might look at the feature distributions (univariate analysis) or at correlations between features (bivariate analysis). It is a good idea to include plenty of visualisations here. The goal is to get as familiar as possible with the features of your dataset: you want to look for patterns that will help you in your later analysis, and you want to communicate those findings as effectively as possible, which often means making clear graphs that are easy to interpret.
It is also in this phase that you verify the project is feasible. Say you are trying to predict the default rates as mentioned above: the dependent variable (or response) is the default rate, but what should your independent variables (predictors) be? Is there a strong relationship between the dependent variable and your features? If so, is it linear or non-linear? It is crucial to do this beforehand: you are building your case for whether it is reasonable to create a predictive model or not. A proper exploratory analysis can save your company a lot of time and money.
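A minimal sketch of such a feasibility check, on fabricated data with hypothetical column names: comparing Pearson and Spearman correlations against the default rate hints at whether a relationship is roughly linear or merely monotone:

```python
# Feasibility check sketch: how strongly do candidate features
# relate to the default rate? Data and column names are fabricated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "completion_rate": rng.uniform(0.2, 0.95, 300),
    "median_income": rng.uniform(20_000, 90_000, 300),
})
# Fabricated target with a non-linear dependence on completion_rate.
df["default_rate"] = (
    0.3 * np.exp(-3 * df["completion_rate"]) + rng.normal(0, 0.005, 300)
)

# Pearson captures linear association; Spearman captures any
# monotone association, so a gap between the two suggests curvature.
print(df.corr(method="pearson")["default_rate"])
print(df.corr(method="spearman")["default_rate"])
```

A feature with near-zero correlation under both measures is a weak candidate predictor, which is exactly the kind of evidence to bring to management before committing to a model.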
Note: This phase will be the subject of a separate article in this series, in which we will do a thorough exploratory analysis of the features of interest and try to get as many insights as possible out of the data.
Data Pre-Processing
In the data pre-processing phase you deal with missing values, outliers, and perhaps feature selection and feature engineering (although the latter is arguably already part of modelling). Data pre-processing can be tedious and sometimes repetitive, yet it is an absolutely necessary part of your data science project. Failing to deal with these characteristics of your dataset will lead to biased results at best, and plainly wrong ones at worst. This phase often takes up the largest part of the whole project and can make or break your final model, so don’t take it lightly.
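As a minimal, illustrative sketch of the kind of steps involved (the column and values are fabricated): impute missing values with the median and clip extreme values to the 1st/99th percentiles.

```python
# Pre-processing sketch: imputation and winsorisation on fake data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
debt = pd.Series(rng.normal(15_000, 4_000, 1_000))
debt.iloc[:20] = np.nan    # simulate missing values
debt.iloc[20] = 1_000_000  # simulate a data-entry outlier

# Impute with the median; this choice should be documented and
# justified, as other strategies (mean, model-based) may fit better.
debt = debt.fillna(debt.median())

# Clip extreme values to the 1st/99th percentiles (winsorisation).
low, high = debt.quantile([0.01, 0.99])
debt = debt.clip(low, high)

print(debt.isna().sum(), round(debt.max()))
```

Each of these choices (median vs. mean, clipping vs. dropping rows) is exactly the kind of decision worth surfacing to the rest of the team, as argued below.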
In addition, it can be tempting to think that you don’t need to share the work you do in this phase with management. I believe that is the wrong approach. The decisions you make here (e.g. whether to include certain features, how you deal with missing values, etc.) will have a huge impact on the final algorithm, and everyone involved in the project should be made aware of the reasoning behind each decision. See it as a way of making your knowledge explicit for the company and of ensuring the long-term success of data science projects there. Moreover, some decisions that make sense from a technical point of view do not make sense from a business perspective. It is your job to meet business needs while remaining scientific in your approach.
Note: This phase will be the subject of a separate article in this series, in which we will go in depth with the data pre-processing to showcase the kind of decisions that can be made when handling missing values, extreme values, etc.
Phase 3: Modelling the data
Following the data exploration and pre-processing, we are ready to move on to the modelling phase. This phase varies a lot from one use case to another; therefore, a separate article in this series will be dedicated to the design and implementation of both the university risk profiling and the default rate predictive model. As with the two previous phases, I would like to emphasise communication: make all your decisions explicit and share any key progress with your team and management.
In conclusion, the structure of a data science project can be outlined by the following three phases and their deliverables:
1. What can you do with the data at hand?
- Contribute to the design of both the short- and long-term data strategy for the company and help position data science as part of its business
- Document the datasets
- Design a plan for the rest of the project
2. Having a closer look at the data
- Validate current hypotheses provided by domain experts or empirical observations
- Generate new hypotheses by observing patterns in the data and relations between variables
- Verify the theoretical feasibility of the project with a statistical analysis
- Create visualisations to ease the interpretation of your data and findings (this could later evolve into dashboards)
3. Modelling the data
- Come up with client profiles (risk profiles in our business case)
- Improve services (better lending product in our business case)
That’s it for this article, I hope you enjoyed the content! The business side of a data science project is sometimes overlooked, and it is good to remind ourselves as data scientists that every completed task needs to bring some value, even in a project whose outcome is uncertain and which comes closer to research.
The next article will go over the data pre-processing (part of Phase 2) in the context of our business case, in which we are data scientists for an upcoming financial company in the US student loan market. Stay tuned!