Data Modeling 101 — Explained Like you are 5!
If you want to take only 1 takeaway from this article that would be this: Data Modeling is designing and relating different tables so that the data is more optimized, more understandable, and clean.You do not need to be do it in every scenario.
I have a very contrarian view of data modeling. All the things that I mentioned above are separate from finding insights from the data.
Say for example in one my previous article, I was trying to find out my internet usage behavior using data, it was a flat file. I didn’t need to do data modeling to come up with any insights.
Having said that, we do not use data nowadays to only find insights from the data, it is used in various different ways like data streaming, automation, building database systems, creating business reports, etc.
Basics Terms and their meanings :
Remember we 2 types of tables: Fact and Dimension. Fact tables are those tables that usually change a lot and data is either added or deleted from that table frequently. While dimension tables are those which do not change that much. For eg. we have sales data for a certain location, product, salesperson, and time.
Sales would be our fact table and other tables would be our dimension table.
Based on these criteria we create a relationship between these tables, and how these tables are designed is known as schemas.
Basically, there are 3 types of schemas: Star, Snowflake, and Galaxy schema.(You can simply have one table. eg. in excel and try to find out insights from that table as well.)
In Star schema, data is organized into a fact table surrounded by multiple-dimension tables. The fact table contains the data that is being analyzed, and the dimension tables contain additional information about the data, such as date, product, or location. This schema is useful for data that can be organized into a single, central fact, with multiple dimensions that provide context.
In snowflake schema, the dimensions tables are normalized,( meaning that they are split into multiple smaller tables to remove redundancy and improve data quality). This schema is useful for data that has many dimensions with complex hierarchies and relationships.
Galaxy schema is a hybrid of the star and snowflake schemas. In a galaxy schema, the fact table is surrounded by multiple star schemas, each with its own set of dimension tables. This schema is useful for data that has multiple, independent facts that need to be analyzed together.
Overall, the type of schema used in data modeling will depend on the specific needs of the user or organization, and the nature of the data being analyzed. Different schema types have their own strengths and weaknesses, and the right schema will depend on the particular requirements of the analysis.
Cardinality and filter direction:
In the context of data modeling, relationships refer to the connections between different data tables. These connections are established using common fields, or keys, that exist in both tables.
Creating relationships between tables is important because it allows us to combine data from multiple sources and analyze it together in a single view. For example, if you have a sales data table and a customer data table, you can create a relationship between the two tables using a common field such as customer ID. This will allow you to create visualizations that show sales data for specific customers or compare sales data across different customer segments.
There are 3 types of relationships: one-to-one and one-to-many and many—many. A one-to-one relationship exists when there is a single matching record in both tables for a given key. A one-to-many relationship exists when there are multiple matching records in one table for a given key in the other table. Remember that never use many-many relationships in your data model. (If you are not watching the video: One line solution is to create a bridge table by appending the distinct values from both tables in a single column and joining the tables using the resulting column.)
Filter Direction: How the table will flow from 1 table to another, i.e from fact to dimension or on both sides. (Arrow indicated how the data will flow from 1 table to another (Image below).
Always use the single direction of filtering. If you are not watching the video: One line solution is to use the cross-filter function or if want to use it, never use it on a calendar table.)
Reasons you should learn data modeling:
In my view most important you learn in data modeling is you learn about the data itself. We get so ingrained in the day-to-day jobs that we never learn about the data. (At least I am that.)Data modeling helps you do that.
Other reasons may include :
First, data modeling helps to improve the quality and integrity of data. By organizing data into a logical and consistent structure, data modeling can help to avoid data inconsistencies and errors, and ensure that data is stored in a way that is easy to understand and use. This is particularly important for large and complex datasets, which can be difficult to manage and interpret without a clear and well-defined structure.
Second, data modeling allows users to analyze and interpret data more effectively. By creating a clear and logical representation of data and its relationships, data modeling makes it possible to explore and analyze data in new and meaningful ways. This can provide valuable insights into trends, patterns, and relationships that may not be apparent in the raw data. For example, data modeling can be used to identify trends in sales data, or to segment customers based on their purchasing behavior.
Third, data modeling is essential for building and maintaining database systems. By providing a clear and organized structure for data, data modeling helps to ensure that database systems are efficient and scalable and can support the needs of the user or organization. This is particularly important for large and complex databases, which can require careful planning and design to ensure that they are efficient and effective.
Best practices in data modeling :
1. Never use many-many relationships and bother side filtering.
2. Delete unused columns
3. Avoid calculated columns if you can
4. Naming convention should be in a way that is more understandable to the business.