A schema is the logical structure that governs the architecture and design of a data warehouse and lays out the relationships between entities such as fact and dimension tables. Several data warehouse schemas are widely used as best practices when building a data warehouse. These schemas have proven to be a sound basis for a robust ETL pipeline that can handle disparate incoming data sources and store them in a data warehouse.
Schemas exist for both databases and data warehouses. For the former, a schema generally describes the tables, columns, data types, and constraints within the database. The schema for a database can be compared to its relational model. In data warehouses, schemas are used to model the relationships between different tables conceptually and to define the level of complexity of the data warehouse. The three data warehouse schemas most widely used in enterprises are:
- Star Schema
- Snowflake Schema
- Galaxy Schema
The star schema can be understood as the most basic schema used to build a data warehouse. It is the least complex type of schema that can completely define a data warehouse and its entities. The star schema consists of a fact table in the center connected to different dimension tables, similar to the shape of a star, as shown below.
Each dimension in the star schema is represented by only one dimension table. Each dimension table has a primary key that appears as a foreign key in the fact table, linking every fact row to the attributes stored in that dimension. It is important to note that the dimension tables are not linked to one another; they are connected only through the fact table.
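To make the structure concrete, here is a minimal sketch of a star schema using Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_store`), the columns, and the sample rows are all hypothetical and chosen only for illustration; a real warehouse would have many more attributes and would likely run on a different engine.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical dimension tables: one table per dimension, not linked to each other.
CREATE TABLE dim_product (
    product_id    INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_name TEXT          -- denormalized: the category is stored inline
);
CREATE TABLE dim_store (
    store_id   INTEGER PRIMARY KEY,
    store_name TEXT,
    city       TEXT
);
-- Hypothetical fact table: its foreign keys are the dimensions' primary keys.
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    quantity   INTEGER,
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Espresso', 'Beverages'), (2, 'Bagel', 'Bakery');
INSERT INTO dim_store   VALUES (10, 'Downtown', 'Austin'), (20, 'Airport', 'Dallas');
INSERT INTO fact_sales  VALUES
    (100, 1, 10, 2, 6.0),
    (101, 2, 10, 1, 2.5),
    (102, 1, 20, 3, 9.0);
""")

def sales_by_category(connection):
    """Total sales amount per product category, using a single join."""
    return connection.execute("""
        SELECT p.category_name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category_name
        ORDER BY p.category_name
    """).fetchall()

star_totals = sales_by_category(conn)
print(star_totals)
```

Because the category name lives directly in `dim_product`, the report needs only one join from the fact table to the dimension.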
Data pipeline engineers tend to use the star schema because of its simple structure and lower maintenance needs. It performs well and can be applied to data warehouses of all sizes, from EDWs (Enterprise Data Warehouses) to departmental data marts. Dimension tables in the star schema are highly denormalized, so queries need fewer joins and tend to run faster, at the cost of some extra storage. Most BI tools also support the star schema, allowing seamless integration for data warehouses built on this structure.
However, because of its basic structure, there are some cons to using the star schema. Due to the denormalization in dimension tables, the same attribute values are often repeated across many rows, which introduces redundancy. In addition, the consistency and accuracy of the data might be compromised if an update reaches only some of those repeated values.
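The consistency risk can be illustrated with a small sketch, again using Python's `sqlite3` module. The `dim_product` table and its rows are hypothetical. The example deliberately applies an incomplete update to show how a denormalized dimension can end up with two spellings of what should be one category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical denormalized dimension: category_name is repeated per product.
CREATE TABLE dim_product (
    product_id    INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_name TEXT
);
INSERT INTO dim_product VALUES
    (1, 'Espresso', 'Beverages'),
    (2, 'Latte',    'Beverages'),
    (3, 'Bagel',    'Bakery');
""")

# A partial update: only one of the two 'Beverages' rows is renamed.
conn.execute("UPDATE dim_product SET category_name = 'Drinks' WHERE product_id = 1")

# The same real-world category now appears under two different names.
categories = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT category_name FROM dim_product ORDER BY category_name"
    )
]
print(categories)
```

Any report that groups by `category_name` would now split espresso and latte sales across two categories until the remaining row is corrected.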
The snowflake schema is more complex and structurally elaborate than the star schema. Here, each fact table is connected to dimension tables, as in the star schema. However, each dimension table in the snowflake schema is further related to one or more sub-dimension tables. This gives the schema the structure of a snowflake, as shown below.
The snowflake schema normalizes the denormalized dimension tables of the star schema by breaking them into further sub-dimensions. This can save considerable disk space and optimizes storage to a greater extent for enterprises. In addition, because each attribute value is stored in only one table, there is less data redundancy and greater accuracy in the overall data warehouse. It is also easier to update and make changes in the data warehouse when the snowflake schema is used.
However, as the schema becomes more complex, it needs more maintenance because more lookup tables are involved. Additionally, query performance is typically slower than with the star schema because more joins are needed to link the fact table to attributes stored in sub-dimension tables.
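The trade-off described above can be sketched with Python's `sqlite3` module. The tables (`dim_category`, `dim_product`, `fact_sales`) and their rows are hypothetical. The category is moved into its own sub-dimension table, so renaming it touches exactly one row, but reporting by category now takes two joins instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical sub-dimension: each category name is stored exactly once.
CREATE TABLE dim_category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT
);
-- The product dimension now points at the sub-dimension instead of repeating the name.
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
INSERT INTO dim_category VALUES (1, 'Beverages'), (2, 'Bakery');
INSERT INTO dim_product  VALUES (1, 'Espresso', 1), (2, 'Latte', 1), (3, 'Bagel', 2);
INSERT INTO fact_sales   VALUES (100, 1, 6.0), (101, 2, 4.5), (102, 3, 2.5);
""")

# Renaming a category is a single-row update; every product picks it up.
conn.execute("UPDATE dim_category SET category_name = 'Drinks' WHERE category_id = 1")

# The cost: reaching the category name from the fact table takes two joins.
snowflake_totals = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(snowflake_totals)
```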
The galaxy schema (or the fact constellation schema) is perhaps the most complicated schema implemented in a data warehouse. In the galaxy schema, multiple fact tables are connected to the same dimension tables. The conceptual diagram for the galaxy schema looks similar to a constellation of stars, which is the reason for its name. The dimensions that are linked to multiple fact tables are known as conformed dimensions in the galaxy schema. An illustration of the conceptual model for a galaxy schema is given below:
Galaxy schemas are mainly used in data warehouses that contain aggregated fact tables and are very complex structures. The dimensions in the galaxy schema are further divided into more dimensions, similar to the snowflake schema. However, they may also be connected to multiple fact tables, which is the main factor that distinguishes the galaxy schema from the star and snowflake schemas. Galaxy schemas also have hierarchies within the dimension tables, which adds to their complexity and sophistication.
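As a rough sketch of a conformed dimension, the following Python `sqlite3` example links two fact tables to one shared dimension. The names `fact_sales`, `fact_shipments`, and `dim_product`, along with the sample rows, are hypothetical. Each fact table is aggregated separately before being joined through the dimension; joining the two fact tables directly would multiply rows and inflate the totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical conformed dimension shared by both fact tables.
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT
);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER
);
CREATE TABLE fact_shipments (
    shipment_id   INTEGER PRIMARY KEY,
    product_id    INTEGER REFERENCES dim_product(product_id),
    units_shipped INTEGER
);
INSERT INTO dim_product    VALUES (1, 'Espresso'), (2, 'Bagel');
INSERT INTO fact_sales     VALUES (100, 1, 2), (101, 1, 3), (102, 2, 1);
INSERT INTO fact_shipments VALUES (500, 1, 10), (501, 2, 4), (502, 2, 6);
""")

# Aggregate each fact table on its own, then combine via the shared dimension.
galaxy_report = conn.execute("""
    SELECT p.product_name,
           COALESCE(s.sold, 0)    AS units_sold,
           COALESCE(h.shipped, 0) AS units_shipped
    FROM dim_product AS p
    LEFT JOIN (SELECT product_id, SUM(units_sold) AS sold
               FROM fact_sales GROUP BY product_id) AS s
           ON s.product_id = p.product_id
    LEFT JOIN (SELECT product_id, SUM(units_shipped) AS shipped
               FROM fact_shipments GROUP BY product_id) AS h
           ON h.product_id = p.product_id
    ORDER BY p.product_name
""").fetchall()
print(galaxy_report)
```

Because both fact tables use the same `product_id` keys, sales and shipments can be compared per product without any mapping between separate product lists.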
The galaxy schema is mainly used to aggregate fact tables and when the data warehouse needs flexibility and versatility. This type of schema has the added benefit of reducing data redundancy, as each dimension is separated through hierarchies and linked through pre-defined relationships between entities. Lastly, because it consists of highly normalized tables, the galaxy schema saves valuable disk space or cloud storage capacity.
However, the demerits of the galaxy schema must be acknowledged when considering its implementation. The complex structure and difficulty in maintaining a data warehouse built on the galaxy schema are the main reasons for its scarce usage across enterprises. The greater number of joins used in the schema also slows query performance, and data analysis becomes time-consuming.
While the different types of schemas are all offshoots of the conceptual model of the star schema, they differ in their level of hierarchical organization and normalization of tables. An increasing level of complexity might be required to cater to particular industry or enterprise needs. While the snowflake schema is used to divide dimensions into further sub-dimensions, the galaxy schema can be used to aggregate multiple fact tables.
Data pipeline engineers and solutions architects also need to consider the impact of using more complex schemas on query performance and data redundancy. More complicated schemas are generally slower and more difficult to set up, but they promise higher data quality and save more disk space. Thus, deciding on the type of schema to implement in the data warehouse requires extensive planning and research.