AM Cheat Sheet
BUILDING A FUTURE PROOF DATA WAREHOUSE
TDWI 2007 AMSTERDAM

Anchor Modeling

ANCHOR MODELING, REGARDT AND RÖNNBÄCK, WWW.INTELLIBIS.SE

FLEXIBILITY
The environment surrounding a data warehouse is in constant change. Anchor modeling is built on this premise, such that a large change on the outside will result in a small change within.

INDEPENDENCE
The model itself is independent of business logic. Rules are descriptive rather than physical to increase the longevity of the data warehouse. You have the power to decide how the data should be interpreted.

SCOPING
This modular data warehouse modeling technique supports separation of concerns and simplifies project scoping. You can start small with prototyping and later grow into your enterprise data warehouse.

MODULARITY
Data from different functional units within a business are stored in the data warehouse as self-contained areas. They can be implemented at different times.

EXTENSION
Every change is implemented as an independent extension in the existing data warehouse model. This means that current applications will not be affected.

MAINTAINABILITY
Consistent use of a simple modeling technique like anchor modeling will yield conformity and increase maintainability across your data warehouses.

Where did it all come from?
Background
Anchor Modeling is built upon two techniques both discovered in the 1970s: the sixth normal form and entity relationships. In more recent years the sixth normal form has been discussed with respect to storing temporal data. Also, entity relationship modeling has evolved into enhanced entity relationship modeling, which adds semantic modeling concepts such as typing.

What building blocks are used?
Constituents
There are only four different types of tables used in anchor modeling. Three types (anchors, attributes, and ties) are sufficient for all modeling needs, but for practical reasons a fourth type, knots, is added. It allows for a simple physical implementation of the semantic concepts found in enhanced entity relationship modeling. We have introduced our own symbol for knots, but the others are commonly used and taken from entity relationship modeling.

Who holds the identities of entities?
Anchors
An anchor holds the identity of an entity in the data warehouse. This identity is always technical by nature and represented by a surrogate key, rather than the natural key.
An anchor must contain a surrogate key and should contain meta information, such as information relating an identity to the batch and source that generated it.

Where is actual data stored?
Attributes
Attributes always belong to an anchor. They hold actual attribute values that can be used to describe the entity whose identity is stored in the anchor. The value stored in an attribute can be of any data type. Note that the cardinality of an attribute may be less than that of the anchor.
An attribute must contain a foreign key referencing the corresponding surrogate key found in the anchor and exactly one attribute value. It should contain meta information similar to that found in the anchor. If, for a given identity, the attribute may change over time, it should also contain historization information. If the attribute has a state or a type, it can also hold foreign keys to knots.

How do you relate entities?
Ties
Relationships between entities are modeled as ties between anchors. A tie will thereby relate identities to each other. The most common form is to relate two anchors, but there is no theoretical limit to how many anchors can be connected with a single tie.
A tie must contain the foreign keys of the adjoining anchors. If, for the given entities, the relationship may change over time, it should also contain historization information. Further, it should also contain meta information and may have foreign keys to knots if the relationship has a type or state.

What simplifications can be made?
Knots
In order to simplify the modeling, a combination of an anchor and an attribute may be assembled into a knot. Knots are used for typing or representing states in the data warehouse. Note that since a knot contains both its identity and its value, it may never change over time.
A knot contains a surrogate key representing its identity as well as an attribute value. It should also contain meta information if it is built from source data, but this may be left out if it is built by hand.

How do you name the tables?
Naming Conventions
A naming scheme based on prefixes can be used to give a good overview when looking at a database or model. It also avoids bad models since you will have trouble naming your tables if you are not designing them correctly.

CA_Cat
Anchors have a two letter prefix followed by an underscore and a descriptive camel cased name.

CACOL_CatColor
Attributes have a five letter prefix, where the first two letters indicate to which anchor it belongs.

CADO_Cat_Dog
Ties have a four, or 2n, letter prefix built up from the prefixes of the adjoining anchors. Note the extra underscore in the descriptive part to stress the fact that it is a tie.

FRI_Friendliness
Knots have a three letter prefix to separate them from the other kinds. Note that attributes or ties relating to knots do so without any change to their prefixes.

What does a model look like?
Example
In this simple example we will model a cat and a dog as separate anchors with a typed tie between them. Note that we can leave out the descriptive part in the logical model if we make sure that the prefixes are unique (recommended).

[Diagram: anchors CA and DO, attributes CACOL, CAWGT, DOCOL and DOWGT, tie CADO, knot FRI]

The above model contains information about cats and dogs in a many-to-many tie, their colors and weights as attributes, as well as how friendly different individuals are with each other as a knot on the relation. Weight most likely needs to be historized, whereas color does not.

Could it have been different?
Generalization
We could also have made a generalized model based on an animal anchor, AN, with an animal type attribute, ANTYP. Then the tie for friendliness would relate the anchor to itself, ANAN. We would also have to describe the fact that there are colors only valid for cats, like tabby, in addition to the model. Business rules must in general be documented separately from the model and implementation.

How do you implement the model?
Implementation
Every object from the logical anchor model is implemented as its own table in the database. To ensure that duplicates can never enter the data warehouse, each table must have a primary key (p below) which guarantees uniqueness. Historization information must therefore always be a part of it. Foreign keys can be declared to ensure integrity.

CA_Cat
  p CA_ID
    _metadata

CACOL_CatColor
  p CA_ID
    CACOL_Color
    _metadata

CAWGT_CatWeight
  p CA_ID
  p CAWGT_From
    CAWGT_Weight
    _metadata

CADO_Cat_Dog
  p CA_ID
  p DO_ID
  p CADO_From
    FRI_ID
    _metadata

FRI_Friendliness
  p FRI_ID
    FRI_Degree

How do you load data?
Zero Update Strategy
We recommend using only selects and inserts when loading the warehouse and never updating any rows. This will allow you to write a simple script using deletes to revert faulty data, since there will always be a one-to-one mapping between the data loaded in a batch and actual rows in the database. Streaming ETL tools can be used to fill many tables with just one scan of the source table.

Are there ways to speed queries up?
Indexing
The best query performance is achieved by creating clustered indexes over the primary keys. Since they are clustered, index surrogate keys in ascending order and historization columns in descending order. That way, if you look at how data is physically structured on your storage media, the latest version for any given key is always found first.

When do you create surrogate keys?
Identity Management
Proper identity management is key to success in a data warehouse. For anchor models there are two ways to achieve it. To connect the source with the surrogate key, you can either persistently store the natural key in the warehouse and generate new surrogate keys from there (late), or you can build the connection in your ETL process without the need for storing extra information (early).

How can I write simpler queries?
Collapsing Views
It is possible to create views that collapse anchors, attributes and ties into their corresponding third normal form, which you can query instead of directly accessing the anchor model. Most query optimizers will figure out which columns you are using and discard all other attribute tables, even though they are joined in the view. The commonly used ones are:

Historically correct view
This view will keep the historization information so that you can analyze changes over time.

Latest view
This view will find and show only the latest version of the attribute for any given anchor identity.

Point in time view
This view, or usually a table valued function, will take a point in time as an argument and return the latest view with respect to it.

Historization information is not propagated into the last two views. A left join from the anchor to the attribute with a subselect to find the latest (max) version normally gives the best performance.

What were the benefits again?
✦ Historization by design: slowly changing dimensions to rapidly changing relations
✦ Elimination of NULL: one way referential integrity for missing attributes
✦ Orphan handling: early arriving facts can be added without existing parents
✦ Separation of concerns: access control, project scoping, and gradual extension
✦ High performance querying
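
The sheet's table layout, zero update strategy and latest view can be tried out in a few lines. The following is a minimal sketch using SQLite from Python, not the authors' implementation. The table and column names for the cat anchor and its attributes come from the example in this sheet. Everything else is an assumption made for illustration: Batch_ID stands in for the _metadata columns, Latest_Cat is a hypothetical view name, and build_demo, latest, weight_at and revert_batch are hypothetical helpers. Dates are stored as ISO text so that they sort correctly.

```python
import sqlite3

# Anchor and attribute tables follow the sheet's example.
# Batch_ID is a hypothetical stand-in for the "_metadata" columns.
SCHEMA = """
CREATE TABLE CA_Cat (
    CA_ID    INTEGER PRIMARY KEY,
    Batch_ID INTEGER NOT NULL
);
CREATE TABLE CACOL_CatColor (
    CA_ID       INTEGER PRIMARY KEY REFERENCES CA_Cat (CA_ID),
    CACOL_Color TEXT    NOT NULL,
    Batch_ID    INTEGER NOT NULL
);
CREATE TABLE CAWGT_CatWeight (
    CA_ID        INTEGER NOT NULL REFERENCES CA_Cat (CA_ID),
    CAWGT_From   TEXT    NOT NULL,  -- historization column, part of the key
    CAWGT_Weight REAL    NOT NULL,
    Batch_ID     INTEGER NOT NULL,
    PRIMARY KEY (CA_ID, CAWGT_From)
);
-- Latest view (hypothetical name): left joins from the anchor, with a
-- subselect that picks the max historization value per identity.
CREATE VIEW Latest_Cat AS
SELECT ca.CA_ID, col.CACOL_Color, wgt.CAWGT_Weight
FROM CA_Cat ca
LEFT JOIN CACOL_CatColor col ON col.CA_ID = ca.CA_ID
LEFT JOIN CAWGT_CatWeight wgt
       ON wgt.CA_ID = ca.CA_ID
      AND wgt.CAWGT_From = (SELECT MAX(w.CAWGT_From)
                            FROM CAWGT_CatWeight w
                            WHERE w.CA_ID = ca.CA_ID);
"""


def build_demo():
    """Create an in-memory warehouse and load two batches using inserts only."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    # Batch 1: two cats; only cat 1 has a color and a weight.
    con.execute("INSERT INTO CA_Cat VALUES (1, 1)")
    con.execute("INSERT INTO CA_Cat VALUES (2, 1)")
    con.execute("INSERT INTO CACOL_CatColor VALUES (1, 'tabby', 1)")
    con.execute("INSERT INTO CAWGT_CatWeight VALUES (1, '2007-01-01', 4.0, 1)")
    # Batch 2: a changed weight arrives as a new row, never as an update.
    con.execute("INSERT INTO CAWGT_CatWeight VALUES (1, '2007-06-01', 4.5, 2)")
    return con


def latest(con):
    """Return the rows of the latest view, ordered by anchor identity."""
    return con.execute(
        "SELECT CA_ID, CACOL_Color, CAWGT_Weight FROM Latest_Cat ORDER BY CA_ID"
    ).fetchall()


def weight_at(con, ca_id, when):
    """Point in time lookup: the weight valid for one cat at the given date."""
    row = con.execute(
        "SELECT CAWGT_Weight FROM CAWGT_CatWeight "
        "WHERE CA_ID = ? AND CAWGT_From <= ? "
        "ORDER BY CAWGT_From DESC LIMIT 1",
        (ca_id, when),
    ).fetchone()
    return row[0] if row else None


def revert_batch(con, batch_id):
    """Undo a faulty batch with plain deletes, attributes before the anchor."""
    for table in ("CAWGT_CatWeight", "CACOL_CatColor", "CA_Cat"):
        con.execute(f"DELETE FROM {table} WHERE Batch_ID = ?", (batch_id,))
```

After build_demo(), latest(con) shows cat 1 with its newest weight and cat 2 with empty color and weight, since missing attributes simply have no rows. Calling revert_batch(con, 2) removes exactly the rows that batch 2 inserted, so the latest view falls back to the earlier weight without any update statement.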