Data Vault Ensemble Modeling Standards
- Data Vault and Ensemble Modeling are based on Core Business Concepts (CBCs). These are the fundamental "Person, Event, Thing, Place, Other Concept" used in the day-to-day operations of the organization, that is, in the running of the business. They are the terms spoken, the subjects of emails, the nouns in the questions asked, and the concepts most commonly referenced in the organization.
- Data Vault and Ensemble Modeling models Ensembles based on the Core Business Concepts. The object that we model in these techniques, the pattern for capturing the CBC, is the Ensemble. The Ensemble differs from the Entity or Dimension in that Ensembles contain multiple parts (more than one table).
- Each Ensemble contains one instance identifier (some form of key that uniquely identifies an instance throughout the enterprise), the relationships that it drives (relationships between it and other Ensembles), and the descriptive context that describes (and depends on) that instance/key.
- The Data Vault Ensemble instance identifier is a Hub. The Hub contains only a unique identifier. The original idea was that this identifier would be based on an enterprise-wide unique, natural business key. However, since there are rarely any natural business keys that are adequately consistent, and effectively none that are enterprise-wide unique, the actual uniqueness of a Hub is represented by a multi-part (concatenated) key. The primary key of the Hub is either a data platform (for example, data warehouse) surrogate key such as a sequence ID (SID), the business key itself, or a hash of that key. While Hubs contain Load Date Time and Record Source fields, these are for metadata purposes only (never part of the key) and represent the first time the key was observed and the first record source that delivered it.
- The Data Vault Ensemble context is modeled in Satellites. There can be several Satellites describing a single instance of a Hub. Satellites are designed by taking into consideration the type of data, the rate of change, and, in some cases, the source system (if leapfrogging is observed as a problem). All Satellites are pre-built for historical tracking. Satellites are the only construct capable of tracking history, as they are the only ones to have a two-part key (the parent key plus the Load Date Time field).
- The Data Vault Ensemble relationships are modeled in Links. Links are unique, specific, natural business relationships between CBCs (Ensembles). Link instances are driven only by the combination of foreign keys they manage. There is no other part of the key for any Link.
- Data platform (data warehouse) keys. The primary key of the Hub can be a surrogate such as a SID, the BK (concatenation) itself, or a hash of that key. Each team should choose which key to use, as there are pros and cons to each method. For example, only the surrogate (such as a SID) will leverage the database to maintain the integrity of an Ensemble (key dependency). Using the BK (by itself or in the form of a hash) means that the parts of the Ensemble each gain independent identities and so can exist by themselves.
- Data Vault and Ensemble models are designed from business models and not from sources. This is a methodology component; however, for the successful use of these patterns, the target should be a business target.
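
The key rules in the standards above (Hub, Satellite, Link) can be illustrated with a minimal Python sketch. It is not a reference implementation. It assumes the hash-of-business-key option for the Hub primary key, and the SHA-256 hash, the "||" delimiter, the trim/uppercase normalization, and all class, function, and field names are hypothetical choices made for illustration only.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

DELIMITER = "||"  # assumed separator for the concatenated multi-part key


def hash_key(*parts: str) -> str:
    """Hash a concatenated business key (normalization rules are assumed)."""
    normalized = DELIMITER.join(p.strip().upper() for p in parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Hub:
    hub_key: str         # primary key: here, a hash of the business key
    business_key: tuple  # multi-part business key as first delivered
    load_dts: datetime   # metadata only: first time the key was observed
    record_source: str   # metadata only: first source that delivered it


@dataclass(frozen=True)
class Link:
    hub_keys: tuple      # the combination of foreign keys is the whole key


def load_hub(hubs: dict, parts, load_dts: datetime, record_source: str) -> Hub:
    """Insert the Hub row only if the key is new; first-seen metadata is kept."""
    key = hash_key(*parts)
    if key not in hubs:
        hubs[key] = Hub(key, tuple(parts), load_dts, record_source)
    return hubs[key]


def load_satellite(sats: dict, hub_key: str, load_dts: datetime, attributes: dict) -> None:
    """Two-part key (parent key, Load Date Time): each load adds a history row."""
    sats[(hub_key, load_dts)] = dict(attributes)


def load_link(links: set, *hub_keys: str) -> None:
    """A Link instance is identified only by the foreign keys it combines."""
    links.add(Link(tuple(hub_keys)))


# Usage with made-up sample data
t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 2, 1, tzinfo=timezone.utc)
hubs, sats, links = {}, {}, set()

customer = load_hub(hubs, ("CRM", "C-42"), t1, "crm")
load_hub(hubs, ("crm", " c-42 "), t2, "billing")  # same key: metadata unchanged
order = load_hub(hubs, ("ERP", "O-1001"), t1, "erp")

load_satellite(sats, customer.hub_key, t1, {"name": "Ann"})
load_satellite(sats, customer.hub_key, t2, {"name": "Ann Lee"})  # new history row

load_link(links, customer.hub_key, order.hub_key)
load_link(links, customer.hub_key, order.hub_key)  # duplicate relationship: no new row
```

Because the Satellite and Link rows in this sketch carry the hashed business key rather than a database-generated SID, each part can be loaded without first looking up its Hub row, which is the independence trade-off described in the keys standard above.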