The Ultimate Guide to Data Warehouse Design


Mark Smallcombe

Data warehouses help you run logical queries, build accurate forecasting models, improve real-time data analysis, and identify trends impacting your organization.

But what goes into designing a data warehouse?

In short, here are the eight steps to data warehouse design:

  • Gather Requirements: Aligning the business goals and needs of different departments with the overall data warehouse project.
  • Set Up Environments: Create three environments for data warehouse development, testing, and production, each running on a separate server.
  • Data Modeling: Design the data warehouse schema, including the fact tables and dimension tables, to support the business requirements.
  • Develop Your ETL Process: ETL stands for Extract, Transform, and Load. This process is how data gets moved from its source into your warehouse.
  • OLAP Cube Design: Design OLAP cubes to support analysis and reporting requirements.
  • Reporting & Analysis: Developing and deploying the reporting and analytics tools that will be used to extract insights and knowledge from the data warehouse.
  • Optimize Queries: Optimizing queries ensures that the system can handle large amounts of data and respond quickly to queries.
  • Establish a Rollout Plan: Determine how the data warehouse will be introduced to the organization, which groups or individuals will have access to it, and how the data will be presented to these users.

Whether you choose to use a pre-built vendor solution or to start from scratch, you'll need some level of warehouse design to successfully adopt a new data warehouse and get more from your big data.

Table of Contents

  • What Is a Data Warehouse?
  • 8 Steps in Data Warehouse Design
  • How Integrate.io Can Help

What Is a Data Warehouse?

A data warehouse is a centralized repository where an organization can store substantial amounts of data from multiple source systems and locations. These sources and locations include:

  • Customer relationship management (CRM) systems
  • Enterprise resource planning (ERP) systems
  • Relational databases
  • Transactional databases

Essentially, data warehouses house all the essential data businesses need to run analyses and collect the valuable business insights reflected in that data. The data warehouse is the ultimate destination supporting business intelligence (BI) activities such as identifying trends and making smarter organizational decisions.

Some of the top benefits of a data warehouse include:

  • Consistency: Data warehouses compile data from multiple sources and clean it, creating consistency among all data.
  • Security: Data warehouses are nonvolatile, meaning loaded data isn't overwritten by day-to-day transactions, and keeping data in one governed place makes it easier to control who can access sensitive information.
  • Saves time: A data warehouse helps organizations and individual employees save time by retrieving data in seconds instead of querying multiple source systems.
  • Data governance: A data warehouse can make it easier to adhere to data governance regulations such as GDPR and CCPA than relying on legacy systems to safeguard sensitive data.
  • Removes data silos: A data warehouse removes existing silos in your organization, such as legacy systems in different departments unable to communicate.

Consider this example of a data warehouse in action: you’re trying to figure out the overall value of your leads in Salesforce. You could push your Salesforce data into your data warehouse, set up a schema, and run a query that tells you which marketing activities led to your highest-value prospects. The results of the query outline how to target those high-value prospects more easily with focused marketing materials, move them through your sales funnels, and — ultimately — increase revenue.
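Here's a rough sketch of that kind of query, using pandas on a toy extract. The column names (lead_source, amount, is_won) are illustrative assumptions, not your actual Salesforce schema:

```python
import pandas as pd

# Toy extract of closed opportunities pulled from the warehouse; in practice
# this would come from a SQL query against your Salesforce-sourced tables.
opportunities = pd.DataFrame({
    "lead_source": ["Webinar", "Paid Search", "Webinar", "Trade Show"],
    "amount":      [12000, 3000, 8000, 1500],
    "is_won":      [True, False, True, True],
})

# Which marketing activities produced the highest-value closed-won deals?
lead_value = (
    opportunities[opportunities["is_won"]]
    .groupby("lead_source")["amount"]
    .sum()
    .sort_values(ascending=False)
)
print(lead_value)
```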

Examples of data warehouses include Snowflake, Amazon Redshift, Microsoft Azure Synapse Analytics, and IBM Db2.

Here are the eight core steps that go into data warehouse design:

1. Defining Business Requirements (or Requirements Gathering)

Data warehouse design is a business-wide journey. Data warehouses touch all areas of your business, so every department needs to be on board with the design. Since your warehouse is only as powerful as the data it contains, aligning departmental needs and goals with the overall project is critical to your success.

If you currently can't combine all your sales data with all your marketing data, for example, your query results will be missing critical context. Knowing which leads turn into revenue is exactly the kind of insight that makes your marketing data more valuable.

Every department needs to understand the purpose of the data warehouse, how it benefits them, and what kinds of results they can expect from your warehousing solution.

This Requirements Gathering stage should focus on the following objectives:

  • Aligning departmental goals with the overall project
  • Determining the scope of the project in relation to business processes
  • Discovering your current and future needs by diving deep into your data (finding out what data is useful for analysis) and your current tech stack (where your data is currently siloed and not being used)
  • Creating a disaster recovery plan in the case of system failure
  • Thinking about each layer of security (e.g., threat detection, threat mitigation, identity controls, monitoring, risk reduction, etc.)
  • Anticipating compliance needs and mitigating regulatory risks

You can think of this as your overall data warehouse blueprint. But this phase is more about determining your business needs, aligning those to your data warehouse, and, most importantly, getting everyone on board with the data warehousing solution.

Related Reading:  What to Consider When Selecting a Data Warehouse for Your Business

2. Setting Up Your Physical Environments

Data warehouses typically have three primary physical environments — development, testing, and production. This mimics standard software development best practices, and your three environments exist on completely separate physical servers.

Why do you need three separate environments?

  • You need a way to test changes before they move into the production environment.
  • Some security best practices require that testers and developers never have access to production data.
  • Tests typically run against extreme data sets or random samples pulled from the production environment, and you need a dedicated server to execute these tests en masse.
  • Having a development environment is a necessity, and dev environments exist in a unique state of flux compared to production or test environments.
  • Production environments have much higher workloads (your whole business is using it), so trying to run tests or develop in that environment can be stressful for both team members and servers.
  • Data integrity is much easier to track, and issues are easier to contain, when you have three environments running. Tracking down issues puts less strain on your workloads, and data flow in the development and testing environments can be stalled without impacting end users.
  • Running tests can often introduce breakpoints and hang your entire server. That's not something you want happening in your production environment.
  • Imagine sharing resources between production, testing, and development. You don’t want that! Testing, development, and production environments all have different resource needs, and trying to combine all functions into one server can be catastrophic for performance.

Remember, BI development is an ongoing process that never really grinds to a halt. This is especially true with Agile/DevOps approaches to the software development lifecycle, which require separate environments to absorb the sheer volume of constant changes and adaptations.

You can choose to run more than these three environments, and some business users choose to add additional environments for specific business needs. Integrate.io has seen staging environments that are separate from testing solely for quality assurance work, as well as demo and integration environments specifically for testing integrations.

You should have these three core environments, but you can layer in additional settings to fit your unique business goals.

3. Data Warehouse Design: Introducing Data Modeling

Data modeling is the process of visualizing how data will be stored, organized, and related in your warehouse. Think of it as a blueprint. Before you start building a house, it's important to know what goes where and why it goes there. That's what data modeling is to data warehouses.

Data modeling helps you:

  • Visualize the relationships between data
  • Set standardized naming conventions
  • Create relationships between data sets
  • Establish compliance and security processes
  • Align your processes with your overarching IT goals

The above benefits of data modeling help improve decision-making throughout your organization.

However, data modeling is probably the most complex phase of data warehouse design, and there are multiple data modeling techniques businesses can choose from for warehouse design. Before jumping into a few of the most popular data modeling techniques, let's take a look at the differences between data warehouses and data marts:

A data warehouse is a system that stores data from across the business so you can run analytics and queries against it. A data mart, on the other hand, is an area within a data warehouse that stores data for a specific business function.

So, say you've built your entire data warehouse. That's great! But does it account for how different departments will use the data? Your sales team will use that data warehouse in a vastly different way than your legal team. Plus, certain workflows and data sets are only valuable to certain teams. Data marts are where all those team-specific data sets are stored, and related queries are processed.

Data modeling typically takes place at the data mart level and branches out into your data warehouse. It's the logic behind how you store certain data in relation to other data.

The three most popular data models for warehouses are:

  • Snowflake schema
  • Star schema
  • Galaxy schema

You should choose and develop a data model to guide your overall data architecture within your warehouse. The model you choose will impact the structure of your data warehouse and data marts — which impacts the ways that you utilize ETL tools  like Integrate.io and run queries on that data.
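To make that concrete, here is a minimal star schema sketch in Python using SQLite. The table and column names are illustrative only; your fact and dimension tables will reflect your own business processes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key   INTEGER PRIMARY KEY,   -- e.g. 20240115
    full_date  TEXT,
    month      TEXT,
    year       INTEGER
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
-- The fact table sits at the center of the star and references each dimension.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    revenue      REAL
);
""")

# A typical star-schema query: join the fact table to its dimensions and aggregate.
query = """
SELECT d.year, p.category, SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_date d    ON d.date_key = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY d.year, p.category
"""
print(conn.execute(query).fetchall())
```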

Related Reading: Snowflake Schema vs Star Schema

4. Choosing Your Extract, Transform, Load (ETL) Solution

ETL, or Extract, Transform, Load, is the process used to pull data out of your current tech stack or existing storage solutions and put it into your warehouse. It goes something like this:

  • You extract data from a source system and place it into a staging area.
  • You transform that data into the best format for data analytics. You also remove any duplicated data or inconsistencies that can make analysis difficult.
  • You then load the data to a data warehouse before pushing it through BI tools like Tableau and Looker.

Normally,  ETL is a complicated process that requires manual pipeline-building and lots of code. Building these pipelines can take weeks or even months and might require a data engineering team. That’s where ETL solutions come in. They automate many tasks associated with this data management and integration process, freeing up resources for your team.
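Conceptually, though, those three steps are simple enough to sketch in a few lines of Python. The file name, table name, and columns below are assumptions for illustration; a production pipeline would add logging, incremental loads, and error handling:

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a source system export.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: drop duplicates and normalize values for analysis.
    seen, cleaned = set(), []
    for row in rows:
        key = row["order_id"]
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((key, row["customer_email"].strip().lower(), float(row["amount"])))
    return cleaned

def load(rows, conn):
    # Load: write the cleaned rows into a warehouse staging table.
    conn.execute("CREATE TABLE IF NOT EXISTS stg_orders (order_id TEXT, customer_email TEXT, amount REAL)")
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    warehouse = sqlite3.connect("warehouse.db")   # stand-in for your warehouse connection
    load(transform(extract("orders.csv")), warehouse)
```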

You should pay careful attention to the ETL solution you use so you can improve business decisions. Since ETL is responsible for the bulk of the in-between work, choosing a subpar tool or developing a poor ETL process can break your entire warehouse. You want optimal speeds, high availability, good visualization, and the ability to build easy, replicable, and consistent data pipelines between all your existing architecture and your new warehouse.

This is where ETL tools like Integrate.io are valuable. Integrate.io creates hyper-visualized data pipelines between all your valuable tech architecture while cleaning and normalizing that data for compliance and ease of use.

Remember, a good ETL process can mean the difference between a slow, painful-to-use data warehouse and a simple, functional warehouse that's valuable throughout every layer of your organization.

ETL will likely be the go-to for pulling data from systems into your warehouse. Its counterpart, Extract, Load, Transform (ELT), negatively impacts the performance of most custom-built warehouses since data is loaded directly into the warehouse before data organization and cleansing occur. However, there might be other data integration use cases that suit the ELT process. Integrate.io not only executes ETL but can handle ELT, Reverse ETL, and Change Data Capture (CDC), as well as provide data observability and data warehouse insights.

Related Reading:  ETL vs ELT

5. Online Analytic Processing (OLAP) Cube

OLAP (Online Analytical Processing) cubes are commonly used in the data warehousing process to enable faster, more efficient analysis of large amounts of data. OLAP cubes are based on multidimensional databases that store summarized data and allow users to quickly analyze information from different dimensions. 

Here's how an OLAP cube fits into the data warehouse design:

  • OLAP cubes are designed to store pre-aggregated data that has been processed from various sources in a data warehouse. The data is organized into a multi-dimensional structure that enables users to view and analyze it from different perspectives.
  • OLAP cubes are created using a process called cube processing, which involves aggregating and storing data in a way that enables fast retrieval and analysis. Cube processing can be performed on a regular basis to ensure that the data is up-to-date and accurate.
  • OLAP cubes enable users to perform complex analytical queries on large volumes of data in real-time, making it easier to identify trends, patterns, and anomalies. Users can also slice and dice data in different ways to gain deeper insights into their business operations.
  • OLAP cubes support drill-down and roll-up operations, which allow users to navigate through different levels of data granularity. Users can drill down to the lowest level of detail to view individual transactions or roll up to higher levels of aggregation to view summary data.
  • OLAP cubes can be accessed using a variety of tools, including spreadsheets, reporting tools, and business intelligence platforms. Users can create reports and dashboards that display the data in a way that is meaningful to them.

You'll likely need to address OLAP cubes if you're designing your entire database from scratch, or if you're maintaining your own OLAP cube — which typically requires specialized personnel.

So, if you plan to use a vendor warehouse solution (e.g., Redshift or BigQuery ) you probably won't need an OLAP cube (cubes are rarely used in either of those solutions*.)

* Note: some vendor solutions will let you build OLAP cubes on top of Redshift or BigQuery data marts, but Integrate.io can't recommend any specific one, having not evaluated them firsthand.

If you have a set of BI tools requiring an OLAP cube for ad-hoc reporting, you may need to develop one or use a vendor solution.

OLAP Cubes vs. Data Warehouse

Here are the differences between a data warehouse and OLAP cubes:

A data warehouse is where you store your business data in an easily analyzable format to be used for a variety of business needs.

Online Analytic Processing cubes help you analyze the data in your data warehouse or data mart. Most of the time, OLAP cubes are used for reporting, but they have plenty of other use cases.

Since your data warehouse will have data coming in from multiple data pipelines, OLAP cubes help you organize all that data in a multi-dimensional format that makes analyzing it rapid and straightforward. OLAP cubes are a critical component of data warehouse design because they provide fast and efficient access to large volumes of data, enabling users to make informed business decisions based on insights derived from the data. 
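Here's a small pandas sketch of the roll-up and drill-down behavior a cube provides, using made-up sales figures:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 150, 90, 120],
})

# Drill down: revenue at the finest grain available (region x quarter).
detail = sales.groupby(["region", "quarter"])["revenue"].sum()

# Roll up: collapse the quarter dimension to get totals per region.
by_region = sales.groupby("region")["revenue"].sum()

print(detail, by_region, sep="\n\n")
```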

You may require custom-built OLAP cubes, or you may need to hire support to help you maintain your cubes.

These resources on OLAP cubes can help you dig deeper:

Overview of Service Manager OLAP cubes for advanced analytics  (Microsoft)

OLAP Cubes  ( OLAP.com )

Understanding Cubes  (Oracle)

6. Data Warehouse Design: Creating the Front End

So far, this guide has only covered back-end processes. There needs to be front-end visualization, so users can immediately understand and apply the results of data queries.

That's the job of your front end. There are plenty of tools on the market that help with visualization. BI tools like Tableau (or Power BI for those using BigQuery) are great for visualization. You can also develop a custom solution — though that's a significant undertaking.

Most small-to-medium-sized businesses lean on established BI kits like those mentioned above. But, some businesses may need to develop their own BI tools to meet ad-hoc analytic needs. For example, a Sales Ops manager at a large company may need a specific BI tool for territory strategies. This tool would probably be custom-developed given the scope of the company’s sales objectives.

You should pay keen attention to reporting during this stage. How often does reporting need to be done? Do you need each person to create their own reports? Questions like these should guide you to a BI toolkit that fits your unique requirements.

Pro-tip : Keep it simple. Your employees don't care about most of the fancy features or deep complexities. They just want something that works for them and makes their lives easier.

7. Optimizing Queries

Optimizing queries is a critical part of data warehouse design. One of the primary goals of building a data warehouse is to provide fast and efficient access to data for decision-making. During the design process, data architects need to consider the types of queries that users will be running and design the data warehouse schema and indexing accordingly. 

Optimizing your queries is a complex process that's hyper-unique to your specific needs. But there are some general rules of thumb.

We heavily recommend the following during database design:

Ensure your production, testing, and development environments have mirrored resources. This mirroring prevents the server from hanging when you push projects from one environment to the next.

Try to minimize data retrieval. Don't run SELECT on the whole database if you only need a column of results. Instead, run your SELECT query by targeting specific columns. This is especially important if you're paying for your query power separately.
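Here's a quick illustration with made-up table and column names. The cost difference matters most on a columnar warehouse, where scanning fewer columns means scanning, and paying for, less data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL, notes TEXT)")

wasteful = "SELECT * FROM orders"                 # pulls every column, including wide text fields
targeted = "SELECT order_id, amount FROM orders"  # only the columns the report actually needs

print(conn.execute(targeted).fetchall())
```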

Understand the limitations of your OLAP vendor. BigQuery uses its own SQL dialect, and Redshift is built on a PostgreSQL fork. Knowing the little nuances baked into your vendor can help you maximize workflows and speed up queries.

8. Establishing a Rollout Plan

Once you're ready to launch your warehouse, it's time to start thinking about education, training, and use cases. Most of the time, it will be a week or two before your end-users start seeing any functionality from that warehouse (at least at scale). But they should be adequately trained in its use before the rollout is completed.

A rollout plan typically includes the following steps:

  • Identifying the target audience: This involves determining which groups or individuals within the organization will benefit from using the data warehouse.
  • Determining the data requirements : This involves identifying the types of data that the target audience needs access to and ensuring that this data is available within the data warehouse.
  • Developing user-friendly interfaces: This involves creating user interfaces that are intuitive and easy to use, and that provide users with the ability to interact with the data in meaningful ways.
  • Testing and refining: This involves conducting user testing to ensure that the data warehouse meets the needs of its users, and making adjustments as necessary.
  • Training users: This involves providing training and support to users to help them understand how to use the data warehouse effectively.
  • Deploying the data warehouse: This involves introducing the data warehouse to its intended users, and ensuring that the rollout process goes smoothly.

By establishing a rollout plan, organizations can ensure that their data warehouse is introduced effectively and that users are able to make the most of the valuable data that it contains.

Congratulations! You're ready to embark on your data warehouse design!

These are the core components of data warehouse design. But remember, your business may have different steps that aren't included in this list. Every data warehouse is different.

Integrate.io can simplify data warehouse integration by moving data from a source to a warehouse such as Redshift or Snowflake without lots of code or data engineering. This data pipeline platform’s pre-built native connectors simplify the data integration process so you can focus on other tasks in your business.

Integrate.io’s philosophy is to streamline data integration and make it easier to move data between locations.  Schedule a demo  now!



15 Data Warehouse Project Ideas for Practice with Source Code

Learn how data is processed into data warehouses by gaining hands-on experience on these fantastic solved end-to-end real-time data warehouse projects.


The worldwide data warehousing market is expected to be worth more than $30 billion by 2025. Data warehousing and analytics will play a significant role in a company’s future growth and profitability. Data warehouse solutions give every business a considerable advantage by evaluating all of the data they collect and making better decisions. Understanding business data helps make intelligent business decisions that determine whether an organization succeeds or fails. The demand for Big Data and Data Analytics will continue to grow in the coming years, leading to a greater need for data warehouse solutions.


It’s essential to understand why data warehousing projects fail before getting an idea of the different data warehousing projects to explore from beginner to advanced level in your learning path. So let's get started!

Table of Contents

  • What Is Data Warehousing?
  • Why Data Warehouse Projects Fail
  • Data Warehouse Projects for Beginners
  • Data Warehouse Projects for Intermediate
  • Data Warehouse Projects for Advanced
  • Data Warehouse Project Tools

What Is Data Warehousing?

Data warehousing (DW) is a technique of gathering and analyzing data from many sources to get valuable business insights. Typically, a data warehouse integrates and analyzes business data from many sources. The data warehouse is the basis of the business intelligence (BI) system, which can analyze and report on data.


In other words, data warehousing supports a set of frameworks and tools that help businesses organize, understand, and use their data to make strategic decisions.


The significant roadblocks leading to data warehousing project failures include disconnected data silos, delayed data warehouse loading, time-consuming data preparation processes, a lack of automation for core data management tasks, and inadequate communication between business units and the tech team.

Delayed Data Warehouse Loading

 Data must first be prepared and cleaned before being placed into the warehouse. Cleaning data is typically time-consuming, so this creates an immediate crisis. IT professionals are often disappointed by the time spent preparing data for loading. The ability of enterprises to quickly move and combine their data is the primary concern. Movement and ease of access to data are essential to generating any form of insight or business value. This often exhausts an organization's time and resources, resulting in a more protracted and expensive project in the end. Furthermore, poor data loading might result in various issues, including inaccuracies and data duplication.

Lower End-User Acceptance Rate

End-user acceptability is another factor that frequently leads to the failure of data warehouse projects. New technologies can be fascinating, but humans are afraid of change, and acceptance may not always be the case. Any project's success depends on how well people are mutually supportive. The first step in encouraging user acceptance and engagement is to create a data-driven mindset. End users should be encouraged to pursue their data-related interests. Non-technical users will benefit from self-service analytics because it will make it easier to access information fast. These transitional efforts will aid the success and utilization of your data warehouse in the long run and lead to better decision-making throughout the organization.

Lack of Automation of Core Data Management Activities

If you carry out a process manually instead of automating it, you invest valuable time, resources, and money and miss business opportunities. Automating manual, time-consuming operations helps you save money while shortening the time it takes to see results. Automation can accelerate all data management and data warehousing steps, including data collection, preparation, and analysis.


15 Data Warehouse Project Ideas for Practice

This section will cover 15 unique and interesting data warehouse project ideas ranging from beginner to advanced levels.


From Beginner to Advanced level, you will find some data warehouse projects with source code, some Snowflake data warehouse projects, some others based on Google Cloud Platform (GCP), etc.


Snowflake Real-time Data Warehouse Project


In this Snowflake Data Warehousing Project, you'll learn how to deploy the Snowflake architecture to build a data warehouse in the cloud. This project will guide you on loading data via the web interface, SnowSQL, or Cloud Provider. You will use Snowpipe to stream data and QuickSight for data visualization .

Source code- Snowflake Real-time Data Warehouse Project  

Slowly Changing Dimensions Implementation using Snowflake

This project depicts the usage of Snowflake Data Warehouse to implement several SCDs. Snowflake offers various services that help create an effective data warehouse with ETL capabilities and support for various external data sources. For this project, use Python's faker library to generate user records (the user's name plus the current system time) and save them as CSV files. NiFi collects the data and sends it to Amazon S3. New data landing in S3 is loaded into a staging table using Snowpipe. Snowflake streams capture the data manipulation language changes on the staging table so you can determine which operation to apply, and tasks and stored procedures are then triggered to implement SCD Type-1 and Type-2.
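As a rough illustration of the SCD Type-2 pattern (not the project's actual Snowflake code), here is a minimal local sketch on SQLite with made-up table and column names. In Snowflake, the same expire-then-insert logic would run inside tasks fed by streams:

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension with SCD Type-2 metadata (real projects also add a surrogate key).
cur.execute("CREATE TABLE dim_user (user_id INTEGER, city TEXT, valid_from TEXT, valid_to TEXT, is_current INTEGER)")
cur.execute("CREATE TABLE stg_user (user_id INTEGER, city TEXT)")

cur.execute("INSERT INTO dim_user VALUES (1, 'Austin', '2023-01-01', '9999-12-31', 1)")
cur.execute("INSERT INTO stg_user VALUES (1, 'Denver')")   # changed attribute
cur.execute("INSERT INTO stg_user VALUES (2, 'Boston')")   # brand-new user

today = date.today().isoformat()

# Step 1: expire current rows whose attributes changed in staging.
cur.execute("""
UPDATE dim_user
SET valid_to = ?, is_current = 0
WHERE is_current = 1
  AND EXISTS (SELECT 1 FROM stg_user s
              WHERE s.user_id = dim_user.user_id AND s.city <> dim_user.city)
""", (today,))

# Step 2: insert a fresh current version for changed and brand-new users.
cur.execute("""
INSERT INTO dim_user (user_id, city, valid_from, valid_to, is_current)
SELECT s.user_id, s.city, ?, '9999-12-31', 1
FROM stg_user s
LEFT JOIN dim_user d ON d.user_id = s.user_id AND d.is_current = 1
WHERE d.user_id IS NULL
""", (today,))

for row in cur.execute("SELECT * FROM dim_user ORDER BY user_id, valid_from"):
    print(row)
```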

Source Code- Slowly Changing Dimensions Implementation using Snowflake


Fraud Detection using PaySim Financial Dataset

In today's world of electronic monetary transactions, detecting fraudulent transactions is a significant business use case. Because real transaction data is rarely shared, the PaySim simulator is used to create a synthetic dataset, available on Kaggle. The data contains transaction specifics such as the transaction type, the transaction amount, the client initiating the transaction, the old and new balances (before and after the transaction) for both the origin and destination accounts, and the target label indicating whether the transaction is fraudulent. This data warehouse project uses the PaySim dataset to create a data warehouse and a classification model based on transaction data for detecting fraudulent transactions.
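As a hedged sketch of the classification-model half of the project, here is a minimal scikit-learn baseline. The file name is a placeholder, and the column names follow the public PaySim dataset on Kaggle; adjust them to your copy:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("paysim.csv")  # placeholder path to the PaySim extract

# One-hot encode the transaction type; keep a few numeric balance features.
features = pd.get_dummies(df[["type", "amount", "oldbalanceOrg", "newbalanceOrig"]],
                          columns=["type"])
target = df["isFraud"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, stratify=target, random_state=42)

model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```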

Source Code- Fraud Detection using PaySim Financial Dataset

Anime Recommendation System Data Warehouse Project

The anime recommendation system is one of the most popular data warehousing project ideas. Use the Anime dataset on Kaggle, which contains data on user preferences for 12,294 anime from 73,516 people. Each user can add anime to their completed list and give it a rating. The project aims to develop an effective anime recommendation system based on users' viewing history. Use the Anime dataset to build a data warehouse for data analysis. Once the data has been collected and analyzed, it becomes ready for building the recommendation system.

Source code- Anime Recommendation System Data Warehouse Project

Marketing Data Warehouse for Media Research Company

Marketing data often ends up diffused across various systems within an organization, such as customer relationship management and sales systems.

Create a marketing data warehouse for this project, which will serve as a single source of data for the marketing team to work with. You can also combine internal and external data like web analytics tools, advertising channels, and CRM platforms. Use the Nielsen Media Research company dataset for building this data warehouse. All marketers will access the same standardized data due to the data warehouse, allowing them to execute faster and more efficient projects. Such data warehouses enable organizations to understand performance measures, including ROI, lead attribution, and client acquisition costs.

Source Code- Marketing Data Warehouse for Banking Dataset

Data Warehouse Design for E-commerce Environments

You will be constructing a data warehouse for a retail store in this big data project, concentrating on a few particular questions about pricing optimization and inventory allocation in its design and implementation. In this Hive project, you'll attempt to answer the following two questions:

Were the higher-priced items more prevalent in some markets?

Should inventory be reallocated or prices adjusted based on location?

Source Code- Data Warehouse Design for E-commerce Environments


Data Warehouse Project for Music Data Analysis

This project involves creating an ETL pipeline that collects song data from an S3 bucket and modifies it for analysis. It makes use of JSON-formatted datasets acquired from the S3 bucket. The project builds a Redshift database in the cluster with staging tables that hold all the data imported from S3. Log data and song data are the two datasets used in the project. The song_data dataset is a part of the Million Song Dataset, and the log_data dataset contains log files generated based on the songs in song_data. Data analysts can use business analytics and visualization software to better understand which songs are most popular on the app.
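As a hedged sketch of the load step, the COPY below pulls JSON from S3 into a Redshift staging table via psycopg2. The bucket path, IAM role ARN, table name, and connection details are placeholders, and the log dataset typically needs a JSONPaths file instead of 'auto':

```python
import psycopg2

# Placeholder connection details -- substitute your own cluster endpoint and credentials.
conn = psycopg2.connect(host="<redshift-endpoint>", dbname="dev",
                        user="<user>", password="<password>", port=5439)
cur = conn.cursor()

copy_songs = """
COPY staging_songs
FROM 's3://<your-bucket>/song_data'
IAM_ROLE '<your-redshift-iam-role-arn>'
JSON 'auto'
REGION 'us-west-2';
"""
cur.execute(copy_songs)
conn.commit()
```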

Source Code- Data Warehouse Project for Music Data Analysis

Global Sales Data Warehouse Project

The primary goal of this Global Sales Data Warehouse project is to minimize raw material manufacturing costs and enhance sales forecasting by identifying critical criteria such as total sales revenue on a monthly and quarterly basis by region and sale amount. The Data Warehousing Project focuses on assessing the entire business process. The data warehouse provides essential information such as daily income, weekly revenue, monthly revenue, total sales, goals, staff information, and vision.

Source Code- Sales Data Warehouse Project  

Data Warehouse Project for B2B Trading Company

This project aims to employ dimensional modeling techniques to build a data warehouse. Determine the business requirements and create a data warehouse design schema to meet those objectives. Using SSRS and R, create reports using data from sources. Based on the data warehouse, create an XML schema. Use Neo4j technologies to design a data warehouse section as a graph database.

Source Code- Data Warehouse Project for B2B Trading Company

Heart Disease Prediction using Data Warehousing

One of the most commonly seen diseases today is heart disease. In this data warehousing project, you'll learn how to create a system that can determine whether or not a patient has heart disease. The data warehouse assists in correlating clinical and financial records to estimate the cost-effectiveness of care. Data mining techniques aid in identifying data trends that may anticipate future individual heart-related issues. Furthermore, the data warehouse aids in the identification of individuals who are unlikely to respond well to various procedures and surgeries.

Source Code- Heart Disease Prediction using Data Warehousing


GCP Data Ingestion using Google Cloud Dataflow

This project builds a data ingestion and processing pipeline on Google Cloud Platform with both real-time streaming and batch loading, using the Yelp dataset, which is primarily used for academic and research purposes. You first create a GCP service account and download the Google Cloud SDK, then download the Python program and its dependencies and connect them to the GCP account. The pipeline downloads the Yelp dataset in JSON format, connects to Cloud Storage through the Cloud SDK, and connects to Cloud Composer. It publishes the Yelp dataset JSON stream to a Pub/Sub topic, and the Cloud Composer and Pub/Sub outputs connect to Google Dataflow using Apache Beam. Lastly, Google Data Studio is used to visualize the data.

Source Code-   GCP Data Ingestion using Google Cloud Dataflow


Build Data Pipeline using Dataflow, Apache Beam, Python

This is yet another intriguing GCP project that uses PubSub, Compute Engine, Cloud Storage, and BigQuery. We will primarily explore GCP Dataflow with Apache Beam in this project. The two critical phases of the project are-

Reading JSON encoded messages from the GCS file, altering the message data, and storing the results to BigQuery.

Reading JSON-encoded Pub/Sub messages, processing the data, and uploading the results to BigQuery.
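As a rough illustration of the first phase, here is a minimal Apache Beam sketch that reads JSON lines from a GCS path, reshapes each record, and writes the results to BigQuery. The bucket, table, schema, and field names are placeholders rather than the project's actual resources:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_and_reshape(line):
    # Keep only the fields the downstream table needs (names are illustrative).
    record = json.loads(line)
    return {"user_id": record.get("user_id"), "amount": float(record.get("amount", 0))}

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "ReadJsonLines" >> beam.io.ReadFromText("gs://my-bucket/input/*.json")
     | "Transform" >> beam.Map(parse_and_reshape)
     | "WriteToBQ" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.my_table",
           schema="user_id:STRING,amount:FLOAT",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```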

Source Code- Build Data Pipeline using Dataflow, Apache Beam, Python

GCP Project to Learn using BigQuery for Exploring Data

In this next advanced-level project, we will mainly focus on GCP BigQuery. This project will teach you about Google Cloud BigQuery and how to use managed tables and external tables. You'll learn how to leverage Google Cloud BigQuery to explore and prepare data for analysis and transformation. It also covers the concepts of partitioning and clustering in BigQuery. The project necessitates using BQ CLI commands, creating an external BigQuery table from a GCS bucket, and using the client API to load BigQuery tables.
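As a hedged illustration of partitioning and clustering (with placeholder project, dataset, and table names), the DDL below creates a table partitioned by event date and clustered by common filter columns, run through the google-cloud-bigquery client:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

ddl = """
CREATE TABLE IF NOT EXISTS `my_project.analytics.events`
(
  event_ts   TIMESTAMP,
  user_id    STRING,
  event_name STRING
)
PARTITION BY DATE(event_ts)      -- prune partitions by event date
CLUSTER BY user_id, event_name   -- co-locate rows for common filters
"""
client.query(ddl).result()
```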

Source Code- GCP Project to Learn using BigQuery for Exploring Data

Anomaly Detection in IoT-based Security System

IoT devices, or network-connected devices like security cameras, produce vast amounts of data that you may analyze to improve workflow. Data is collected and stored in relational formats to facilitate historical and real-time analysis. Then, using existing data, instant queries are run against millions of events or devices to find real-time abnormalities or predict occurrences and patterns. For this project idea, create a data warehouse that will help this data be consolidated and filtered into fact tables to provide time-trended reports and other metrics.

Source Code- Anomaly Detection in IoT-based Security System

AWS Snowflake Data Pipeline using Kinesis and Airflow

This project will show you how to create a Snowflake data pipeline that connects EC2 logs to Snowflake storage and S3 post-transformation and processing using Airflow DAGs. In this project, customer and order data is processed and transformed via Airflow DAGs, staged in S3, and sent to Snowflake. You'll learn how to set up Snowflake stages and create a database in Snowflake.

Source Code- AWS Snowflake Data Pipeline using Kinesis and Airflow

Data warehousing optimizes ease of access, reduces query response times, and enables businesses to gain deeper insights from large volumes of data. Previously, building a data warehouse required a significant investment in infrastructure. The introduction of cloud technology has drastically cut the cost of data warehousing for enterprises.

There are various cloud-based data warehousing tools now available in the market. These tools provide high speed, high scalability, pay-per-use, etc. Since choosing the best Data Warehouse tool for your project can often seem challenging, we have curated a list of the most popular Data Warehouse project tools with their essential features-


Microsoft Azure

Microsoft's Azure Synapse Analytics (formerly Azure SQL Data Warehouse) is a cloud-based relational data warehouse. Microsoft Azure allows developers to create, test, deploy, and manage applications and services using Microsoft-managed data centers. The platform is based on nodes and uses massively parallel processing (MPP), a design well suited to optimizing concurrent queries. As a result, you can extract and visualize business information considerably more quickly. Azure is a public cloud computing platform that provides IaaS, PaaS, and SaaS, among other services.


Snowflake

Snowflake is a cloud-based data warehousing platform that runs on Amazon Web Services (AWS) or Microsoft Azure cloud architecture. You can use Snowflake to create an enterprise-grade cloud data warehouse. You can use the tool to gather and analyze data from both structured and unstructured sources. It uses SQL to perform data blending, analysis, and transformations on various data structures. Snowflake provides scalable, dynamic computing power at per-usage cost, and it enables you to scale CPU resources to follow user activity.

Google BigQuery

BigQuery is a cost-efficient serverless data warehouse with built-in machine learning features. It's a platform for ANSI SQL querying. Google BigQuery is a data analysis tool that allows you to process read-only data sets in the cloud, using familiar SQL syntax to analyze data with billions of rows. You can use it in conjunction with Cloud ML and TensorFlow to build robust AI models. It can also run real-time analytics queries on vast amounts of data in seconds. This cloud-native data warehouse supports geospatial analytics.


Amazon Redshift

Amazon Redshift is a cloud-based, fully managed data warehouse that can process vast amounts of data in seconds, making it well suited to high-speed data analytics. Because it is a relational database management system (RDBMS), you can use it with other RDBMS applications. Using SQL-based clients and business intelligence (BI) tools over standard ODBC and JDBC connections, Amazon Redshift enables fast querying of structured data. Redshift also supports concurrency scaling, automatically scaling query processing resources up or down to match workload demand, and you can resize your cluster or switch between node types. As a result, you can improve data warehouse performance while lowering operational costs.

Start Building Data Warehousing Projects to Get You a Real-World Data Job

As organizations explore new opportunities and products, data warehouses play a vital role in the process. They're rapidly evolving, and cloud data warehouses in particular are becoming popular among businesses. They assist companies in streamlining operations and gaining visibility across all areas. Furthermore, cloud data warehouses assist businesses in better serving their clients and expanding their market potential. This makes it even more crucial for data engineers to enhance their data warehousing skills and knowledge to stay ahead of the competition. If we’ve whetted your appetite for more hands-on real-time data warehouse project ideas, we recommend checking out ProjectPro for Solved End-To-End Big Data and Data Warehousing Projects.

FAQs on Data Warehousing Projects

What is ETL in a data warehouse?

ETL (extract, transform, and load) is a data integration process that integrates data from several sources into a single, reliable data store that is then loaded into a data warehouse or other destination system. 

How to define business objectives for data warehousing projects?

For any data warehousing project, here are a few things you must keep in mind-

  • the scope of the project,
  • a data recovery plan,
  • compliance needs and regulatory risks,
  • the data warehouse's availability in production,
  • plan for future and current needs, etc.


About the Author


Daivi is a highly skilled Technical Content Analyst with over a year of experience at ProjectPro. She is passionate about exploring various technology domains and enjoys staying up-to-date with industry trends and developments. Daivi is known for her excellent research skills and ability to distill



Data Warehousing Fundamentals for IT Professionals by


ANSWERS TO SELECTED EXERCISES

CHAPTER 1 THE COMPELLING NEED FOR DATA WAREHOUSING

1 Match the columns

1-D, 2-G, 3-I, 4-F, 5-H, 6-A, 7-J, 8-E, 9-B, 10-C

CHAPTER 2 DATA WAREHOUSE: THE BUILDING BLOCKS

1 Match the columns

1-H, 2-G, 3-J, 4-F, 5-B, 6-I, 7-C, 8-A, 9-E, 10-D

CHAPTER 3 TRENDS IN DATA WAREHOUSING

1 Indicate if true or false

A-T, B-F, C-T, D-F, E-F, F-T, G-F, H-T, I-F, J-F

CHAPTER 4 PLANNING AND PROJECT MANAGEMENT

1-D, 2-E, 3-G, 4-H, 5-F, 6-I, 7-J, 8-A, 9-C, 10-B

CHAPTER 5 DEFINING THE BUSINESS REQUIREMENTS

A-F, B-T, C-F, D-T, E-T, F-F, G-F, H-F, I-F, J-T

CHAPTER 6 REQUIREMENTS AS THE DRIVING FORCE FOR DATA WAREHOUSING

1-H, 2-F, 3-J, 4-A, 5-I, 6-B, 7-E, 8-D, 9-G, 10-C

CHAPTER 7 THE ARCHITECTURAL COMPONENTS

A-F, B-T, C-F, D-F, E-F, F-F, G-T, H-F, I-T, J-F

CHAPTER 8 INFRASTRUCTURE AS THE FOUNDATION FOR DATA WAREHOUSING

1-F, 2-D, 3-J, 4-A, 5-G, 6-E, 7-B, 8-C, 9-H, 10-I

CHAPTER 9 THE SIGNIFICANT ROLE OF METADATA

A-F, B-T, C-F, D-F, E-T, F-T, G-F, H-T, I-T, J-F

CHAPTER 10 PRINCIPLES OF DIMENSIONAL MODELING

1-F, 2-I, 3-G, 4-A, 5-J, 6-B, 7-E, 8-C, 9-D, 10-H

CHAPTER 11 DIMENSIONAL MODELING: ADVANCED TOPICS

A-T, B-F, C-T, D-F, E-T, F-F, G-T, H-T, I-F, J-T

CHAPTER 12 DATA EXTRACTION, TRANSFORMATION, AND LOADING

1-E, 2-G, 3-B, 4-F, 5-J, 6-A, 7-C, 8-H, 9-D, 10-I

CHAPTER 13 DATA ...



15+ Data Warehouse Interview Questions & Answers to Prepare For


Preparing for that make-or-break assessment? Brush up with these data warehouse interview questions and answers to land that dream job!

Information technology lets businesses retrieve important data efficiently, which is vital in today's competitive market. Data warehouses have become increasingly important for storing information about a company's daily operations.

This knowledge and data are valuable: they feed the decision-making that can determine whether a company succeeds. That's why it's so critical to know what a data warehouse is.

I’ve compiled a list of often-asked questions to aid you in your job interview. This article is divided into:

  • Basic Data Warehouse Interview Questions
  • Intermediate Data Warehouse Interview Questions
  • Advanced Data Warehouse Interview Questions


Basic Data Warehouse Interview Questions

1. What is a data warehouse?

A data warehouse is where you'd go, for example, to keep track of a bank's activity records. The concept emerged from academic institutions in the 1980s as a way to inspire corporate data systems, and it has developed over time into a foundation of various BI systems.

A data warehouse makes it possible to organize and centralize data, making it simpler to develop reports, analyze vast volumes of data, and extract insights that may be crucial to a firm's decision-making. The system integrates information from many systems scattered across an organization, so that queries, reports, and analyses for decision-making can be arranged in many tiers.

2. What’s the difference between a data warehouse and data mining?

The difference between a data warehouse and data mining is a basic one to understand, and the candidate should be able to explain each with ease when asked by the interviewer.

What is data mining?

Data mining is a technique for uncovering information that you didn't know you had. Standard queries can only return information you already know to ask for; data mining, on the other hand, lets you uncover previously unknown information and extract useful knowledge from databases.

Also known as KDD (knowledge discovery in databases), data mining surfaces patterns and relationships, such as connections between the characteristics of different objects. One common output is a pattern: a regular, easy-to-follow sequence in the data.

As a quick recap, the first stage of data mining on a large dataset is selection. During preprocessing, inconsistencies are removed from the data. Next, the data is transformed into a format suitable for mining. Data mining techniques are then applied to analyze the data, and finally the results are interpreted and evaluated to identify patterns and relationships in the information.

What is a data warehouse?

A data warehouse is a central place where information gathered from diverse sources is kept in a single, uniform structure. Data is first obtained from a variety of corporate sources, then cleansed, converted, and stored in the data warehouse, where it can be accessed for an extended length of time.

Data modeling, data collection, data administration, metadata management, and development tools are all part of the data warehouse ecosystem. These technologies help with data extraction, data transformation, data storage, and user interfaces, among many other things.

Main difference

To distinguish between the two: data mining is the process of extracting useful information from a large database or warehouse, while data warehousing is about consolidating and storing that data in one place. Because a data warehouse houses all the data in one place, data mining against it can be done more quickly and effectively.

Read More : 8 Common Interview Mistakes Remote Software Developers Make

3. What advantages do data warehouses provide?

A data warehouse (DW), like any business tool, has its advantages and disadvantages compared to other alternatives. The candidate must demonstrate this knowledge and be able to explain at what point a company would need each kind of solution if the interviewer asks.

For those who wish to implement business intelligence initiatives, a DW is a necessary tool. However, it also has additional benefits. Here are a few examples:

  • Data centralization : When a corporation relies on a DW, all of its data is housed in a single location. Data can be found more rapidly in this manner, and data discrepancies may be eliminated. The organization also faces the danger of missing out on all the data that is being created all the time without this technology.
  • Analysis made easier : Data warehouses also make the process of evaluating databases simpler since they give a larger picture of the information set. This avoids errors and gives data that may be used to identify patterns and make strategic decisions.
  • Simplified access : Performing this data analysis often necessitates the involvement of many members of the company’s staff. As a result, it is critical that access be made as simple as possible. Allowing quick access to authorized individuals is another benefit of DW.
  • Reporting : Another benefit is that reports may be generated without the assistance of programmers with specific knowledge. Because Data Warehouse products help non-technical users manipulate and produce consistent results, the workflow is made easier by this fact.

4. What is an aggregate table?

Aggregation is the collection and presentation in summary form of data for statistical analysis and the achievement of commercial goals. If your candidate wants a complete picture of your company’s operations, they need to be able to look through huge amounts of raw data.

Your candidate should also be familiar with the term “data aggregation.” In the context of statistical analysis, data aggregation refers to the collection and presentation of data in a summary form. The most common reason for data aggregation is to learn more about certain demographics, such as age, occupation, or income.

What are the advantages of constructing aggregate or roll-up tables in this regard?

For the queries you still have, you may pre-aggregate your older data using rollup tables. You may then discard or roll off older data to slower storage, saving space and computational resources. You no longer need to keep all of the older data.
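As a minimal sketch of the idea (table and column names are made up), the snippet below pre-aggregates daily page views into a monthly rollup table that can keep answering trend queries after the detail rows are archived:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_views (view_date TEXT, page TEXT, views INTEGER);
INSERT INTO page_views VALUES
  ('2024-01-03', '/home', 120), ('2024-01-17', '/home', 80),
  ('2024-02-05', '/pricing', 45);

-- The rollup keeps one row per month and page instead of one row per event.
CREATE TABLE page_views_monthly AS
SELECT substr(view_date, 1, 7) AS month, page, SUM(views) AS total_views
FROM page_views
GROUP BY month, page;
""")
print(conn.execute("SELECT * FROM page_views_monthly ORDER BY month").fetchall())
```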

Read More : 8 Behavioral Interview Questions Asked by Top Tech Companies

5. What do you understand about Star Schema?

A clear understanding of the Star Schema is essential in developing a data warehouse, as it involves building complex databases as well. The candidate should elaborate on the advantages and disadvantages of this approach.

As part of his vision for database modeling for decision-support systems, Dr. Ralph Kimball came up with the star schema idea. Performance is boosted because the data is denormalized, with a fair amount of duplication.

A data warehouse’s star schema is one of many data modeling approaches.

Dimensional tables are connected to fact tables to model the data. For each event, the dimensional tables keep track of its specifics. Events that have happened and their accompanying dimensional table attributes are stored in the fact table.

The name was chosen because of the model's resemblance to a star. The fact table sits at the "heart" of the star, surrounded by auxiliary tables referred to as dimensions. Each dimension table connects to the fact table through a single join, and queries typically join the fact table to several dimensions at once. The main characteristics are:

  • an all-in-one fact table that contains all of the relevant information;
  • each dimension has its own table;
  • only one primary key exists for each dimension in the fact table;
  • each key is generated.

Each dimension is represented by an individual, highly denormalized table.

Advantages

Many people like the star schema because it is easy to understand, involves fewer joins, and needs less upkeep.

Disadvantages

There is a lot of redundancy in dimension tables since they are not normalized. A simple change, like renaming a country, might require updating a large number of rows (every municipality in that country), making the star schema poorly suited to transactional workloads.

6. Explain what dimension tables and fact tables are.

Fact and dimension tables are extremely important components when building a data warehouse. Knowing what is recommended and what to avoid is also a good idea when the interviewer asks the candidate.

When creating a data warehouse, fact and dimension tables are the two most important tables. The fact table holds the measure columns and is connected to the dimension tables through surrogate keys.

Measure columns are the data that you maintain in order to measure business facts. Measure columns in a data warehouse could include things like sales revenue or shipment volume. The dimension characteristics are used to examine these metrics. Because of this, a fact table should have surrogate keys for dimension tables.

Text columns should be avoided in fact tables. With regard to fact tables, for example, status columns are common. Ultimately, this method will result in many status columns in the fact table. Data warehouse designers might do better by moving all status information into a “junk-dimension” and using another surrogate key to link combinations with fact tables.

However, there may be instances where columns like “Order number” must be stored in the fact tables. For debugging reasons, these columns should be placed in the fact table, rather than the audit table.

Read More : 10+ Tips for Preparing for a Remote Software Developer Zoom Interview



1. What are the advantages of having a cloud-based data warehouse?

Companies take different approaches to their storage architecture: on-premises and cloud-based data warehouses each have advantages and disadvantages. The candidate should demonstrate knowledge of the decision process for choosing the most suitable tools.

With on-premises storage, analytics and business intelligence (BI) activities are less efficient than they are with cloud data warehouses. On-premises systems have the added drawback of having restricted scalability and higher upgrade expenses.

Relying on a legacy data warehouse also puts you at risk in a number of areas, including security and compliance. Cloud-based data warehousing, on the other hand, can solve many of these problems for forward-thinking businesses.

Traditional data storage architectures can no longer keep up with the increasing demand for analytics. A forecasted $3.5 billion market for cloud data warehousing suggests that conventional data warehouses are not giving enterprises the speed, scalability, and agility they are searching for.

Data warehouses hosted in the cloud tend to be quicker to set up than those hosted on-premises. Because they were built with the ever-changing needs of enterprises in mind, cloud-based data warehouses support compliance and may be more secure than legacy data warehouses. Companies no longer have to choose between on-site and off-site storage options when they need to manage a lot of data; cloud DWHs have changed that.

User-friendliness

Accessibility and adaptability are prized in today’s workplaces. But how true can those claims be if a company’s data storage technology ties it to a single location?

Thanks to globalization, companies now have offices in several countries across numerous continents. With the rise of working from home, businesses need to set up a system that allows workers to get data and make important decisions no matter where they are.

Cloud data warehouses can be accessed from any location on the globe, whereas on-premises data warehouses can only be accessed from a single location. Access control measures in contemporary data warehouses ensure that only the right people can reach the data required for business insight.

And because contemporary data warehouses don’t trade data quality for accessibility, firms don’t have to worry about quality degrading when numerous workers use the warehouse simultaneously.

The ability to adapt and grow

Designed for the contemporary corporation, cloud data warehouses are ideal. Cloud-based data warehouses enable enterprises to dynamically allocate resources based on changing business needs. 

For example, a tourism company may need greater computing power for analytics during the peak season but only a fraction of that capacity during the off-season. On-premises data warehouses force firms to provision for the peak, whereas a cloud DWH lets them pay just for the resources and capabilities they need.

Increased efficiency

In most cloud data warehouses, the workload is distributed across many servers. Large volumes of data may be handled at the same time on these servers without any interruptions.

Different departments typically work with organizational data in different ways. The marketing department, for example, may use this data to learn which marketing channels are most effective with certain demographics, while the customer support team needs data from various touchpoints to determine where the most complaints and comments have been made.

In a modern cloud-based data warehouse, every department in the organization can access important data and make fact-based decisions, which can improve productivity. Beyond the benefits already mentioned, there are several more, such as increased data storage, expanded integration, and improved disaster recovery.

2. Explain OLAP in the context of a data warehouse.

OLAP is a very interesting analytical solution, even for business-oriented professionals. The candidate should explain the different forms of OLAP and how they are related.

OLAP (Online Analytical Processing) software lets business analysts, managers, and executives analyze and present company data quickly and consistently. People who use OLAP systems can carry out analytical and navigational operations at the same time because they can look at an organization’s aggregated data in a variety of ways.

OLAP systems typically run against denormalized databases. These tools can traverse a data warehouse to find and present information.

OLAP Architectures

  • ROLAP (Relational Online Analytical Processing): The data stays in the relational database, and queries are sent to the relational database server (a sketch of such a query follows this list).
  • MOLAP (Multidimensional Online Analytical Processing): Processing is done on a dedicated multidimensional (cube) server.
  • HOLAP (Hybrid Online Analytical Processing): A hybrid that combines ROLAP and MOLAP.
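In a ROLAP setup the “cube” is ultimately expressed as SQL against the star schema itself. The sketch below reuses the hypothetical fact_sales and dim_date tables from the star schema example earlier and uses GROUP BY ROLLUP, which most warehouse engines support, to compute subtotals at several levels of the date hierarchy in a single pass; exact syntax may vary by engine.

```sql
-- ROLAP-style aggregation: totals per year and quarter, per year,
-- and a grand total, all computed by the relational database server.
SELECT
    d.year,
    d.quarter,
    SUM(f.sales_amount) AS total_sales
FROM fact_sales AS f
JOIN dim_date   AS d ON d.date_key = f.date_key
GROUP BY ROLLUP (d.year, d.quarter)
ORDER BY d.year, d.quarter;
```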


3. Can you explain the data warehouse architecture?

Data warehouse architecture is complex, but the candidate should know the essential components needed to build a functional data warehouse that meets expectations. You may also ask this to gauge the candidate’s knowledge of the different types of data warehouse architecture.

Single-Tier Data Warehouse Architecture

A single-tier data warehouse stores data more densely, which minimizes the overall amount of data held. Although this architecture is useful for removing redundancies, it is not ideal for organizations with complicated data needs and frequent data flows; layered data warehouse designs exist to handle those more sophisticated flows.

Two-Tier Data Warehouse Architecture

Two-tier models, by contrast, separate the data sources from the warehouse itself, with a single system and database server typically acting as the warehouse. This architecture is most often used by small firms. Because of its design, the two-tier approach cannot easily be scaled, and it supports only a limited number of users.

Three-Tier Data Warehouse Architecture

In a three-tier architecture, data moves from raw input to important insights in an orderly way. At the bottom of the paradigm sits the database server, which consolidates data from many sources, such as the transactional databases used by front-end systems.

The intermediate layer hosts an OLAP server, which reorganizes the data so the user can conduct a wide variety of analyses and investigations.

The third and topmost tier is the client tier, which contains the tools and APIs used for high-level data analysis, querying, and reporting. A fourth layer is sometimes described, but it tends to be overlooked because it is less critical to the overall design than the other three.

Components of a Data Warehouse

  • The Data Source : An enterprise’s transactional systems might be made up of a variety of data types.
  • Data Stage : A storage space plus a series of operations. As the name implies, this layer extracts data from the transactional systems and performs the various data preparation tasks needed before the data can be loaded into the data warehouse. The end user does not see any of this.
  • Presentation Server : Environment in which data may be accessed by end users. OLAP technology (Online Analytical Processing) may also be used to store data on these servers, as many data marts only function with data in dimensional form.
  • Data Mart : Here, the data warehouse is broken down into groups based on the departments or views that users need.
  • Data Mining : This technique analyzes enormous amounts of data to find hidden patterns and relationships. Because such trends are unlikely to surface through a simple search of the warehouse, an automated scanning tool that works from predefined criteria is an important part of the architecture.
  • Data Warehouse Reporting Layer : Access to the BI interface or BI database architecture is provided via the reporting layer in the warehouse. The reporting layer of a data warehouse is like a dashboard for looking at data, making reports, and getting all the important information.

4. In the context of data warehousing, what is data purging?

Data purging is a central term in the life cycle of a data warehouse. The candidate should demonstrate knowledge of the term, and demonstrate their process for deciding when data needs to be purged.

Data purging permanently removes outdated data from a storage location once it is no longer needed. Purges are often based on the age or the nature of the data. An archive is a copy of the deleted material preserved in a different storage location.

Data removed from the main storage site can be recovered and restored from the archive copy if it is needed again. A plain delete, by contrast, removes the data permanently without preserving a backup.
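A common way to implement a purge in SQL is to copy the aging rows to an archive table before deleting them from the active table. The sketch below is generic and uses hypothetical names and a purely age-based retention rule; real purge criteria, archive destinations, and date-arithmetic syntax vary by organization and database engine.

```sql
-- 1. Archive: keep a copy of the rows that are about to be purged.
INSERT INTO fact_sales_archive
SELECT *
FROM fact_sales
WHERE sale_date < CURRENT_DATE - INTERVAL '7 years';

-- 2. Purge: permanently remove those rows from active storage.
DELETE FROM fact_sales
WHERE sale_date < CURRENT_DATE - INTERVAL '7 years';
```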


5. What is the relationship between data warehousing and business intelligence?

This is a very important question. Many people see the data warehouse as something that only concerns data engineers or other technical roles responsible for managing the architecture. The candidate, however, must also identify the role of the business intelligence professional in analyzing the data available in a data warehouse.

Business intelligence (BI) and advanced analytics are two of the primary functions of a data warehouse, which is a type of data management system. Data warehouses are often built to hold huge quantities of historical data for sophisticated querying and analysis, drawing that data from application log files and transactional applications.

A data warehouse is a place where enormous volumes of data from many sources can be gathered and consolidated. Organizations can apply their analytical skills to extract valuable business insights from this data and make better decisions, and data scientists and business analysts benefit greatly from the accumulated historical records. Because of these characteristics, a data warehouse is often regarded as an organization’s “single source of truth.”


1. Data warehousing requires a certain set of abilities from an IT expert. Can you cite them?

In a business environment, the data warehouse is of interest to more than one type of professional, and the candidate should be able to define each role clearly. Precisely because data engineers and data analysts use a data warehouse’s resources so differently, the candidate’s answer here is revealing.

The skills required depend on the job the candidate is applying for. Data warehousing interviews are conducted for a variety of positions, and each of these roles calls for a particular skill set.

The responsibilities of business intelligence

In this category are positions like business intelligence expert and business intelligence analyst. To succeed in this position, candidates should be able to think critically, analyze data, solve problems, and communicate effectively, as well as understand the sector. There are several responsibilities that fall within the purview of a business intelligence specialist.

Positions in the field of data analysis

Data analysts, data scientists, and data architects are all possible job titles for someone working in this field. These experts are very knowledgeable. Mathematical aptitude, data analytic abilities, expertise with data modeling, familiarity with programming languages like SQL, a focus on deadlines, and the ability to work effectively in a group are all desirable qualities. Those who work in the IT department and may also be developers are included in this group.

The responsibilities of software developers

Depending on their qualifications and experience, software developers may find themselves interviewing for a data warehousing position. For example, anybody who needs to write SQL code for a database benefits from being familiar with the language. Specific credentials, logical thinking, attention to detail, and knowledge of backend infrastructure are all desirable traits in job candidates.

2. What is the Chameleon Method?

The Chameleon method is an important concept in the data warehouse context precisely because it addresses the shortcomings of some traditional clustering approaches. The interviewer will be looking for a simple, straightforward explanation of the method in a data warehousing setting.

Chameleon is a hierarchical clustering algorithm that addresses the limitations of traditional models and approaches in data warehousing. It operates on a sparse graph in which nodes represent data items and weighted edges represent the similarity between them. This representation allows the algorithm to work on very large data sets.

The two-phase methodology is used to locate the clusters in the data set:

  • Graph partitioning is the initial step, which produces a large number of small sub-clusters.
  • In the second phase, agglomerative hierarchical clustering repeatedly merges those sub-clusters to find the genuine clusters.


3. Is there a standard way to build a data warehouse?

There are several processes for building a data warehouse, but most of them share common steps. The candidate should describe these commonalities and explain the advantages of this cycle when asked by the interviewer.

Before anything else, we need a clear picture of what the user is looking for. We begin by identifying the dimensions and facts required to meet the needs of management. At this point we focus on the DW’s content rather than its implementation, so we don’t yet need to worry about whether the data actually exists.

  • Data Mapping : Data mapping identifies the data’s origins and determines the best path to reach it. By examining the availability of relevant data, we can now check whether the requirements outlined in the previous stage can actually be met.
  • Staging Area Construction : After mapping, we build the Staging Area, a transition zone for DW data. Here data is transferred out of and isolated from the operational (OLTP) systems, and the dimension tables receive the treatment needed for future loads.
  • Dimension Construction : At this stage, the framework for the dimensions in the DW is put together. We also specify the degree of historicity to be kept for the data in each dimension.
  • Fact Construction : After the dimensions are created, we build the underlying structure of the fact table(s). This step evaluates and specifies the level of detail to be stored in each fact, and growth and storage capacity are assessed as part of the process.
  • Definition of the General Loading Process : With the preceding phases complete, we build the engine that loads, updates, coordinates, and processes everything in an orderly manner. The overall loading process is, in effect, the DW’s “brain.”
  • Metadata Creation : To complete the metadata documentation, we describe the build process and create the data dictionary. Knowledge management relies heavily on metadata.

Remember that a Data Mart partitions the DW into subsets of information organized around specific subjects. This partitioning should be done only once if possible, and repeated only when genuinely necessary.

Because each phase depends on the completion of the previous one, it is critical to follow the steps in the order listed: a step may begin only once the preceding phase is finished.

Finally, if each of these activities is given the required attention and completed successfully, the DW project has a very high chance of success. The result is a central location for data that will be useful to the company when making decisions.
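To illustrate the Staging Area Construction and Dimension Construction steps described above, here is a minimal, generic SQL sketch (all names hypothetical; identity-column syntax varies by engine) that loads new customers from a staging table into a dimension, generating surrogate keys and skipping rows that already exist.

```sql
-- Staging table: raw data copied out of and isolated from the OLTP source.
CREATE TABLE stg_customer (
    customer_id   INT,            -- natural key from the source system
    customer_name VARCHAR(100),
    city          VARCHAR(60)
);

-- Dimension table with a generated surrogate key.
CREATE TABLE dim_customer (
    customer_key  INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id   INT,            -- natural key retained for lookups
    customer_name VARCHAR(100),
    city          VARCHAR(60)
);

-- Load only the customers not yet present in the dimension
-- (a simple insert-if-missing step; history handling is omitted).
INSERT INTO dim_customer (customer_id, customer_name, city)
SELECT s.customer_id, s.customer_name, s.city
FROM stg_customer AS s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_customer AS d WHERE d.customer_id = s.customer_id
);
```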

4. The ETL cycle has three phases. Can you explain them?

ETL is part of the life cycle of a data warehouse. The candidate should demonstrate mastery by explaining its three phases to the interviewer.

  • Extraction : The initial step of the process focuses on extracting the data, typically with SQL. A preliminary examination of the data may be done in a transition area at this point. Because the source data is so heterogeneous, it must first be significantly standardized so that, by the end of this step, all of it is in a single format that can be manipulated in subsequent phases.
  • Transformation : The information extracted and standardized in the previous step is adapted in the transformation stage. Sanitization happens here, the process of transforming the data so that only information management can actually use is passed on. Filters for future analysis, such as age, location, time, position, and hierarchical level, may also be created at this stage.
  • Load : The third and final phase loads the previously prepared data into a new database, which may be an enterprise-wide warehouse or a departmental data mart. After this step the process is repeated and any remaining irregularities in the information flow are fixed. Keeping an ordered data model makes it feasible to map all the patterns so they are available for future use.

Keep in mind that ETL is not necessarily conducted in a single data processing environment; any cloud-based or on-premises tool can be used for the task. Once the loading is done, data mining can also be applied at a more advanced level to develop and detect new patterns of user, buyer, or supplier behavior.

ETL is what creates and maintains the dimension structures and facts in a data warehouse environment. DWs are designed to house data that must be acted upon at some point, and as we saw in the previous section, this is where the Extraction, Transformation, and Load (ETL) operations come into play.

In this sense, ETL also acts as a safeguard for data that has been stored in the data warehouse but not yet accessed by users.
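Continuing the hypothetical tables from the earlier sketches, the load phase typically resolves the natural keys held in staging to the dimensions’ surrogate keys before writing to the fact table. This is only an illustrative fragment of a load job, not a complete ETL process; the join keys shown are assumptions.

```sql
-- Load the fact table: look up surrogate keys in the dimensions
-- and insert the measures captured in staging.
INSERT INTO fact_sales (date_key, product_key, quantity, sales_amount)
SELECT
    d.date_key,
    p.product_key,
    s.quantity,
    s.sales_amount
FROM stg_sales AS s
JOIN dim_date    AS d ON d.full_date  = s.sale_date     -- assumes staging carries the sale date
JOIN dim_product AS p ON p.product_id = s.product_id;   -- assumes a shared natural product key
```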


5. Why would you need a real-time data warehouse?

The candidate must demonstrate knowledge of the advantages and disadvantages of adopting an RTDW. You’ll be interested in the differentiating factors that drive businesses to adopt this solution.

RTDWs look and feel like typical data warehouses, but they are far quicker and more scalable than traditional data warehouses. You can have “small data” semantics and performance on a “big data” scale.

  • Data arrives in the warehouse more frequently and in smaller increments.
  • Data can be queried as soon as it lands, with no need to wait for batch processing, aggregation, or compaction.
  • Queries execute faster: small, selective queries are measured in milliseconds, while big, scan- or compute-heavy queries are handled at very high bandwidth.
  • Changes can be made quickly when needed, because no major restructuring is required.

Although this may sound straightforward or even easy, data warehousing experience shows it is anything but: serving a large volume of rapidly arriving data, some of which must be updated, alongside a large number of queries with wildly divergent patterns is extremely difficult.


You may use these data warehouse interview questions to find the best applicant, whether you’re a professional preparing for an interview or a hiring manager.

Make sure you keep in mind that technical abilities and expertise are only one component of the recruiting process. In order to secure (or identify the ideal applicant for) the job, both past experience and soft skills are crucial.





Solution for IBM Data Engineer Professional Certificate

xzZero/DataEng_IBM

Description

In this repo, I recap my solutions to the assignments of the 15-month IBM Data Engineering Professional Specialization on Coursera, which I completed in less than 3 weeks. The specialization teaches you to:

  • Create, design, and manage relational databases and apply database administration (DBA) concepts to RDBMSes such as MySQL, PostgreSQL, and IBM Db2.
  • Develop and execute SQL queries using SELECT, INSERT, UPDATE, DELETE statements, database functions, stored procedures, Nested Queries, and JOINs.
  • Demonstrate working knowledge of NoSQL & Big Data using MongoDB, Cassandra, Cloudant, Hadoop, Apache Spark, Spark SQL, Spark ML, Spark Streaming.
  • Implement ETL & Data Pipelines with Bash, Airflow & Kafka; architect, populate, deploy Data Warehouses; create BI reports & interactive dashboards.​

There are 13 courses throughout the specialization and a capstone project at the end:

  • Introduction to Data Engineering
  • Python for Data Science, AI & Development
  • Python Project for Data Engineering
  • Introduction to Relational Databases (RDBMS)
  • Databases and SQL for Data Science with Python
  • Hands-on Introduction to Linux Commands and Shell Scripting
  • Relational Database Administration (DBA)
  • ETL and Data Pipelines with Shell, Airflow and Kafka
  • Getting Started with Data Warehousing and BI Analytics
  • Introduction to NoSQL Databases
  • Introduction to Big Data with Spark and Hadoop
  • Data Engineering and Machine Learning using Spark
  • Data Engineering Capstone Project

Tools and Technologies

  • OLTP database – MySQL
  • NoSQL database – MongoDB
  • Production data warehouse – DB2 on Cloud
  • Staging data warehouse – PostgreSQL
  • Big data platform - Hadoop
  • Big data analytics platform – Spark
  • Business Intelligence Dashboard - IBM Cognos Analytics
  • Data Pipelines - Apache Airflow

Certificates


7 Best Data Warehouse Tools to Explore in 2024

If you’ve been exploring other aspects of the data analytics pipeline, you’ll most likely encounter data warehouses at some point. These tools are essential for storing and managing large amounts of data in a structured and accessible manner.

With the rapid growth of big data, there has also been an increase in interest in data warehouse tools to store all this data for downstream analysis.

In this article, we will explore seven of the best data warehouse tools that you should consider using in 2024.

Table of contents:

  • What is a data warehouse?
  • Why use data warehouses?
  • Who uses data warehouses?
  • Google BigQuery
  • Amazon Redshift
  • Oracle Autonomous Data Warehouse

Let’s explore this list further below.

1. What is a data warehouse?

A data warehouse is a central repository that stores large amounts of data from various sources in a structured and organized manner. It allows for efficient querying, analysis, and reporting of this data.

Types of data warehouses

Data warehouses can be split into several types based on the type of data stored.

Here are some examples:

  • Enterprise data warehouse (EDW): Stores all enterprise data in one central location
  • Operational data store (ODS): Stores real-time data that is frequently accessed and updated
  • Online analytical processing (OLAP): Used for complex analytical queries on large datasets
  • Data mart: A subset of a data warehouse that is focused on a specific department or area within an organization

Now, you must be wondering—what is the purpose of having them over databases or Excel flat files?

I’ll explain more below.

2. Why use data warehouses?

Data warehouses are used for a variety of purposes, but the primary reason is to store and organize data in a central location. This allows for faster and more efficient analysis of large datasets.

Other benefits include:

  • Improved data quality: Data warehouses often have processes in place to ensure data integrity and consistency
  • Historical data storage: Data warehouses can store large amounts of historical data, allowing for trend analysis and forecasting
  • Data accessibility: Data warehouses make it easier to access and query data from various sources in one location

3. Who uses data warehouses?

Data warehouses are used by organizations of all sizes and industries to store and manage their large datasets. Most data professionals will be interacting with data warehouses but for different purposes.

Some examples of data professionals that use them are:

  • Data analyst : Query data warehouses and analyze the data for insights.
  • Data engineer : Build and maintain the infrastructure for data warehouses.
  • Business intelligence analyst: Use data warehouses to create reports and visualizations for business stakeholders.
  • Analytics engineer : Create and optimize data pipelines to load data into the warehouse.

Companies tend to use data warehouses to store large amounts of data loaded from multiple sources, such as systems containing customer data, sales information, and financial records.

In fact, many companies have also chosen to explore related architectures, such as the data lake and the data lakehouse.

4. Snowflake

Snowflake is a cloud-based data warehouse platform that offers a fully managed environment with automatic scaling and concurrency. It’s known for its ease of use, security, and speed.

Some key features of Snowflake include:

  • Multi-cluster architecture: Allows for scalability and separation of compute and storage layers
  • Virtual warehouses: Can be created on-demand to handle different workloads in parallel
  • Data sharing: Allows for the secure sharing of data between organizations

It uses a cloud-centric approach that ensures seamless scalability and concurrency.

With a unique architecture that segregates storage and computing, Snowflake offers a pay-for-what-you-use pricing model, ensuring cost-effective resource management.

Snowflake is also known to be a common tool used in the modern data stack , integrating well with popular data tools such as dbt, Tableau, and Looker.
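To illustrate the separation of compute and storage, the snippet below shows Snowflake-flavored SQL that creates an auto-suspending virtual warehouse and then queries a table with it. The object names are hypothetical, and the exact options should be checked against Snowflake’s current documentation.

```sql
-- Create an on-demand compute cluster that suspends itself when idle.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
       AUTO_SUSPEND   = 60      -- seconds of inactivity before suspending
       AUTO_RESUME    = TRUE;

-- Point the session at that warehouse and run a query; storage is managed and billed separately.
USE WAREHOUSE analytics_wh;

SELECT region, SUM(sales_amount) AS total_sales
FROM sales
GROUP BY region;
```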

5. Amazon S3

Amazon S3 is a highly scalable, object-based storage service provided by Amazon Web Services (AWS). It’s often used as the storage layer of a data warehouse or data lake because it can hold large amounts of data in its native format, making it incredibly flexible.

Some key features of Amazon S3 include:

  • Scalability: Can store any amount of data and handle millions of requests per second
  • AWS integrations: A rich ecosystem of integrated services for data processing and analytics
  • Cost-effective: Pay-for-what-you-use pricing model

It is a robust and versatile data warehousing solution designed for scalability and durability.

It excels in providing a secure, high-performance backbone for storing and retrieving any amount of data.

Amazon S3 is best suited for organizations that already use Amazon in their tech stack, such as AWS EC2 or Amazon EMR.

6. Google BigQuery

Google BigQuery is a serverless, highly scalable data warehouse offered by Google Cloud. It’s known for its speed and ease of use, making it an excellent option for organizations that require real-time insights from their data.

Some key features of Google BigQuery include:

  • Fast querying capabilities: Can process petabytes of data in seconds.
  • Automatic scaling: No need to manage computing resources, as it is done automatically.
  • Machine learning capabilities: Can perform advanced analytics and machine learning on the data stored in BigQuery.

As a cloud-native solution, Google BigQuery offers a seamless integration with other Google Cloud services on the Google Cloud Platform and third-party tools.

This means that data transfers from Google Analytics and Google Ads are much more efficient, thus enabling a holistic view of data across various marketing channels.

BigQuery is also a staple in many implementations of the modern data stack.
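As a small, hedged illustration of BigQuery’s SQL surface, the snippet below creates a date-partitioned table and then queries a single partition so that only a slice of the data is scanned. The dataset and table names are hypothetical; consult Google’s documentation for the full set of DDL options.

```sql
-- Partition the table by event date so queries can prune old partitions.
CREATE TABLE IF NOT EXISTS analytics.page_views (
    user_id  STRING,
    page     STRING,
    event_ts TIMESTAMP
)
PARTITION BY DATE(event_ts);

-- Filtering on the partitioning column limits how much data is scanned (and billed).
SELECT page, COUNT(*) AS views
FROM analytics.page_views
WHERE DATE(event_ts) = DATE '2024-05-01'
GROUP BY page
ORDER BY views DESC;
```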

7. Databricks

Databricks is a unified data analytics platform that combines data engineering, data science, and business analytics in one collaborative workspace. It is optimized for large-scale data processing and real-time analytics.

Some key features of Databricks include:

  • Spark-based engine : Allows for fast data processing and can handle complex queries
  • Unified analytics platform : Streamline workflows from ETL to model training and deployment.
  • MLflow : An open-source platform to manage the complete machine learning lifecycle.
  • Real-time analytics : Process streaming data in real time with structured streaming.
  • Collaborative notebooks : Enable data teams to collaborate using shared Databricks notebooks and languages.
  • Databricks SQL : Run SQL queries rapidly and gain insights with optimized execution.
  • Data governance : Built-in robust security controls and compliance standards.

Databricks’ compatibility with multiple programming languages like Python, SQL, R, and Scala ensures its accessibility for a broad range of data professionals.

8. Amazon Redshift

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse offered by Amazon Web Services (AWS). It’s mainly optimized for querying and analyzing structured and semi-structured data.

It can quickly analyze and report on large datasets, making it suitable for organizations that need quick insights from their data.

Some key features of Amazon Redshift include:

  • Columnar storage : Organize data in a columnar format to optimize query performance.
  • Massively parallel processing (MPP) : Distribute queries across multiple nodes to process them in parallel, resulting in faster performance.
  • Integrations with BI tools: Can be easily integrated with popular BI tools such as Tableau, Looker, and Power BI.
  • Pay-for-what-you-use pricing : Only pay for the resources used to run queries.
  • Automatic backups and recovery : Automatically backs up data to S3, making it easy to recover in case of an outage.

Startup companies tend to like using Redshift as they can start small and scale up as needed, adjusting their usage to match their budget and data requirements.

Its integration with AWS ecosystem services like S3, DynamoDB, and Elastic MapReduce further empowers users to blend traditional data warehousing with big data analytics for a holistic insight generation workflow.

Overall, Amazon Redshift is a good data warehouse option if you already use AWS in your data stack or engineering tech stack or if you need a highly-scalable solution for large datasets.
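The snippet below sketches Redshift-style DDL in which a distribution key spreads rows across the cluster’s nodes (the MPP layout mentioned above) and a sort key keeps the columnar data ordered for range scans. The names are hypothetical; verify the current syntax in the AWS documentation.

```sql
-- Distribute rows by customer so joins on customer_key stay node-local,
-- and sort by date so date-range filters scan fewer blocks.
CREATE TABLE fact_orders (
    order_id     BIGINT,
    customer_key INT,
    order_date   DATE,
    order_amount DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_key)
SORTKEY (order_date);
```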

9. Oracle Autonomous Data Warehouse

Oracle Autonomous Data Warehouse is a cloud-based data warehouse service offered by Oracle. It is designed to automate and simplify the process of creating, securing, and managing a data warehouse.

Some key features of Oracle Autonomous Data Warehouse include:

  • Autonomous operations : Uses machine learning to automatically tune performance, optimize storage, and secure data.
  • High performance : Can run complex queries in parallel for faster processing.
  • Elastic scalability : Can scale compute resources up or down based on workload, reducing costs for idle resources.
  • Built-in machine learning algorithms : Allows for advanced data analysis and predictive modeling.
  • Low-code app development platform: Users can build and deploy custom applications on the data warehouse without using much code.

Oracle Autonomous Data Warehouse is suitable for organizations that require a robust, enterprise-level data warehouse with advanced machine learning capabilities.

It’s also a good option for organizations that use Oracle databases and want a fully automated solution with high performance and scalability capabilities.

10. PostgreSQL

PostgreSQL is an open-source, relational database management system that is highly customizable and extensible. It supports both structured and unstructured data and offers a wide range of features for data warehousing.

Known for its robustness, extensibility, and SQL compliance, it has stood the test of time as a foundation for many types of data-intensive applications.

As open-source software, it benefits from vast community support and resources, making it an attractive option for organizations of various sizes.

Some key features of PostgreSQL include:

  • ACID compliance : Ensures data integrity and consistency.
  • Multi-version concurrency control (MVCC) : Enables multiple users to access and update the same data without interference.
  • Data types for structured and unstructured data : Can store JSON, XML, and other non-relational data in addition to traditional relational data.
  • Extensions and plugins : Offers a wide array of extensions and plugins to extend functionality, such as Citus, which adds distributed, warehouse-style scaling to PostgreSQL.
  • High availability : Supports high availability configurations for automatic failover and data protection.
  • Scalability : Can scale horizontally by adding more servers, allowing for the handling of large datasets.

Overall, PostgreSQL is a reliable and versatile choice for organizations looking for an open-source, customizable, and scalable data warehousing solution.
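One lightweight way to use PostgreSQL for warehouse-style workloads is to precompute aggregates in a materialized view and refresh it on a schedule. The sketch below uses hypothetical tables; note that REFRESH MATERIALIZED VIEW CONCURRENTLY requires a unique index on the view.

```sql
-- Precompute a daily sales summary once, instead of re-aggregating on every query.
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
    order_date,
    COUNT(*)          AS order_count,
    SUM(order_amount) AS total_amount
FROM orders
GROUP BY order_date;

-- A unique index lets the view be refreshed without blocking readers.
CREATE UNIQUE INDEX ON daily_sales_summary (order_date);

-- Re-run the underlying query and swap in fresh results (typically on a schedule).
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_summary;
```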

11. Next steps

To sum things up, here are some points to take away from this article:

  • A data warehouse is a central repository that stores data from various sources.
  • Data warehouses help to organize data for faster and more efficient analysis of large datasets.
  • Common data warehouse tools used are: Snowflake, Amazon S3, Google BigQuery, Databricks, Amazon Redshift, Oracle, and PostgreSQL.

These tools are some of the most commonly used among data analysts and data engineers. If you’re keen on a career in data, you might want to pick up knowledge on one along the way.

So, now that you’ve learned about various data warehousing solutions, what’s next?

Here are a few steps you can take to continue your learning journey:

  • Research about data analytics tools :  Familiarize yourself with popular data analytics tools .
  • Learn data analytics : Try out CareerFoundry’s free, 5-day data analytics short course that covers all the basics.


Microsoft Fabric May 2024 Update


Welcome to the May 2024 update.  

Here are a few, select highlights of the many we have for Fabric. You can now ask Copilot questions about data in your model, Model Explorer and authoring calculation groups in Power BI desktop is now generally available, and Real-Time Intelligence provides a complete end-to-end solution for ingesting, processing, analyzing, visualizing, monitoring, and acting on events.

There is much more to explore, please continue to read on. 

Microsoft Build Announcements

At Microsoft Build 2024, we are thrilled to announce a huge array of innovations coming to the Microsoft Fabric platform that will make Microsoft Fabric’s capabilities even more robust and even customizable to meet the unique needs of each organization. To learn more about these changes, read the “ Unlock real-time insights with AI-powered analytics in Microsoft Fabric ” announcement blog by Arun Ulag.

Fabric Roadmap Update

Last October at the Microsoft Power Platform Community Conference we  announced the release of the Microsoft Fabric Roadmap . Today we have updated that roadmap to include the next semester of Fabric innovations. As promised, we have merged Power BI into this roadmap to give you a single, unified road map for all of Microsoft Fabric. You can find the Fabric Roadmap at  https://aka.ms/FabricRoadmap .

We will continue iterating on our Roadmap over the coming year and would love to hear your recommendations on ways we can make this experience better for you. Please submit suggestions at https://aka.ms/FabricIdeas .

Earn a discount on your Microsoft Fabric certification exam!  

We’d like to thank the thousands of you who completed the Fabric AI Skills Challenge and earned a free voucher for Exam DP-600 which leads to the Fabric Analytics Engineer Associate certification.   

If you earned a free voucher, you can find redemption instructions in your email. We recommend that you schedule your exam now, before your discount voucher expires on June 24th. All exams must be scheduled and completed by this date.

If you need a little more help with exam prep, visit the Fabric Career Hub which has expert-led training, exam crams, practice tests and more.  

Missed the Fabric AI Skills Challenge? We have you covered. For a limited time, you could earn a 50% exam discount by taking the Fabric 30 Days to Learn It Challenge .

Modern Tooltip now on by Default


Power BI tooltips are embarking on an evolution to enhance their functionality. To lay the groundwork, we are introducing the modern tooltip as the new default , a feature that many users may already recognize from its previous preview status. This change is more than just an upgrade; it’s the first step in a series of remarkable improvements. These future developments promise to revolutionize tooltip management and customization, offering possibilities that were previously only imaginable. As we prepare for the general availability of the modern tooltip, this is an excellent opportunity for users to become familiar with its features and capabilities. 


Discover the full potential of the new tooltip feature by visiting our dedicated blog . Dive into the details and explore the comprehensive vision we’ve crafted for tooltips, designed to enhance your Power BI experience. 

We’ve listened to our community’s feedback on improving our tabular visuals (Table and Matrix), and we’re excited to initiate their transformation. Drawing inspiration from the familiar PivotTable in Excel , we aim to build new features and capabilities upon a stronger foundation. In our May update, we’re introducing ‘ Layouts for Matrix .’ Now, you can select from compact , outline , or tabular layouts to alter the arrangement of components in a manner akin to Excel. 


As an extension of the new layout options, report creators can now craft custom layout patterns by repeating row headers. This powerful control, inspired by Excel’s PivotTable layout, enables the creation of a matrix that closely resembles the look and feel of a table. This enhancement not only provides greater flexibility but also brings a touch of Excel’s intuitive design to Power BI’s matrix visuals. Only available for Outline and Tabular layouts.


To further align with Excel’s functionality, report creators now have the option to insert blank rows within the matrix. This feature allows for the separation of higher-level row header categories, significantly enhancing the readability of the report. It’s a thoughtful addition that brings a new level of clarity and organization to Power BI’s matrix visuals and opens a path for future enhancements for totals/subtotals and rows/column headers. 


We understand your eagerness to delve deeper into the matrix layouts and grasp how these enhancements fulfill the highly requested features by our community. Find out more and join the conversation in our dedicated blog , where we unravel the details and share the community-driven vision behind these improvements. 

Following last month’s introduction of the initial line enhancements, May brings a groundbreaking set of line capabilities that are set to transform your Power BI experience: 

  • Hide/Show lines : Gain control over the visibility of your lines for a cleaner, more focused report. 
  • Customized line pattern : Tailor the pattern of your lines to match the style and context of your data. 
  • Auto-scaled line pattern : Ensure your line patterns scale perfectly with your data, maintaining consistency and clarity. 
  • Line dash cap : Customize the end caps of your customized dashed lines for a polished, professional look. 
  • Line upgrades across other line types : Experience improvements in reference lines, forecast lines, leader lines, small multiple gridlines, and the new card’s divider line. 

These enhancements are not to be missed. We recommend visiting our dedicated blog for an in-depth exploration of all the new capabilities added to lines, keeping you informed and up to date. 

This May release, we’re excited to introduce on-object formatting support for Small multiples , Waterfall , and Matrix visuals. This new feature allows users to interact directly with these visuals for a more intuitive and efficient formatting experience. By double-clicking on any of these visuals, users can now right-click on the specific visual component they wish to format, bringing up a convenient mini-toolbar. This streamlined approach not only saves time but also enhances the user’s ability to customize and refine their reports with ease. 


We’re also thrilled to announce a significant enhancement to the mobile reporting experience with the introduction of the pane manager for the mobile layout view. This innovative feature empowers users to effortlessly open and close panels via a dedicated menu, streamlining the design process of mobile reports. 


We recently announced a public preview for folders in workspaces, allowing you to create a hierarchical structure for organizing and managing your items. In the latest Desktop release, you can now publish your reports to specific folders in your workspace.  

When you publish a report, you can choose the specific workspace and folder for your report. The interface is simple and easy to understand, making organizing your Power BI content from Desktop better than ever. 


To publish reports to specific folders in the service, make sure the “Publish dialogs support folder selection” setting is enabled in the Preview features tab in the Options menu. 


Learn more about folders in workspaces.   

We’re excited to preview a new capability for Power BI Copilot allowing you to ask questions about the data in your model! You could already ask questions about the data present in the visuals on your report pages – and now you can go deeper by getting answers directly from the underlying model. Just ask questions about your data, and if the answer isn’t already on your report, Copilot will then query your model for the data instead and return the answer to your question in the form of a visual! 


We’re starting this capability off in both Edit and View modes in Power BI Service. Because this is a preview feature, you’ll need to enable it via the preview toggle in the Copilot pane. You can learn more about all the details of the feature in our announcement post. 

We are excited to announce the general availability of DAX query view. DAX query view is the fourth view in Power BI Desktop to run DAX queries on your semantic model.  

DAX query view comes with several ways to help you be as productive as possible with DAX queries. 

  • Quick queries. Have the DAX query written for you from the context menu of tables, columns, or measures in the Data pane of DAX query view. Get the top 100 rows of a table, statistics of a column, or DAX formula of a measure to edit and validate in just a couple clicks! 
  • DirectQuery model authors can also use DAX query view. View the data in your tables whenever you want! 
  • Create and edit measures. Edit one or multiple measures at once. Make changes and see the change in action in a DAX query. Then update the model when you are ready. All in DAX query view! 
  • See the DAX query of visuals. Investigate the visuals DAX query in DAX query view. Go to the Performance Analyzer pane and choose “Run in DAX query view”. 
  • Write DAX queries. You can create DAX queries with Intellisense, formatting, commenting/uncommenting, and syntax highlighting. And additional professional code editing experiences such as “Change all occurrences” and block folding to expand and collapse sections. Even expanded find and replace options with regex. 

Learn more about DAX query view with these resources: 

  • Deep dive blog: https://powerbi.microsoft.com/blog/deep-dive-into-dax-query-view-and-writing-dax-queries/  
  • Learn more: https://learn.microsoft.com/power-bi/transform-model/dax-query-view  
  • Video: https://youtu.be/oPGGYLKhTOA?si=YKUp1j8GoHHsqdZo  

DAX query view includes an inline Fabric Copilot to write and explain DAX queries, which remains in public preview. This month we have made the following updates. 

1. Run the DAX query before you keep it. Previously the Run button was disabled until the generated DAX query was accepted or Copilot was closed. Now you can run the DAX query and then decide to keep or discard the DAX query. 


2. Conversationally build the DAX query. Previously the DAX query generated was not considered if you typed additional prompts and you had to keep the DAX query, select it again, then use Copilot again to adjust. Now you can simply adjust by typing in additional user prompts.   


3. Syntax checks on the generated DAX query. Previously there was no syntax check before the generated DAX query was returned. Now the syntax is checked, and the prompt automatically retried once. If the retry is also invalid, the generated DAX query is returned with a note that there is an issue, giving you the option to rephrase your request or fix the generated DAX query. 


4. Inspire buttons to get you started with Copilot. Previously nothing happened until a prompt was entered. Now click any of these buttons to quickly see what you can do with Copilot! 


Learn more about DAX queries with Copilot with these resources: 

  • Deep dive blog: https://powerbi.microsoft.com/en-us/blog/deep-dive-into-dax-query-view-with-copilot/  
  • Learn more: https://learn.microsoft.com/en-us/dax/dax-copilot  
  • Video: https://www.youtube.com/watch?v=0kE3TE34oLM  

We are excited to introduce you to the redesigned ‘Manage relationships’ dialog in Power BI Desktop! To open this dialog simply select the ‘Manage relationships’ button in the modeling ribbon.


Once opened, you’ll find a comprehensive view of all your relationships, along with their key properties, all in one convenient location. From here you can create new relationships or edit an existing one.


Additionally, you have the option to filter and focus on specific relationships in your model based on cardinality and cross filter direction. 


Learn more about creating and managing relationships in Power BI Desktop in our documentation . 

Ever since we released composite models on Power BI semantic models and Analysis Services , you have been asking us to support the refresh of calculated columns and tables in the Service. This month, we have enabled the refresh of calculated columns and tables in Service for any DirectQuery source that uses single sign-on authentication. This includes the sources you use when working with composite models on Power BI semantic models and Analysis Services.  

Previously, the refresh of a semantic model that uses a DirectQuery source with single-sign-on authentication failed with one of the following error messages: “Refresh is not supported for datasets with a calculated table or calculated column that depends on a table which references Analysis Services using DirectQuery.” or “Refresh over a dataset with a calculated table or a calculated column which references a Direct Query data source is not supported.” 

Starting today, you can successfully refresh the calculated table and calculated columns in a semantic model in the Service using specific credentials as long as: 

  • you used a shareable cloud connection and assigned it, and/or
  • you enabled granular access control for all data connection types.

Here’s how to do this: 

  • Create and publish your semantic model that uses a single sign-on DirectQuery source. This can be a composite model but doesn’t have to be. 
  • In the semantic model settings, under Gateway and cloud connections , map each single sign-on DirectQuery connection to a specific connection. If you don’t have a specific connection yet, select ‘Create a connection’ to create it: 


  • If you are creating a new connection, fill out the connection details and click Create , making sure to select ‘Use SSO via Azure AD for DirectQuery queries’: 


  • Finally, select the connection for each single sign-on DirectQuery source and select Apply : 


Then, either refresh the semantic model manually or plan a scheduled refresh to confirm the refresh now works successfully. Congratulations, you have successfully set up refresh for semantic models with a single sign-on DirectQuery connection that uses calculated columns or calculated tables!

We are excited to announce the general availability of Model Explorer in the Model view of Power BI, including the authoring of calculation groups. Semantic modeling is even easier with an at-a-glance tree view with item counts, search, and in context paths to edit the semantic model items with Model Explorer. Top level semantic model properties are also available as well as the option to quickly create relationships in the properties pane. Additionally, the styling for the Data pane is updated to Fluent UI also used in Office and Teams.  

A popular community request from the Ideas forum, authoring calculation groups is also included in Model Explorer. Calculation groups significantly reduce the number of redundant measures by allowing you to define DAX formulas as calculation items that can be applied to existing measures. For example, define a year-over-year, prior-month, conversion, or whatever calculation your report needs once as a calculation item and reuse it with existing measures. This reduces the number of measures you need to create and makes the maintenance of the business logic simpler.  

Available in both Power BI Desktop and when editing a semantic model in the workspace, take your semantic model authoring to the next level today!  


Learn more about Model Explorer and authoring calculation groups with these resources: 

  • Use Model explorer in Power BI (preview) – Power BI | Microsoft Learn  
  • Create calculation groups in Power BI (preview) – Power BI | Microsoft Learn  

Data connectivity  

We’re happy to announce that the Oracle database connector has been enhanced this month with the addition of Single Sign-On support in the Power BI service with Microsoft Entra ID authentication.  

Microsoft Entra ID SSO enables single sign-on to access data sources that rely on Microsoft Entra ID based authentication. When you configure Microsoft Entra SSO for an applicable data source, queries run under the Microsoft Entra identity of the user that interacts with the Power BI report. 


We’re pleased to announce the new and updated connectors in this release:   

  • [New] OneStream: The OneStream Power BI Connector enables you to seamlessly connect Power BI to your OneStream applications by simply logging in with your OneStream credentials. The connector uses your OneStream security, allowing you to access only the data permitted by your permissions within the OneStream application. Use the connector to pull cube and relational data along with metadata members, including all their properties. Visit OneStream Power BI Connector to learn more. Find this connector in the ‘Other’ category. 
  • [New] Zendesk Data : A new connector developed by the Zendesk team that aims to go beyond the functionality of the existing Zendesk legacy connector created by Microsoft. Learn more about what this new connector brings. 
  • [New] CCH Tagetik 
  • [Update] Azure Databricks  

Are you interested in creating your own connector and publishing it for your customers? Learn more about the Power Query SDK and the Connector Certification program .   

Last May, we announced the integration between Power BI and OneDrive and SharePoint. Previously, this capability was limited to only reports with data in import mode. We’re excited to announce that you can now seamlessly view Power BI reports with live connected data directly in OneDrive and SharePoint! 

When working on Power BI Desktop with a report live connected to a semantic model in the service, you can easily share a link to collaborate with others on your team and allow them to quickly view the report in their browser. We’ve made it easier than ever to access the latest data updates without ever leaving your familiar OneDrive and SharePoint environments. This integration streamlines your workflows and allows you to access reports within the platforms you already use. With collaboration at the heart of this improvement, teams can work together more effectively to make informed decisions by leveraging live connected semantic models without being limited to data only in import mode.  

Utilizing OneDrive and SharePoint allows you to take advantage of built-in version control, always have your files available in the cloud, and use familiar, simple sharing.  


While you told us that you appreciate the ability to limit the image view to only those who have permission to view the report, you asked for changes to the “Public snapshot” mode of the Power BI add-in for PowerPoint.   

To address some of the feedback we got from you, we have made a few more changes in this area.  

  • Add-ins that were saved as “Public snapshot” can be printed and no longer require you to go through all the slides and load each add-in for a permission check before the public image is made visible. 
  • You can use “Show as saved image” on add-ins that were saved as “Public snapshot”. This replaces the entire add-in with an image representation of it, so load time may be faster when you are presenting. 

Many of us keep presentations open for a long time, which might cause the data in the presentation to become outdated.  

To make sure your slides contain the data you need, we added a new notification that tells you when more up-to-date data exists in Power BI and offers you the option to refresh and get the latest data from Power BI. 

Developers 

Direct Lake semantic models are now supported in Fabric Git Integration , enabling streamlined version control, enhanced collaboration among developers, and the establishment of CI/CD pipelines for your semantic models using Direct Lake. 


Learn more about version control, testing, and deployment of Power BI content in our Power BI implementation planning documentation: https://learn.microsoft.com/power-bi/guidance/powerbi-implementation-planning-content-lifecycle-management-overview  

Visualizations 

Editor’s pick of the quarter:

  • Animator for Power BI 
  • Innofalls Charts 
  • SuperTables 
  • Sankey Diagram for Power BI by ChartExpo 
  • Dynamic KPI Card by Sereviso 
  • Shielded HTML Viewer 
  • Text search slicer 

New visuals in AppSource 

  • Mapa Polski – Województwa, Powiaty, Gminy 
  • Workstream 
  • Income Statement Table 
  • Gas Detection Chart 
  • Seasonality Chart 
  • PlanIn BI – Data Refresh Service 
  • Chart Flare 
  • PictoBar 
  • ProgBar 
  • Counter Calendar 
  • Donut Chart image 

Financial Reporting Matrix by Profitbase 

Making financial statements with a proper layout has just become easier with the latest version of the Financial Reporting Matrix. 

Users are now able to specify which rows should be classified as cost rows, which makes it easier to get the conditional formatting of variances correct: 


Selecting a row and ticking “is cost” will tag the row as a cost. This can be used in conditional formatting to make sure that a positive variance on an expense row is bad for the result, while a positive variance on an income row is good for the result. 

The new version also includes more flexibility in measure placement and column subtotals. 

Measures can be placed either: 

  • Default (below column headers) 
  • Above column headers 


You can also: 

  • Conditionally hide columns 
  • + much more 

Highlighted new features:  

  • Measure placement – In rows  
  • Select Column Subtotals  
  • New Format Pane design 
  • Row Options  

Get the visual from AppSource and find more videos here ! 

Horizon Chart by Powerviz  

A Horizon Chart is an advanced visual for time-series data, revealing trends and anomalies. It displays stacked data layers, allowing users to compare multiple categories while maintaining data clarity. Horizon Charts are particularly useful for monitoring and analyzing complex data over time, making this a valuable visual for data analysis and decision-making. 

Key Features:  

  • Horizon Styles: Choose Natural, Linear, or Step with adjustable scaling. 
  • Layer: Layer data by range or custom criteria. Display positive and negative values together or separately on top. 
  • Reference Line : Highlight patterns with X-axis lines and labels. 
  • Colors: Apply 30+ color palettes and use FX rules for dynamic coloring. 
  • Ranking: Filter Top/Bottom N values, with “Others”. 
  • Gridline: Add gridlines to the X and Y axis.  
  • Custom Tooltip: Add highest, lowest, mean, and median points without additional DAX. 
  • Themes: Save designs and share seamlessly with JSON files. 

Other features included are ranking, annotation, grid view, show condition, and accessibility support.  

Business Use Cases: Time-Series Data Comparison, Environmental Monitoring, Anomaly Detection 

🔗 Try Horizon Chart for FREE from AppSource  

📊 Check out all features of the visual: Demo file  

📃 Step-by-step instructions: Documentation  

💡 YouTube Video: Video Link  

📍 Learn more about visuals: https://powerviz.ai/  

✅ Follow Powerviz : https://lnkd.in/gN_9Sa6U  


Exciting news! Thanks to your valuable feedback, we’ve enhanced our Milestone Trend Analysis Chart even further. We’re thrilled to announce that you can now switch between horizontal and vertical orientations, catering to your preferred visualization style.

The Milestone Trend Analysis (MTA) Chart remains your go-to tool for swiftly identifying deadline trends, empowering you to take timely corrective actions. With this update, we aim to enhance deadline awareness among project participants and stakeholders alike. 


In our latest version, you can seamlessly navigate between horizontal and vertical views within the familiar Power BI interface. No need to adapt to a new user interface – enjoy the same ease of use with added flexibility. Plus, the visual benefits from supported features like themes, interactive selection, and tooltips. 

What’s more, ours is the only Microsoft Certified Milestone Trend Analysis Chart for Power BI, ensuring reliability and compatibility with the platform. 

Ready to experience the enhanced Milestone Trend Analysis Chart? Download it from AppSource today and explore its capabilities with your own data – try for free!  

We welcome any questions or feedback at our website: https://visuals.novasilva.com/ . Try it out and elevate your project management insights now! 

Sunburst Chart by Powerviz  

Powerviz’s Sunburst Chart is an interactive tool for hierarchical data visualization. With this chart, you can easily visualize multiple columns in a hierarchy and uncover valuable insights. The concentric circle design helps in displaying part-to-whole relationships. 

  • Arc Customization: Customize shapes and patterns. 
  • Color Scheme: Accessible palettes with 30+ options. 
  • Centre Circle: Design an inner circle with layers. Add text, measure, icons, and images. 
  • Conditional Formatting: Easily identify outliers based on measure or category rules. 
  • Labels: Smart data labels for readability. 
  • Image Labels: Add an image as an outer label. 
  • Interactivity: Zoom, drill down, cross-filtering, and tooltip features. 

Other features included are annotation, grid view, show condition, and accessibility support.  

Business Use Cases:   

  • Sales and Marketing: Market share analysis and customer segmentation. 
  • Finance : Department budgets and expenditures distribution. 
  • Operations : Supply chain management. 
  • Education : Course structure, curriculum creation. 
  • Human Resources : Organization structure, employee demographics.

🔗 Try Sunburst Chart for FREE from AppSource  


Stacked Bar Chart with Line by JTA  

Clustered bar chart with the possibility to stack one of the bars  

Stacked Bar Chart with Line by JTA seamlessly merges the simplicity of a traditional bar chart with the versatility of a stacked bar, revolutionizing the way you showcase multiple datasets in a single, cohesive display. 

Unlocking a new dimension of insight, our visual features a dynamic line that provides a snapshot of data trends at a glance. Navigate through your data effortlessly with multiple configurations, gaining a swift and comprehensive understanding of your information. 

Tailor your visual experience with an array of functionalities and customization options, enabling you to effortlessly compare a primary metric with the performance of an entire set. The flexibility to customize the visual according to your unique preferences empowers you to harness the full potential of your data. 

Features of Stacked Bar Chart with Line:  

  • Stack the second bar 
  • Format the Axis and Gridlines 
  • Add a legend 
  • Format the colors and text 
  • Add a line chart 
  • Format the line 
  • Add marks to the line 
  • Format the labels for bars and line 

If you liked what you saw, you can try it for yourself and find more information here. Also, if you want to download it, you can find the visual package on AppSource. 


We have added an exciting new feature to our Combo PRO, Combo Bar PRO, and Timeline PRO visuals – Legend field support . The Legend field makes it easy to visually split series values into smaller segments, without the need to use measures or create separate series. Simply add a column with category names that are adjacent to the series values, and the visual will do the following:  

  • Display separate segments as a stack or cluster, showing how each segment contributed to the total Series value. 
  • Create legend items for each segment to quickly show/hide them without filtering.  
  • Apply custom fill colors to each segment.  
  • Show each segment value in the tooltip 

Read more about the Legend field on our blog article  

Drill Down Combo PRO is made for creators who want to build visually stunning and user-friendly reports. Cross-chart filtering and intuitive drill down interactions make data exploration easy and fun for any user. Furthermore, you can choose between three chart types – columns, lines, or areas; and feature up to 25 different series in the same visual and configure each series independently.  

📊 Get Drill Down Combo PRO on AppSource  

🌐 Visit Drill Down Combo PRO product page  

Documentation | ZoomCharts Website | Follow ZoomCharts on LinkedIn  

We are thrilled to announce that Fabric Core REST APIs are now generally available! This marks a significant milestone in the evolution of Microsoft Fabric, a platform that has been meticulously designed to empower developers and businesses alike with a comprehensive suite of tools and services. 

The Core REST APIs are the backbone of Microsoft Fabric, providing the essential building blocks for a myriad of functionalities within the platform. They are designed to improve efficiency, reduce manual effort, increase accuracy, and lead to faster processing times. These APIs help you scale operations more easily and efficiently as the volume of work grows, automate repeatable processes with consistency, and enable integration with other systems and applications, providing a streamlined and efficient data pipeline. 

The Microsoft Fabric Core APIs encompass a range of functionalities, including: 

  • Workspace management: APIs to manage workspaces, including permissions.  
  • Item management: APIs for creating, reading, updating, and deleting items, with partial support for data source discovery and granular permissions management planned for the near future. 
  • Job and tenant management: APIs to manage jobs, tenants, and users within the platform. 

These APIs adhere to industry standards and best practices, ensuring a unified developer experience that is both coherent and easy to use. 

For developers looking to dive into the details of the Microsoft Fabric Core APIs, comprehensive documentation is available. This includes guidelines on API usage, examples, and articles managed in a centralized repository for ease of access and discoverability. The documentation is continuously updated to reflect the latest features and improvements, ensuring that developers have the most current information at their fingertips. See Microsoft Fabric REST API documentation  
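As a quick illustration of what working with the Core APIs looks like, here is a minimal sketch that lists workspaces and creates an item; it assumes an Azure AD token scoped to the Fabric API, and the item type and names are placeholders rather than a prescription.

```python
# Minimal sketch: call the Fabric Core REST APIs to list workspaces and create an item.
# Assumes an Azure AD token for https://api.fabric.microsoft.com (e.g. via azure-identity);
# IDs, names, and the chosen item type below are placeholders.
import requests

TOKEN = "<aad-access-token>"
BASE = "https://api.fabric.microsoft.com/v1"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Workspace management: list the workspaces the caller can access.
workspaces = requests.get(f"{BASE}/workspaces", headers=headers).json().get("value", [])
for ws in workspaces:
    print(ws["id"], ws["displayName"])

# Item management: create a new item (here, a lakehouse) in a chosen workspace.
workspace_id = workspaces[0]["id"]
resp = requests.post(
    f"{BASE}/workspaces/{workspace_id}/items",
    headers=headers,
    json={"displayName": "demo_lakehouse", "type": "Lakehouse"},
)
print(resp.status_code, resp.json())  # creation may complete synchronously or asynchronously
```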

We’re excited to share an important update we made to the Fabric Admin APIs. This enhancement is designed to simplify your automation experience. Now, you can manage both Power BI and the new Fabric items (previously referred to as artifacts) using the same set of APIs. Before this enhancement, you had to navigate using two different APIs—one for Power BI items and another for new Fabric items. That’s no longer the case. 

The APIs we’ve updated include GetItem, ListItems, GetItemAccessDetails, and GetAccessEntities. These enhancements mean you can now query and manage all your items through a single API call, regardless of whether they’re Fabric types or Power BI types. We hope this update makes your work more straightforward and helps you accomplish your tasks more efficiently. 
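To illustrate that unified surface, here is a minimal sketch that pages through the admin ListItems endpoint; the exact path and response property names are assumptions based on the Fabric Admin API reference, and a tenant admin token is required.

```python
# Minimal sketch: list all items (Fabric and Power BI types alike) through the unified
# Fabric Admin ListItems API. Path and response shape are assumptions drawn from the
# public Admin API docs; a tenant-admin access token is required.
import requests

TOKEN = "<aad-admin-access-token>"
url = "https://api.fabric.microsoft.com/v1/admin/items"
headers = {"Authorization": f"Bearer {TOKEN}"}

items, continuation = [], None
while True:
    params = {"continuationToken": continuation} if continuation else {}
    page = requests.get(url, headers=headers, params=params).json()
    # The list property name may differ between API versions; fall back defensively.
    items.extend(page.get("itemEntities", page.get("value", [])))
    continuation = page.get("continuationToken")
    if not continuation:
        break

for item in items:
    print(item.get("type"), item.get("displayName"), item.get("workspaceId"))
```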

We’re thrilled to announce the public preview of the Microsoft Fabric workload development kit. This feature now extends to additional workloads and offers a robust developer toolkit for designing, developing, and interoperating with Microsoft Fabric using frontend SDKs and backend REST APIs. Introducing the Microsoft Fabric Workload Development Kit . 

The Microsoft Fabric platform now provides a mechanism for ISVs and developers to integrate their new and existing applications natively into Fabric’s workload hub. This integration provides the ability to add net new capabilities to Fabric in a consistent experience without leaving their Fabric workspace, thereby accelerating data driven outcomes from Microsoft Fabric. 


By downloading and leveraging the development kit , ISVs and software developers can build and scale existing and new applications on Microsoft Fabric and offer them via the Azure Marketplace without the need to ever leave the Fabric environment. 

The development kit provides a comprehensive guide and sample code for creating custom item types that can be added to the Fabric workspace. These item types can leverage the Fabric frontend SDKs and backend REST APIs to interact with other Fabric capabilities, such as data ingestion, transformation, orchestration, visualization, and collaboration. You can also embed your own data application into the Fabric item editor using the Fabric native experience components, such as the header, toolbar, navigation pane, and status bar. This way, you can offer a consistent and seamless user experience across different Fabric workloads. 

This is a call to action for ISVs, software developers, and system integrators. Let’s leverage this opportunity to create more integrated and seamless experiences for our users. 


We’re excited about this journey and look forward to seeing the innovative workloads from our developer community. 

We are proud to announce the public preview of external data sharing. Sharing data across organizations has become a standard part of day-to-day business for many of our customers. External data sharing, built on top of OneLake shortcuts, enables seamless, in-place sharing of data, allowing you to maintain a single copy of data even when sharing data across tenant boundaries. Whether you’re sharing data with customers, manufacturers, suppliers, consultants, or partners; the applications are endless. 

How external data sharing works  

Sharing data across tenants is as simple as any other share operation in Fabric. To share data, navigate to the item to be shared, click on the context menu, and then click on External data share . Select the folder or table you want to share and click Save and continue . Enter the email address and an optional message and then click Send . 


The data consumer will receive an email containing a share link. They can click on the link to accept the share and access the data within their own tenant. 


Click here for more details about external data sharing . 

Following the release of OneLake data access roles in public preview, the OneLake team is excited to announce the availability of APIs for managing data access roles. These APIs can be used to programmatically manage granular data access for your lakehouses. Manage all aspects of role management such as creating new roles, editing existing ones, or changing memberships in a programmatic way.  
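As a rough sketch of what those role-management calls look like, the snippet below lists the data access roles defined on a lakehouse; the endpoint path is an assumption based on the preview documentation, and the GUIDs are placeholders.

```python
# Minimal sketch: list OneLake data access roles for a lakehouse programmatically.
# The endpoint path is an assumption based on the public preview documentation;
# WORKSPACE_ID and ITEM_ID are placeholders for your workspace and lakehouse GUIDs.
import requests

TOKEN = "<aad-access-token>"
WORKSPACE_ID = "<workspace-guid>"
ITEM_ID = "<lakehouse-guid>"

url = (
    "https://api.fabric.microsoft.com/v1"
    f"/workspaces/{WORKSPACE_ID}/items/{ITEM_ID}/dataAccessRoles"
)
resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()

for role in resp.json().get("value", []):
    print(role.get("name"), "-", role.get("decisionRules"))
```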

Do you have data stored on-premises or behind a firewall that you want to access and analyze with Microsoft Fabric? With OneLake shortcuts, you can bring on-premises or network-restricted data into OneLake, without any data movement or duplication. Simply install the Fabric on-premises data gateway and create a shortcut to your S3 compatible, Amazon S3, or Google Cloud Storage data source. Then use any of Fabric’s powerful analytics engines and OneLake open APIs to explore, transform, and visualize your data in the cloud. 

Try it out today and unlock the full potential of your data with OneLake shortcuts! 


Data Warehouse 

We are excited to announce Copilot for Data Warehouse in public preview! Copilot for Data Warehouse is an AI assistant that helps developers generate insights through T-SQL exploratory analysis. Copilot is contextualized to your warehouse’s schema. With this feature, data engineers and data analysts can use Copilot to: 

  • Generate T-SQL queries for data analysis.  
  • Explain and add in-line code comments for existing T-SQL queries. 
  • Fix broken T-SQL code. 
  • Receive answers regarding general data warehousing tasks and operations. 

There are 3 areas where Copilot is surfaced in the Data Warehouse SQL Query Editor: 

  • Code completions when writing a T-SQL query. 
  • Chat panel to interact with the Copilot in natural language. 
  • Quick action buttons to fix and explain T-SQL queries. 

Learn more about Copilot for Data Warehouse: aka.ms/data-warehouse-copilot-docs. Copilot for Data Warehouse is currently only available in the Warehouse. Copilot in the SQL analytics endpoint is coming soon. 

Unlocking Insights through Time: Time travel in Data warehouse (public preview)

As data volumes continue to grow in today’s rapidly evolving world of Artificial Intelligence, it is crucial to reflect on historical data. It empowers businesses to derive valuable insights that aid in making well-informed decisions for the future. Preserving multiple historical data versions not only incurred significant costs but also presented challenges in upholding data integrity, resulting in a notable impact on query performance. So, we are thrilled to announce the ability to query the historical data through time travel at the T-SQL statement level which helps unlock the evolution of data over time. 

The Fabric warehouse retains historical versions of tables for seven calendar days. This retention allows for querying the tables as if they existed at any point within the retention timeframe. A time travel clause can be included in any top-level SELECT statement. For complex queries that involve multiple tables, joins, stored procedures, or views, the timestamp is applied just once for the entire query instead of specifying the same timestamp for each table within the same query. This ensures the entire query is executed with reference to the specified timestamp, maintaining the data’s uniformity and integrity throughout the query execution. 

From historical trend analysis and forecasting to compliance management, stable reporting and real-time decision support, the benefits of time travel extend across multiple business operations. Embrace the capability of time travel to navigate the data-driven landscape and gain a competitive edge in today’s fast-paced world of Artificial Intelligence. 
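As a minimal sketch of the syntax, the snippet below runs a time travel query from Python against a warehouse’s SQL connection string; the server, table, and timestamp are placeholders, and the OPTION (FOR TIMESTAMP AS OF ...) hint follows the documented T-SQL form.

```python
# Minimal sketch: run a time travel query against a Fabric warehouse with pyodbc.
# The server name is a placeholder (copy the SQL connection string from the warehouse
# settings); the table name and timestamp are illustrative only.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-warehouse-sql-connection-string>;"   # from warehouse settings
    "Database=<your-warehouse>;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

# The OPTION (FOR TIMESTAMP AS OF ...) hint applies one UTC timestamp to the whole
# statement, even when it spans multiple tables, joins, or views.
sql = """
SELECT COUNT(*) AS order_count
FROM dbo.FactOrders
OPTION (FOR TIMESTAMP AS OF '2024-05-01T00:00:00');
"""

with conn.cursor() as cur:
    cur.execute(sql)
    print(cur.fetchone().order_count)
```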

We are excited to announce not one but two new enhancements to the Copy Into feature for Fabric Warehouse: Copy Into with Entra ID Authentication and Copy Into for Firewall-Enabled Storage!

Entra ID Authentication  

When authenticating storage accounts in your environment, the executing user’s Entra ID will now be used by default. This ensures that you can leverage access control lists (ACLs) and role-based access control (RBAC) to authenticate to your storage accounts when using Copy Into. Currently, only organizational accounts are supported.  

How to Use Entra ID Authentication  

  • Ensure your Entra ID organizational account has access to the underlying storage and can execute the Copy Into statement on your Fabric Warehouse.  
  • Run your Copy Into statement without specifying any credentials; the Entra ID organizational account will be used as the default authentication mechanism, as in the sketch below.  
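A minimal sketch of such a statement, executed from Python, is shown below; the storage URL, table, and connection details are placeholders, and the key point is simply that no CREDENTIAL clause is supplied.

```python
# Minimal sketch: run COPY INTO against a Fabric warehouse without specifying any
# credentials, so the executing user's Entra ID is used by default. The storage URL,
# table, and connection details are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-warehouse-sql-connection-string>;"
    "Database=<your-warehouse>;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)
conn.autocommit = True

copy_sql = """
COPY INTO dbo.Sales
FROM 'https://<storageaccount>.dfs.core.windows.net/<container>/sales/*.parquet'
WITH (FILE_TYPE = 'PARQUET');   -- no CREDENTIAL clause: Entra ID auth is the default
"""

with conn.cursor() as cur:
    cur.execute(copy_sql)
    print("Rows loaded:", cur.rowcount)
```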

Copy into firewall-enabled storage

The Copy Into for firewall-enabled storage leverages the trusted workspace access functionality ( Trusted workspace access in Microsoft Fabric (preview) – Microsoft Fabric | Microsoft Learn ) to establish a secure and seamless connection between Fabric and your storage accounts. Secure access can be enabled for both blob and ADLS Gen2 storage accounts. Secure access with Copy Into is available for warehouses in workspaces with Fabric Capacities (F64 or higher).  

To learn more about Copy into , please refer to COPY INTO (Transact-SQL) – Azure Synapse Analytics and Microsoft Fabric | Microsoft Learn  

We are excited to announce the launch of our new feature, Just in Time Database Attachment, which will significantly enhance your first experience, such as when connecting to a Data warehouse or SQL endpoint or simply opening an item. These actions trigger the workspace resource assignment process, during which, among other actions, we attach all the necessary metadata of your items (Data warehouses and SQL endpoints), which can be a long process, particularly for workspaces that have a high number of items.  

This feature is designed to attach your desired database during the activation process of your workspace, allowing you to execute queries immediately and avoid unnecessary delays. All other databases will be attached asynchronously in the background while you are able to execute queries, ensuring a smooth and efficient experience. 

Data Engineering 

We are advancing Fabric Runtime 1.3 from an Experimental Public Preview to a full Public Preview. Our Apache Spark-based big data execution engine, optimized for both data engineering and science workflows, has been updated and fully integrated into the Fabric platform. 

The enhancements in Fabric Runtime 1.3 include the incorporation of Delta Lake 3.1, compatibility with Python 3.11, support for Starter Pools, integration with Environment and library management capabilities. Additionally, Fabric Runtime now enriches the data science experience by supporting the R language and integrating Copilot. 


We are pleased to share that the Native Execution Engine for Fabric Runtime 1.2 is currently available in public preview. The Native Execution Engine can greatly enhance the performance of your Spark jobs and queries. The engine is written in C++, operates in columnar mode, and uses vectorized processing. The Native Execution Engine offers superior query performance – encompassing data processing, ETL, data science, and interactive queries – all directly on your data lake. Overall, Fabric Spark delivers a 4x speed-up on the sum of execution times of all 99 queries in the TPC-DS 1TB benchmark when compared against Apache Spark. This engine is fully compatible with Apache Spark™ APIs (including the Spark SQL API). 

It is seamless to use with no code changes – activate it and go. Enable it in your environment for your notebooks and your Spark Job Definitions (SJDs). 


This feature is in public preview; at this stage of the preview, there is no additional cost associated with using it. 

We are excited to announce the Spark Monitoring Run Series Analysis features, which allow you to analyze the run duration trend and performance comparison for Pipeline Spark activity recurring run instances and repetitive Spark run activities from the same Notebook or Spark Job Definition.   

  • Run Series Comparison: Users can compare the duration of a Notebook run with that of previous runs and evaluate the input and output data to understand the reasons behind prolonged run durations.  
  • Outlier Detection and Analysis: The system can detect outliers in the run series and analyze them to pinpoint potential contributing factors. 
  • Detailed Run Instance Analysis: Clicking on a specific run instance provides detailed information on time distribution, which can be used to identify performance enhancement opportunities. 
  • Configuration Insights : Users can view the Spark configuration used for each run, including auto-tuned configurations for Spark SQL queries in auto-tune enabled Notebook runs. 

You can access the new feature from the item’s recent runs panel and Spark application monitoring page. 


We are excited to announce that Notebook now supports the ability to tag others in comments, just like the familiar functionality of using Office products!   

When you select a section of code in a cell, you can add a comment with your insights and tag one or more teammates to collaborate or brainstorm on the specifics. This intuitive enhancement is designed to amplify collaboration in your daily development work. 

Moreover, when you tag someone who doesn’t have permission, you can easily configure their permissions, making sure your code assets are well managed. 


We are thrilled to unveil a significant enhancement to the Fabric notebook ribbon, designed to elevate your data science and engineering workflows. 


In the new version, you will find the new Session connect control on the Home tab, and now you can start a standard session without needing to run a code cell. 


You can also easily spin up a high concurrency session and share the session across multiple notebooks to improve compute resource utilization, and you can attach to or leave a high concurrency session with a single click. 


The “View session information” option navigates you to the session information dialog, where you can find a lot of useful detail as well as configure the session timeout. The diagnostics info is especially helpful when you need support for notebook issues. 


Now you can easily access the powerful “Data Wrangler” on the Home tab of the new ribbon! You can explore your data with Data Wrangler’s low-code experience; both pandas DataFrames and Spark DataFrames are supported.   


We recently made some changes to the Fabric notebook metadata to ensure compliance and consistency: 

Notebook file content: 

  • The keyword “trident” has been replaced with “dependencies” in the notebook content. This adjustment ensures consistency and compliance. 

Notebook Git format: 

  • The preface of the notebook has been modified from “# Synapse Analytics notebook source” to “# Fabric notebook source”. 
  • Additionally, the keyword “synapse” has been updated to “dependencies” in the Git repo. 

The above changes will be marked as ‘uncommitted’ one time if your workspace is connected to Git. No action is needed for these changes, and there won’t be any breaking scenario within the Fabric platform. If you have any further questions or feedback, feel free to share it with us. 

We are thrilled to announce that the environment is now a generally available item in Microsoft Fabric. During this GA timeframe, we have shipped a few new features of Environment. 

  • Git support  


The environment is now Git supported. You can check the environment into your Git repo and manipulate the environment locally with its YAML representation and custom library files. After syncing the changes from local back to the Fabric portal, you can publish them manually or through the REST API. 

  • Deployment pipeline  


Deploying environments from one workspace to another is supported.  Now, you can deploy the code items and their dependent environments together from development to test and even production. 

With the REST APIs, you can have a code-first experience with the same abilities as the Fabric portal. We provide a set of powerful APIs to help you manage your environments efficiently. You can create new environments, update libraries and Spark compute, publish changes, delete an environment, attach the environment to a notebook, and more – all from the tools of your choice. The article “Best practice of managing environments with REST API” can help you get started with several real-world scenarios.  

  • Resources folder   


Resources folder enables managing small resources in the development cycle. The files uploaded in the environment can be accessed from notebooks once they’re attached to the same environment. The manipulation of the files and folders of resources happens in real-time. It could be super powerful, especially when you are collaborating with others. 


Sharing your environment with others is also available. We provide several sharing options. By default, the view permission is shared. If you want the recipient to have access to view and use the contents of the environment, sharing without permission customization is the best option. Furthermore, you can grant editing permission to allow recipients to update this environment or grant share permission to allow recipients to reshare this environment with their existing permissions. 

We are excited to announce REST API support for the Fabric Data Engineering/Science workspace settings. The Data Engineering/Science settings allow users to create and manage their Spark compute, select the default runtime and default environment, and enable or disable high concurrency mode and ML autologging.  


Now, with REST API support for the Data Engineering/Science settings, you are able to:  

  • Choose the default pool for a Fabric workspace 
  • Configure the max nodes for Starter pools 
  • Create, update, and delete custom pools, including their autoscale and dynamic allocation properties 
  • Select the default runtime and the default environment for the Fabric workspace 
  • Enable or disable High Concurrency mode 
  • Enable or disable ML autologging 

Learn more about the Workspace Spark Settings API in our API documentation Workspace Settings – REST API (Spark) | Microsoft Learn  
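As a rough sketch, the snippet below reads the workspace Spark settings and switches ML autologging on; the endpoint path and property names are assumptions to be checked against the API documentation linked above.

```python
# Minimal sketch: read and update a workspace's Data Engineering/Science Spark settings
# through the REST API. The path and property names are assumptions based on the
# Workspace Settings (Spark) API reference; verify them against the documentation.
import requests

TOKEN = "<aad-access-token>"
WORKSPACE_ID = "<workspace-guid>"
BASE = f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/spark/settings"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Inspect the current workspace Spark settings (default pool, runtime, environment, ...).
current = requests.get(BASE, headers=headers).json()
print(current)

# Example update: turn on ML autologging for the workspace.
patch = {"automaticLog": {"enabled": True}}   # assumed property name
resp = requests.patch(BASE, headers=headers, json=patch)
print(resp.status_code)
```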

We are excited to give you a sneak peek at the preview of User Data Functions in Microsoft Fabric. User Data Functions give developers and data engineers the ability to easily write and run applications that integrate with resources in the Fabric platform. Data engineering often presents challenges with data quality or complex data analytics processing in data pipelines, and ETL tools may offer limited flexibility and customization. This is where User Data Functions can be used to run data transformation tasks and perform complex business logic by connecting to your data sources and other workloads in Fabric.  

During preview, you will be able to use the following features:  

  • Use the Fabric portal to create new User Data Functions, view and test them.  
  • Write your functions using C#.   
  • Use the Visual Studio Code extension to create and edit your functions.  
  • Connect to the following Fabric-native data sources: Data Warehouse, Lakehouse and Mirrored Databases.   

You can now create a fully managed GraphQL API in Fabric to interact with your data in a simple, flexible, and powerful way. We’re excited to announce the public preview of API for GraphQL, a data access layer that allows us to query multiple data sources quickly and efficiently in Fabric by leveraging a widely adopted and familiar API technology that returns more data with fewer client requests. With the new API for GraphQL in Fabric, data engineers and scientists can create data APIs to connect to different data sources, use the APIs in their workflows, or share the API endpoints with app development teams to speed up and streamline data analytics application development in your business. 

You can get started with the API for GraphQL in Fabric by creating an API, attaching a supported data source, and then selecting the specific data sets you want to expose through the API. Fabric builds the GraphQL schema automatically based on your data; you can test and prototype queries directly in our graphical in-browser GraphQL development environment (API editor), and applications are ready to connect in minutes. 

Currently, the following supported data sources can be exposed through the Fabric API for GraphQL: 

  • Microsoft Fabric Data Warehouse 
  • Microsoft Fabric Lakehouse via SQL Analytics Endpoint 
  • Microsoft Fabric Mirrored Databases via SQL Analytics Endpoint 

Click here to learn more about how to get started. 
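Once your API is created, calling it is plain GraphQL over HTTPS. The sketch below is illustrative only: the endpoint URL is a placeholder you copy from the API item in the Fabric portal, and the query and field names depend entirely on the schema Fabric generated from your data.

```python
# Minimal sketch: query a Fabric API for GraphQL endpoint from Python.
# ENDPOINT is a placeholder copied from the API item; the query and field names
# below are illustrative and must match the schema generated from your data.
import requests

TOKEN = "<aad-access-token>"
ENDPOINT = "<graphql-endpoint-url-copied-from-the-api-item>"

query = """
query {
  customers(first: 10) {
    items {
      CustomerID
      CustomerName
    }
  }
}
"""

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    json={"query": query},
)
resp.raise_for_status()
print(resp.json()["data"])
```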


Data Science 

As you may know, Copilot in Microsoft Fabric requires your tenant administrator to enable the feature from the admin portal. Starting May 20th, 2024, Copilot in Microsoft Fabric will be enabled by default for all tenants. This update is part of our continuous efforts to enhance user experience and productivity within Microsoft Fabric. This new default activation means that AI features like Copilot will be automatically enabled for tenants who have not yet enabled the setting.  

We are introducing a new capability to enable Copilot at the capacity level in Fabric. A new option is being introduced in the tenant admin portal to delegate the enablement of AI and Copilot features to capacity administrators. This AI and Copilot setting will be automatically delegated to capacity administrators, and tenant administrators won’t be able to turn off the delegation.   

We also have a cross-geo setting for customers who want to use Copilot and AI features while their capacity is in a different geographic region than the EU data boundary or the US. By default, the cross-geo setting will stay off and will not be delegated to capacity administrators automatically.  Tenant administrators can choose whether to delegate this to capacity administrators or not. 


Figure 1.  Copilot in Microsoft Fabric will be auto enabled and auto delegated to capacity administrators. 


Capacity administrators will see the “Copilot and Azure OpenAI Service (preview)” settings under Capacity settings/ Fabric Capacity / <Capacity name> / Delegated tenant settings. By default, the capacity setting will inherit tenant level settings. Capacity administrators can decide whether to override the tenant administrator’s selection. This means that even if Copilot is not enabled on a tenant level, a capacity administrator can choose to enable Copilot for their capacity. With this level of control, we make it easier to control which Fabric workspaces can utilize AI features like Copilot in Microsoft Fabric. 


To enhance privacy and trust, we’ve updated our approach to abuse monitoring: previously, we retained data from Copilot in Fabric, including prompt inputs and outputs, for up to 30 days to check for misuse. Following customer feedback, we’ve eliminated this 30-day retention. Now, we no longer store prompt related data, demonstrating our unwavering commitment to your privacy and security. We value your input and take your concerns seriously. 

Real-Time Intelligence 

This month includes the announcement of Real-Time Intelligence, the next evolution of Real-Time Analytics and Data Activator. With Real-Time Intelligence, Fabric extends to the world of streaming and high-granularity data, enabling all users in your organization to collect, analyze, and act on this data in a timely manner, making faster and more informed business decisions. Read the full announcement from Build 2024. 

Real-Time Intelligence includes a wide range of capabilities across ingestion, processing, analysis, transformation, visualization and taking action. All of this is supported by the Real-Time hub, the central place to discover and manage streaming data and start all related tasks.  

Read on for more information on each capability and stay tuned for a series of blogs describing the features in more detail. All features are in Public Preview unless otherwise specified. Feedback on any of the features can be submitted at https://aka.ms/rtiidea    

Ingest & Process  

  • Introducing the Real-Time hub 
  • Get Events with new sources of streaming and event data 
  • Source from Real-Time Hub in Enhanced Eventstream  
  • Use Real-Time hub to Get Data in KQL Database in Eventhouse 
  • Get data from Real-Time Hub within Reflexes 
  • Eventstream Edit and Live modes 
  • Default and derived streams 
  • Route data streams based on content 

Analyze & Transform  

  • Eventhouse GA 
  • Eventhouse OneLake availability GA 
  • Create a database shortcut to another KQL Database 
  • Support for AI Anomaly Detector  
  • Copilot for Real-Time Intelligence 
  • Tenant-level private endpoints for Eventhouse 

Visualize & Act  

  • Visualize data with Real-Time Dashboards  
  • New experience for data exploration 
  • Create triggers from Real-Time Hub 
  • Set alert on Real-time Dashboards 
  • Taking action through Fabric Items 

Ingest & Process 

Real-Time hub is the single place for all data-in-motion across your entire organization. Several key features are offered in Real-Time hub: 

1. Single place for data-in-motion for the entire organization  

Real-Time hub enables users to easily discover, ingest, manage, and consume data-in-motion from a wide variety of sources. It lists all the streams and KQL tables that customers can directly act on. 

2. Real-Time hub is never empty  

All data streams in Fabric automatically show up in the hub. Also, users can subscribe to events in Fabric gaining insights into the health and performance of their data ecosystem. 

3. Numerous connectors to simplify data ingestion from anywhere to Real-Time hub  

Real-Time hub makes it easy for you to ingest data into Fabric from a wide variety of sources like AWS Kinesis, Kafka clusters, Microsoft streaming sources, sample data and Fabric events using the Get Events experience.  

There are 3 tabs in the hub:  

  • Data streams: This tab contains all streams that are actively running in Fabric that the user has access to. This includes all streams from Eventstreams and all tables from KQL Databases. 
  • Microsoft sources: This tab contains Microsoft sources (that the user has access to) that can be connected to Fabric. 
  • Fabric events : Fabric now has event-driven capabilities to support real-time notifications and data processing. Users can monitor and react to events including Fabric Workspace Item events and Azure Blob Storage events. These events can be used to trigger other actions or workflows, such as invoking a data pipeline or sending a notification via email. Users can also send these events to other destinations via Event Streams. 

Learn More  

You can now connect to data from both inside and outside of Fabric in a mere few steps.  Whether data is coming from new or existing sources, streams, or available events, the Get Events experience allows users to connect to a wide range of sources directly from Real-Time hub, Eventstreams, Eventhouse and Data Activator.  

This enhanced capability allows you to easily connect external data streams to Fabric with an out-of-the-box experience, giving you more options and helping you get real-time insights from various sources. This includes Camel Kafka connectors powered by Kafka Connect to access popular data platforms, as well as Debezium connectors for fetching Change Data Capture (CDC) streams. 

Using Get Events, bring streaming data from Microsoft sources directly into Fabric with a first-class experience. Connectivity to notification sources and discrete events is also included; this enables access to notification events from Azure and other cloud solutions, including AWS and GCP. The full set of sources currently supported are: 

  • Microsoft sources : Azure Event Hubs, Azure IoT hub 
  • External sources : Google Cloud Pub/Sub, Amazon Kinesis Data Streams, Confluent Cloud Kafka 
  • Change data capture databases : Azure SQL DB (CDC), PostgreSQL DB (CDC), Azure Cosmos DB (CDC), MySQL DB (CDC)  
  • Fabric events : Fabric Workspace Item events, Azure Blob Storage events  


Learn More   

With enhanced Eventstream, you can now stream data not only from Microsoft sources but also from other platforms like Google Cloud, Amazon Kinesis, Database change data capture streams, etc. using our new messaging connectors. The new Eventstream also lets you acquire and route real-time data not only from stream sources but also from discrete event sources, such as: Azure Blob Storage events, Fabric Workspace Item events. 

To use these new sources in Eventstream, simply create an eventstream and choose “Enhanced Capabilities (preview)”. 


You will see the new Eventstream homepage that gives you some choices to begin with. By clicking on the “Add external source”, you will find these sources in the Get events wizard that helps you to set up the source in a few steps. After you add the source to your eventstream, you can publish it to stream the data into your eventstream.  

You can use Eventstream with discrete event sources to turn events into streams for further analysis, and send the streams to different Fabric data destinations, like Lakehouse and KQL Database. To convert events into a stream, click Edit on the ribbon, select “Stream events” on the source node, and publish your eventstream. After the events are converted, a default stream will appear in Real-Time hub. 

To transform the stream data or route it to different Fabric destinations based on its content, you can click Edit on the ribbon and enter Edit mode. There you can add event processing operators and destinations. 

With Real-Time hub embedded in KQL Database experience, each user in the tenant can view and add streams which they have access to and directly ingest it to a KQL Database table in Eventhouse.  

This integration provides each user in the tenant with the ability to access and view the data streams they are permitted to see, and to directly ingest these streams into a KQL Database table in Eventhouse. This simplifies the data discovery and ingestion process by allowing users to directly interact with the streams. Users can filter streams based on Owner, Parent, and Location, and the hub provides additional information such as Endorsement and Sensitivity. 

You can access this by clicking on the Get Data button from the Database ribbon in Eventhouse. 


This will open the Get Data wizard with Real-Time hub embedded. 


You can use events from Real-Time hub directly in reflex items as well. From within the main reflex UI, click ‘Get data’ in the toolbar: 


This will open a wizard that allows you to connect to new event sources or browse Real-Time Hub to use existing streams or system events. 

Search new stream sources to connect to or select existing streams and tables to be ingested directly by Reflex. 


You then have access to the full reflex modeling experience to build properties and triggers over any events from Real-Time hub.  

Eventstream offers two distinct modes, Edit and Live, to provide flexibility and control over the development process of your eventstream. If you create a new Eventstream with Enhanced Capabilities enabled, you can modify it in an Edit mode. Here, you can design stream processing operations for your data streams using a no-code editor. Once you complete the editing, you can publish your Eventstream and visualize how it starts streaming and processing data in Live mode .   


In Edit mode, you can:   

  • Make changes to an Eventstream without implementing them until you publish the Eventstream. This gives you full control over the development process.  
  • Avoid test data being streamed to your Eventstream. This mode is designed to provide a secure environment for testing without affecting your actual data streams. 

For Live mode, you can :  

  • Visualize how your Eventstream streams, transforms, and routes your data streams to various destinations after publishing the changes.  
  • Pause the flow of data on selected sources and destinations, providing you with more control over your data streams being streamed into your Eventstream.  

When you create a new Eventstream with Enhanced Capabilities enabled, you can now create and manage multiple data streams within Eventstream, which can then be displayed in the Real-Time hub for others to consume and perform further analysis.  

There are two types of streams:   

  • Default stream : Automatically generated when a streaming source is added to Eventstream. Default stream captures raw event data directly from the source, ready for transformation or analysis.  
  • Derived stream : A specialized stream that users can create as a destination within Eventstream. Derived stream can be created after a series of operations such as filtering and aggregating, and then it’s ready for further consumption or analysis by other users in the organization through the Real-Time Hub.  

The following example shows that when creating a new Eventstream a default stream alex-es1-stream is automatically generated. Subsequently, a derived stream dstream1 is added after an Aggregate operation within the Eventstream. Both default and derived streams can be found in the Real-Time hub.  


Customers can now perform stream operations directly within Eventstream’s Edit mode, instead of embedding them in a destination. This enhancement allows you to design stream processing logic and route data streams on the top-level canvas. Custom processing and routing can be applied to individual destinations using built-in operations, allowing routing to distinct destinations within the Eventstream based on different stream content. 

These operations include:  

  • Aggregate : Perform calculations such as SUM, AVG, MIN, and MAX on a column of values and return a single result. 
  • Expand : Expand array values and create new rows for each element within the array.  
  • Filter : Select or filter specific rows from the data stream based on a condition. 
  • Group by : Aggregate event data within a certain time window, with the option to group one or more columns.  
  • Manage Fields : Customize your data streams by adding, removing, or changing data type of a column.  
  • Union : Merge two or more data streams with shared fields (same name and data type) into a unified data stream.  

Analyze & Transform 

Eventhouse, a cutting-edge database workspace meticulously crafted to manage and store event-based data, is now officially available for general use. Optimized for high granularity, velocity, and low latency streaming data, it incorporates indexing and partitioning for structured, semi-structured, and free text data. With Eventhouse, users can perform high-performance analysis of big data and real-time data querying, processing billions of events within seconds. The platform allows users to organize data into compartments (databases) within one logical item, facilitating efficient data management.  

Additionally, Eventhouse enables the sharing of compute and cache resources across databases, maximizing resource utilization. It also supports high-performance queries across databases and allows users to apply common policies seamlessly. Eventhouse offers content-based routing to multiple databases, full view lineage, and high granularity permission control, ensuring data security and compliance. Moreover, it provides a simple migration path from Azure Synapse Data Explorer and Azure Data Explorer, making adoption seamless for existing users. 


Engineered to handle data in motion, Eventhouse seamlessly integrates indexing and partitioning into its storing process, accommodating various data formats. This sophisticated design empowers high-performance analysis with minimal latency, facilitating lightning-fast ingestion and querying within seconds. Eventhouse is purpose-built to deliver exceptional performance and efficiency for managing event-based data across diverse applications and industries. Its intuitive features and seamless integration with existing Azure services make it an ideal choice for organizations looking to leverage real-time analytics for actionable insights. Whether it’s analyzing telemetry and log data, time series and IoT data, or financial records, Eventhouse provides the tools and capabilities needed to unlock the full potential of event-based data. 

We’re excited to announce that OneLake availability of Eventhouse in Delta Lake format is Generally Available. 

Delta Lake  is the unified data lake table format chosen to achieve seamless data access across all compute engines in Microsoft Fabric. 

The data streamed into Eventhouse is stored in an optimized columnar storage format with full text indexing and supports complex analytical queries at low latency on structured, semi-structured, and free text data. 

Enabling data availability of Eventhouse in OneLake means that customers can enjoy the best of both worlds: they can query the data with high performance and low latency in their  Eventhouse and query the same data in Delta Lake format via any other Fabric engines such as Power BI Direct Lake mode, Warehouse, Lakehouse, Notebooks, and more. 

To learn more, please visit https://learn.microsoft.com/en-gb/fabric/real-time-analytics/one-logical-copy 
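As a rough sketch of the notebook side, assuming you have created a OneLake shortcut to the Eventhouse table’s Delta folder in the Tables section of a lakehouse attached to your notebook, you could read the same data with Spark like this (names are placeholders):

```python
# Minimal sketch (Fabric Spark notebook): read the Delta Lake copy of an Eventhouse table.
# Assumes a OneLake shortcut named "MyEventhouseTable" exists in the Tables section of the
# lakehouse attached to this notebook and points at the table's OneLake Delta folder.
df = spark.read.format("delta").load("Tables/MyEventhouseTable")

df.printSchema()
display(df.limit(10))   # display() is available in Fabric notebooks
```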

A database shortcut in Eventhouse is an embedded reference to a source database. The source database can be one of the following: 

  • (Now Available) A KQL Database in Real-Time Intelligence  
  • An Azure Data Explorer database  

The behavior exhibited by the database shortcut is similar to that of a follower database. 

The owner of the source database, the data provider, shares the database with the creator of the shortcut in Real-Time Intelligence, the data consumer. The owner and the creator can be the same person. The database shortcut is attached in read-only mode, making it possible to view and run queries on the data that was ingested into the source KQL Database without ingesting it.  

This helps with data sharing scenarios where you can share data in-place either within teams, or even with external customers.  

AI Anomaly Detector is an Azure service for high-quality detection of multivariate and univariate anomalies in time series. While the standalone version is being retired in October 2026, Microsoft has open sourced the anomaly detection core algorithms, and they are now supported in Microsoft Fabric. Users can leverage these capabilities in the Data Science and Real-Time Intelligence workloads. AI Anomaly Detector models can be trained in Spark Python notebooks in the Data Science workload, while real-time scoring can be done by KQL with inline Python in Real-Time Intelligence. 

We are excited to announce the Public Preview of Copilot for Real-Time Intelligence. This initial version includes a new capability that translates your natural language questions about your data to KQL queries that you can run and get insights.  

Your starting point is a KQL Queryset, that is connected to a KQL Database, or to a standalone Kusto database:  


Simply type the natural language question about what you want to accomplish, and Copilot will automatically translate it to a KQL query you can execute. This is extremely powerful for users who may be less familiar with writing KQL queries but still want to get the most from their time-series data stored in Eventhouse. 


Stay tuned for more capabilities from Copilot for Real-Time Intelligence!   

Customers can increase their network security by limiting access to Eventhouse at a tenant-level, from one or more virtual networks (VNets) via private links. This will prevent unauthorized access from public networks and only permit data plane operations from specific VNets.  

Visualize & Act 

Real-Time Dashboards have a user-friendly interface, allowing users to quickly explore and analyze their data without the need for extensive technical knowledge. They offer a high refresh frequency, support a range of customization options, and are designed to handle big data.  

A range of visual types is supported, and each can be customized with the dashboard’s user-friendly interface. 

You can also define conditional formatting rules to format the visual data points by their values using colors, tags, and icons. Conditional formatting can be applied to a specific set of cells in a predetermined column or to entire rows, and lets you easily identify interesting data points. 

Beyond the supported visuals, Real-Time Dashboards provide several capabilities that let you interact with your data through slice-and-dice operations for deeper analysis and different viewpoints. 

  • Parameters are used as building blocks for dashboard filters and can be added to queries to filter the data presented by visuals. Parameters can be used to slice and dice dashboard visuals either directly by selecting parameter values in the filter bar or by using cross-filters. 
  • Cross filters allow you to select a value in one visual and filter all other visuals on that dashboard based on the selected data point. 
  • Drillthrough capability allows you to select a value in a visual and use it to filter the visuals in a target page in the same dashboard. When the target page opens, the value is pushed to the relevant filters.    

Real-Time Dashboards can be shared broadly and allow multiple stakeholders to view dynamic, real time, fresh data while easily interacting with it to gain desired insights. 

Directly from a real-time dashboard, users can refine their exploration using a user-friendly, form-like interface. This intuitive and dynamic experience is tailored for insights explorers craving insights based on real-time data. Add filters, create aggregations, and switch visualization types without writing queries to easily uncover insights.  

With this new feature, insights explorers are no longer bound by the limitations of pre-defined dashboards. As independent explorers, they have the freedom for ad-hoc exploration, leveraging existing tiles to kickstart their journey. Moreover, they can selectively remove query segments, and expand their view of the data landscape.  

Dive deep, extract meaningful insights, and chart actionable paths forward, all with ease and efficiency, and without having to write complex KQL queries.  

Data Activator allows you to monitor streams of data for various conditions and set up actions to be taken in response. These triggers are available directly within the Real-Time hub and in other workloads in Fabric. When the condition is detected, an action will automatically be kicked off such as sending alerts via email or Teams or starting jobs in Fabric items.  

When you browse the Real-Time Hub, you’ll see options to set triggers in the detail pages for streams. 

Selecting this will open a side panel where you can configure the events you want to monitor, the conditions you want to look for in the events, and the action you want to take while in the Real-Time hub experience. 

Completing this pane creates a new reflex item with a trigger that monitors the selected events and condition for you. Reflexes need to be created in a workspace supported by a Fabric or Power BI Premium capacity – this can be a trial capacity so you can get started with it today! 

Data Activator has been able to monitor Power BI report data since it was launched, and we now support monitoring of Real-Time Dashboard visuals in the same way.

From a real-time dashboard tile, you can click the ellipsis (…) button and select “Set alert”.

This opens the embedded trigger pane, where you can specify the conditions you are looking for. You can choose whether to send email or Teams messages as the alert when these conditions are met.

When creating a new reflex trigger, from the Real-Time hub or within the reflex item itself, you’ll notice a new ‘Run a Fabric item’ option in the Action section. This will create a trigger that starts a new Fabric job whenever its condition is met, kicking off a pipeline or notebook computation in response to Fabric events. A common scenario would be monitoring Azure Blob Storage events via the Real-Time hub and running data pipeline jobs when Blob Created events are detected. 

This capability is extremely powerful and moves Fabric from a schedule-driven platform to an event-driven platform. 

Pipelines, Spark jobs, and notebooks are just the first Fabric items we’ll support here, and we’re keen to hear your feedback to help prioritize what else we support. Please leave ideas and votes on https://aka.ms/rtiidea and let us know! 

Real-Time Intelligence, along with the Real-Time hub, revolutionizes what’s possible with real-time streaming and event data within Microsoft Fabric.  

Learn more and try it today https://aka.ms/realtimeintelligence   

Data Factory 

Dataflow Gen2 

We are thrilled to announce that the Power Query SDK is now generally available in Visual Studio Code! This marks a significant milestone in our commitment to providing developers with powerful tools to enhance data connectivity and transformation. 

The Power Query SDK is a set of tools that allow you as the developer to create new connectors for Power Query experiences available in products such as Power BI Desktop, Semantic Models, Power BI Datamarts, Power BI Dataflows, Fabric Dataflow Gen2 and more. 

This new SDK has been in public preview since November 2022, and we’ve been hard at work improving the experience, which goes beyond what the previous Power Query SDK in Visual Studio had to offer. 

The most recent of these improvements was the introduction of the Test Framework in March 2024, which solidifies the developer experience of creating a Power Query connector with the Power Query SDK in Visual Studio Code. 

The Power Query SDK extension for Visual Studio will be deprecated by June 30, 2024, so we encourage you to give the new Power Query SDK in Visual Studio Code a try today if you haven’t already. 

To get started with the Power Query SDK in Visual Studio Code, simply install it from the Visual Studio Code Marketplace. Our comprehensive documentation and tutorials are available to help you harness the full potential of your data. 

Join our vibrant community of developers to share insights, ask questions, and collaborate on exciting projects. Our dedicated support team is always ready to assist you with any queries. 

We look forward to seeing the innovative solutions you’ll create with the Power Query SDK in Visual Studio Code. Happy coding! 

Introducing a convenient enhancement to the Dataflows Gen2 Refresh History experience! Now, alongside the familiar “X” button in the Refresh History screen, you’ll find a shiny new Refresh Button. This small but mighty addition lets you refresh the status of your dataflow’s refresh history without the hassle of exiting the Refresh History screen and reopening it. Simply click the Refresh Button, and voilà! Your dataflow’s refresh history status is updated, keeping you in the loop with minimal effort. Say goodbye to unnecessary clicks and hello to streamlined monitoring! 

  • [New] OneStream: The OneStream Power Query Connector enables you to seamlessly connect Data Factory to your OneStream applications by simply logging in with your OneStream credentials. The connector uses your OneStream security, allowing you to access only the data you have permission to see within the OneStream application. Use the connector to pull cube and relational data along with metadata members, including all their properties. Visit OneStream Power BI Connector to learn more. Find this connector in the Other category. 

Data workflows  

We are excited to announce the preview of ‘Data workflows’, a new feature within Data Factory that revolutionizes the way you build and manage your code-based data pipelines. Powered by Apache Airflow, Data workflows offer a seamless authoring, scheduling, and monitoring experience for Python-based data processes defined as Directed Acyclic Graphs (DAGs). This feature brings a SaaS-like experience to running DAGs in a fully managed Apache Airflow environment, with support for autoscaling, auto-pause, and rapid cluster resumption to enhance cost-efficiency and performance. 

It also includes native cloud-based authoring capabilities and comprehensive support for Apache Airflow plugins and libraries. 
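
For reference, a DAG authored for Data workflows is standard Apache Airflow Python. The sketch below is a minimal, hedged example assuming an Airflow 2.4+ environment; the DAG id, schedule, and task logic are illustrative rather than a Fabric-provided template.

```python
# Minimal Apache Airflow DAG sketch of the kind you could author in Data workflows.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extract data from the source system")


def transform():
    print("transform and validate the extracted data")


with DAG(
    dag_id="daily_elt_example",          # illustrative name
    start_date=datetime(2024, 5, 1),
    schedule="@daily",                   # Airflow 2.4+ style schedule argument
    catchup=False,
    tags=["example"],
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task       # run transform after extract
```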

To begin using this feature: 

1. Enable the feature in the Microsoft Fabric Admin Portal: navigate to Tenant Settings and, under Microsoft Fabric, locate and expand the ‘Users can create and use Data workflows (preview)’ setting. Note: This step is necessary only during the preview phase of Data workflows. 

2. Create a new Data workflow within an existing or new workspace. 

3. Add a new Directed Acyclic Graph (DAG) file via the user interface. 

4.  Save your DAG(s). 

5. Use Apache Airflow monitoring tools to observe your DAG executions. In the ribbon, click on Monitor in Apache Airflow. 

For additional information, please consult the product documentation. If you’re not already using Fabric capacity, consider signing up for the Microsoft Fabric free trial to evaluate this feature. 

Data Pipelines 

We are excited to announce a new feature in Fabric that enables you to create data pipelines to access your firewall-enabled Azure Data Lake Storage Gen2 (ADLS Gen2) accounts. This feature leverages the workspace identity to establish a secure and seamless connection between Fabric and your storage accounts. 

With trusted workspace access, you can create data pipelines to your storage accounts with just a few clicks. Then you can copy data into Fabric Lakehouse and start analyzing your data with Spark, SQL, and Power BI. Trusted workspace access is available for workspaces in Fabric capacities (F64 or higher). It supports organizational accounts or service principal authentication for storage accounts. 
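
Once the pipeline has landed the data in a Lakehouse, Spark is one quick way to start exploring it from a Fabric notebook. The snippet below is a hedged sketch; the Lakehouse, table, and column names are assumptions used for illustration.

```python
# Illustrative follow-up analysis after the copy: query the Lakehouse table
# with Spark SQL in a Fabric notebook. Object names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM SalesLakehouse.raw_sales
    WHERE order_date >= '2024-01-01'
    GROUP BY region
    ORDER BY total_amount DESC
""").show()
```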

How to use trusted workspace access in data pipelines  

Create a workspace identity for your Fabric workspace. You can follow the guidelines provided in Workspace identity in Fabric. 

Configure resource instance rules for the storage account that you want to access from your Fabric workspace. Resource instance rules for Fabric workspaces can only be created through ARM templates; follow the guidelines for configuring resource instance rules for Fabric workspaces here. 

Create a data pipeline to copy data from the firewall-enabled ADLS Gen2 account to a Fabric Lakehouse. 

To learn more about how to use trusted workspace access in data pipelines, please refer to Trusted workspace access in Fabric. 

We hope you enjoy this new feature for your data integration and analytics scenarios. Please share your feedback and suggestions with us by leaving a comment here. 

Introducing Blob Storage Event Triggers for Data Pipelines 

A very common use case among data pipeline users in a cloud analytics solution is to trigger your pipeline when a file arrives or is deleted. We have introduced Azure Blob storage event triggers as a public preview feature in Fabric Data Factory Data Pipelines. This utilizes the Fabric Reflex alerts capability that also leverages Event Streams in Fabric to create event subscriptions to your Azure storage accounts. 

Parent/Child pipeline pattern monitoring improvements

Today, in Fabric Data Factory data pipelines, when you call another pipeline using the Invoke Pipeline activity, the child pipeline is not visible in the monitoring view. We have updated the Invoke Pipeline activity so that you can view your child pipeline runs. This requires an upgrade to any existing Fabric pipelines that use the current Invoke Pipeline activity; you will be prompted to upgrade when you edit your pipeline and then provide a connection to your workspace to authenticate. Another new feature that lights up with this update is the ability to invoke pipelines across workspaces in Fabric. 

We are excited to announce the availability of the Fabric Spark job definition activity for data pipelines. With this new activity, you will be able to run a Fabric Spark Job definition directly in your pipeline. Detailed monitoring capabilities of your Spark Job definition will be coming soon!  

To learn more about this activity, read https://aka.ms/SparkJobDefinitionActivity  

We are excited to announce the availability of the Azure HDInsight activity for data pipelines. The Azure HDInsight activity allows you to execute Hive queries, invoke a MapReduce program, execute Pig queries, execute a Spark program, or run a Hadoop Streaming program. Any of these five operations can be invoked from a single Azure HDInsight activity, and you can run it against your own or an on-demand HDInsight cluster. 

To learn more about this activity, read https://aka.ms/HDInsightsActivity  

We are thrilled to share the new Modern Get Data experience in Data Pipeline, which empowers users to intuitively and efficiently discover the right data, connection information, and credentials. 

In the data destination, users can easily set the destination by creating a new Fabric item, choosing another destination, or selecting an existing Fabric item from the OneLake data hub. 

In the source tab of the Copy activity, users can conveniently choose recently used connections from the dropdown or create a new connection using the “More” option to open the Modern Get Data experience. 

