PolyBase

PolyBase enables your SQL Server 2016 instance to process Transact-SQL queries that read data from Hadoop. The same query can also access relational tables in your SQL Server instance and join them with the Hadoop data. In SQL Server, an external data source defines the connection to Hadoop, and an external table exposes the data stored there.

PolyBase pushes some computations to the Hadoop node to optimize the overall query. However, PolyBase external access is not limited to Hadoop. Other unstructured non-relational tables are also supported, such as delimited text files.
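As a rough sketch of what sits behind such a connection, the T-SQL below defines an external data source, a file format, and an external table over pipe-delimited text files in Hadoop. The object names (MyHadoopCluster, PipeDelimitedText, dbo.SensorData_Ext), the columns, the host placeholder, and the directory path are all made up for illustration, and the ports shown are common defaults that may differ in your cluster.

```sql
-- Hypothetical external data source pointing at a Hadoop name node.
-- Replace <namenode_host> with your own address; the ports are assumptions.
CREATE EXTERNAL DATA SOURCE MyHadoopCluster
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://<namenode_host>:8020',
    RESOURCE_MANAGER_LOCATION = '<namenode_host>:8050'  -- needed for pushdown
);

-- Describes how the delimited text files are laid out.
CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE)
);

-- The external table holds only metadata; the rows stay in Hadoop.
CREATE EXTERNAL TABLE dbo.SensorData_Ext (
    SensorId  INT          NOT NULL,
    ReadingAt DATETIME2(0) NOT NULL,
    Speed     FLOAT        NOT NULL
)
WITH (
    LOCATION = '/demo/sensor_data/',   -- hypothetical HDFS directory
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = PipeDelimitedText
);
```

Once these objects exist, dbo.SensorData_Ext can be referenced in queries like any other table.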

PolyBase is a technology that accesses external data stored in Azure Blob storage, Hadoop, or Azure Data Lake Store via the Transact-SQL language.

For Azure SQL Data Warehouse, PolyBase is generally the fastest and most scalable way to load data.

PolyBase can read data from several file formats and data sources.

Azure SQL DW Loading via PolyBase

The data warehouse component of the Azure Synapse Analytics service (formerly Azure SQL Data Warehouse) is a relational big data store that uses a massively parallel processing (MPP) architecture. It takes advantage of the on-demand elastic scale of Azure compute and storage resources to load and process petabytes of data. With SQL Data Warehouse, you get quicker access to the critical information you need to make good business decisions.
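A load into the data warehouse is commonly written as a CREATE TABLE AS SELECT (CTAS) statement that reads from an external table. The sketch below assumes an external table named ext.FactSales has already been defined over files in Azure storage; that name, dbo.FactSales, and the CustomerKey distribution column are hypothetical.

```sql
-- Minimal CTAS sketch for Azure SQL Data Warehouse.
-- ext.FactSales is an assumed, already-defined external table.
CREATE TABLE dbo.FactSales
WITH (
    DISTRIBUTION = HASH(CustomerKey),   -- hypothetical distribution column
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM ext.FactSales;
```

The statement creates the target table and loads it in one step, with the read from external storage spread across the MPP compute nodes.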

Supported SQL products and services

PolyBase provides the same functionality for the following SQL products from Microsoft:

  • SQL Server 2016 and later versions (Windows only)

  • Analytics Platform System (formerly Parallel Data Warehouse)

  • Azure SQL Data Warehouse

Azure integration

Through PolyBase, T-SQL queries can also import and export data from Azure Blob Storage. Further, PolyBase enables Azure SQL Data Warehouse to import and export data from both Azure Data Lake Store and Azure Blob Storage.

Why use PolyBase?

In the past, it was more difficult to join your SQL Server data with external data. You had the following two unpleasant options:

  • Transfer half your data so that all your data was in one format or the other.

  • Query both sources of data, then write custom query logic to join and integrate the data at the client level.

PolyBase avoids those unpleasant options by using T-SQL to join the data.

To keep things simple, PolyBase does not require you to install additional software to your Hadoop environment. You query external data by using the same T-SQL syntax used to query a database table. The support actions implemented by PolyBase all happen transparently. The query author does not need any knowledge about Hadoop.
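To illustrate, the query below joins a local table with an external table using ordinary T-SQL. It is only a sketch: dbo.Customers is a hypothetical local table, and dbo.SensorData_Ext is a hypothetical external table assumed to be already defined over files in Hadoop.

```sql
-- Nothing in the query text reveals that SensorData_Ext lives in Hadoop.
SELECT c.CustomerName,
       AVG(s.Speed) AS AvgSpeed
FROM dbo.Customers AS c               -- hypothetical local table
JOIN dbo.SensorData_Ext AS s          -- hypothetical external table
    ON s.SensorId = c.SensorId
WHERE s.Speed > 35
GROUP BY c.CustomerName;
```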

PolyBase uses

PolyBase enables the following scenarios in SQL Server:

  • Query data stored in Hadoop from SQL Server or PDW. Users are storing data in cost-effective distributed and scalable systems, such as Hadoop. PolyBase makes it easy to query the data by using T-SQL.

  • Query data stored in Azure Blob Storage. Azure Blob Storage is a convenient place to store data for use by Azure services. PolyBase makes it easy to access the data by using T-SQL.

  • Import data from Hadoop, Azure Blob Storage, or Azure Data Lake Store. Leverage the speed of SQL Server's columnstore technology and analysis capabilities by importing data from Hadoop, Azure Blob Storage, or Azure Data Lake Store into relational tables. There is no need for a separate ETL or import tool.

  • Export data to Hadoop, Azure Blob Storage, or Azure Data Lake Store. Archive data to Hadoop, Azure Blob Storage, or Azure Data Lake Store to achieve cost-effective storage and keep it online for easy access.

  • Integrate with BI tools. Use PolyBase with Microsoft’s business intelligence and analysis stack, or use any third party tools that are compatible with SQL Server.
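The import and export scenarios above might look like the following in SQL Server. This is a sketch under assumptions: dbo.SensorData_Ext and dbo.SensorArchive_Ext are hypothetical external tables that already exist, dbo.SensorReadings is a hypothetical local table with matching columns, and the cutoff date is arbitrary.

```sql
-- Import: copy external rows into a new local table.
SELECT s.SensorId, s.ReadingAt, s.Speed
INTO dbo.FastSensorReadings
FROM dbo.SensorData_Ext AS s
WHERE s.Speed > 35;

-- Store the imported rows in columnstore format for analytics.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FastSensorReadings
    ON dbo.FastSensorReadings;

-- Export: writing to an external table must be enabled first.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

-- Archive old rows by inserting them into the external table.
INSERT INTO dbo.SensorArchive_Ext
SELECT SensorId, ReadingAt, Speed
FROM dbo.SensorReadings
WHERE ReadingAt < '2015-01-01';
```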

Performance

  • Push computation to Hadoop. The query optimizer makes a cost-based decision to push computation to Hadoop, if that will improve query performance. The query optimizer uses statistics on external tables to make the cost-based decision. Pushing computation creates MapReduce jobs and leverages Hadoop’s distributed computational resources.

  • Scale compute resources. To improve query performance, you can use SQL Server PolyBase scale-out groups. This enables parallel data transfer between SQL Server instances and Hadoop nodes, and it adds compute resources for operating on the external data.
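Both performance features can be influenced from T-SQL. The sketch below creates statistics on an external table so the optimizer has something to base its cost decision on, overrides that decision with a query hint, and adds an instance to a scale-out group. The table dbo.SensorData_Ext, the statistics name, and the head node placeholders are hypothetical, and 16450 is assumed here as the data movement service control port.

```sql
-- Give the optimizer statistics on the external column.
CREATE STATISTICS Stats_SensorData_Speed
    ON dbo.SensorData_Ext (Speed) WITH FULLSCAN;

-- Force pushdown for one query; use DISABLE EXTERNALPUSHDOWN to prevent it.
SELECT SensorId, COUNT(*) AS Readings
FROM dbo.SensorData_Ext
WHERE Speed > 35
GROUP BY SensorId
OPTION (FORCE EXTERNALPUSHDOWN);

-- Run on a compute node to join it to a scale-out group.
-- Host, port, and instance name are placeholders for your head node.
EXEC sp_polybase_join_group
    N'<head_node_host>', 16450, N'<head_node_instance>';
```

Pushdown only applies when the external data source specifies a resource manager location, and the PolyBase services on the compute node typically need a restart after joining a group.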

Connect PolyBase to the Azure Storage account

We need two pieces of information to connect PolyBase to the Azure Storage account:

  1. The URL of the storage account

  2. The private storage key
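These two values map onto a database scoped credential and an external data source. The following is a minimal sketch for SQL Server; the credential and data source names are hypothetical, and the angle-bracket placeholders stand for your own password, storage key, container, and account name.

```sql
-- A master key is required to protect the credential secret.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong_password>';

-- The private storage key goes in SECRET.
-- IDENTITY is not used for authentication here and can be any string.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user',
     SECRET = '<storage_account_key>';

-- The storage account URL goes in LOCATION.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<storage_account>.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
```

External tables can then reference AzureStorage as their DATA_SOURCE.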
