Replicate Data Into Azure Database for MySQL

In today’s post I’ll talk about replicating data in Azure Database for MySQL. Data in replication allows you to synchronize data from MySQL Server running on prem, virtual machines or database services hosted by other cloud providers into the Azure Database for MySQL.

The data in replication is based on the binary (BIN) log file position-based native to MySQL. So, this is the same as if you were running it on-prem and running the BIN log replication for an enterprise-class database service.

The information in the BIN log is stored in different formats according to the database changes that are being recorded. Then sleeves are configured to read the binary log from the master and to execute the events in the BIN log on the sleeve’s local database. You would write the log on the primary server and then the sleeve knows where the primary is and pulls over that information and executes it on the secondary database.

Use cases

A best use case for data in replication is when you’re using a hybrid data solution. With the data in replication, you can keep data synchronized between your on premises servers and your database for MySQL, thus getting a cloud based secondary solution as a fall over replication, disaster recovery and business continuity.

When you want to have an application that is part cloud based and part local, the synchronization is useful for creating those hybrid applications. It’s also appealing when you have an existing local database server but want to move data to a region closer to end users if you’re geo-located.

Another common use case is multi-cloud synchronization, so for complex cloud solutions you can use the data in replication to synchronize data between Azure Database for MySQL and different cloud providers, including virtual machines and database servers hosted in the cloud. So, this is a great option for another layer of redundancy for those cloud deployments.

A few things to keep in mind:

    • The MySQL system database is not replicated, so if you make changes to accounts and permissions on the primary, they will not be replicated. These changes must be done manually on the replica server.
    • The server version must be 5.6 and later.
    • Primary and replica versions must match.
    • Each cable in the database must have a primary key for consistency.
    • Global transaction identifiers are not supported.

So, some things to consider but still a great option for the scalability, redundancy and business continuity.

Overview of Azure Reserve VM Instances

We’re all looking for ways to save money within our Azure subscriptions and resources. How does a savings of up to 72% sound? Today I’d like to give you an Overview of  Azure Reserve Virtual Machine Instances, a payment option which allows you to get that savings off the standard pay as you go plan by pre-committing to a 1 or 3-year term for the compute of virtual machine usage.

If you know you’re going to use Azure virtual machines for an extended period for your cloud workloads, then this is worth looking at. Just keep in mind that this only covers the virtual machine compute; the networking, other software, Azure services or storage, as well as Windows and SQL Server licensing does not get applied to the reserve.

Although, people who have purchased on-prem licensing for their servers can use their Azure hybrid benefit which allows you to bring your own on-prem Windows and SQL licenses to Azure. If you’re currently using an enterprise agreement or pay as you go plan, if you choose to go with Azure Reserve VM Instances, your cost would be reduced against your enterprise agreement or the credit card that you use for your pay as you go plan would be billed according to what you’re using.

When you purchase your Reserve Instances, it’s instantaneous; you just go in and specify your machine type and the term (1 or 3 years). It will detect those machine types in your current subscriptions or if you’re adding new machine types, it will apply that savings to those machine types.

So, if you know you’re going to use a particular machine type for the next year, say for migration, you’ll experience a good savings by pre-committing up front. And the scope of the Reserved Instance can go across multiple subscriptions and apply the discount to each of them.

Gotchas

A couple things to note; first, when the term expires, it does not auto renew and your discount ends. You can renew your contract and choose your hardware that you need; you’re not stuck using the same hardware you originally specified. And second, Reserved Instances cannot be used for enterprise dev test subscriptions or virtual machines in Preview.

Overview of Azure Elastic Database Jobs Service

Today I’ll give an overview of Microsoft’s newly released (in preview) Elastic Database Jobs service. This is considered as a fully hosted Azure service, whereas the previous iteration was a custom hosted and managed version available on SQL DB and SQL DW within Azure.

It’s similar in capability to an on prem SQL Server Agent, but it can reach across multiple servers, subscriptions and regions. SQL Agent is limited to just the instance on the server for the database that you’re managing. This gives you a much wider range across all your different Azure services.

Other benefits and features:

  • Significant capability added that can enable automation and execution of T-SQL jobs with PowerShell, REST API or T-SQL APIs against a group of databases.
  • Can be used for a wide variety of maintenance tasks, such as rebuilding indexes, schema changes, collecting query results and performance monitoring. Think of it in terms of a developer who’s managing many databases across multiple subscriptions to support multiple lines of business or web applications with the same database schema and they want to make a change to it.
  • The capability to maintain a larger number of databases with similar operations and it allows management for whatever databases you specify and that will ensure an optimum customer experience. You’ll also ensure maximum efficiency to maintain your databases without having to set up specific jobs on each of those servers, and to tap into them and make changes more efficiently during off hours and scale up/down when you need to. Plus, you can change that schema across all those databases with a simple interface.
  • Schedule administrative tasks that otherwise would have to be manually done.
  • Allows for some small schema changes, credential management, performance database, or even telemetry collection if you want insight into what people are doing on the databases.
  • Build indexes off hours.
  • Collect query results from multiple databases for central performance management, so you can collect this info into one place, then render info into a portal like Power BI.

Basically, it reduces management maintenance overhead with its ability to go across subscriptions. Normally, you’d have to have that job run on a specific server; but now within Azure, where you are running managed databases, you can run operations across those databases without having to set up separate jobs.

So, a cool feature – it’s now only in preview so it’s sure to grow and I’m excited about the direction.

 

Overview of Azure Operations Management Suite (OMS)

In this post I’d like to give an overview of what Azure Operations Management Suite is and what it can be used for. First, Operations Management Suite, or OMS, is a collection of management services designed for Azure cloud. As new services are added to Azure, more capabilities are being built into OMS to allow for integration.

OMS allows you to collect things in one central place like the many Azure services that need deeper insight and manageability, all from one portal, as well as being able to set up different groups and different ways of viewing your data. OMS can also be used with on prem resources with Window and Linux Agent, so you can collect logs or backup your servers or files to Azure, for example.

The key Operations Management Suite services are:

  • Log analytics allows you to monitor and analyze the availability and performance of different resources including physical and virtual machines, Azure Data Factory and other Azure services.
  • Proactive alerting for when an issue or problem in your environment is detected, so you can either take corrective action or have a preprogrammed corrective action.
  • Ability automate manual processes and enforce configuration for physical and virtual machines, like automating clean-up operations you do on servers for instance. You can do this through Runbooks which are based on PowerShell scripts or PowerShell workloads where you can programmatically do what you need to do within the OMS.
  • Integrate backups so the agent and integration allow for backing up a service, a file level; whatever you need to do for critical data and run those stores, whether they are on-prem or cloud-based resources.
  • Azure Site Recovery runs through OMS and helps you provide high availability for apps and servers that you’re running.
  • Orchestrate running your replication up into Azure. This allows you to do it from physical servers, Hyper Vs or VMware servers using Windows or Linux.

Mainly, it provides management solutions. These are prepackaged sets of templates provided by Microsoft and/or partners that help implement multiple OMS services at one time. One example is the Update Management Solution which creates a log search, dashboard and alerting inside log analytics, but at the same time creates an automation runbook for installing updates on the server. This will tell you when updates are available, when they’re needed and then let you automate the install of those updates.

There is a lot of power and capability that comes with the Operations Management Suite. It’s a great centralized management solution within Azure that is quick to configure and start using.

 

Key Terminology in Azure Databricks

In a previous post, I talked about Azure Databricks and what it is. In review, Azure Databricks is a managed platform for running Apache Spark jobs. As it’s managed, that means you don’t have to worry about managing the cluster or running performance maintenance to use Spark, like you would if you were going to deploy a full HDInsight Spark cluster.

Databricks provides a simple to operate user interface for data scientist and analysts when building models, as well as a powerful API that allows for some automation. You also can run role-based access control with Active Directory for better user integration at a more granular scale. You don’t have to tear down an HDInsight cluster to use Spark jobs as you can pause (or start) your resources on demand and scale up/out as needed.

In this post, I’ll run through some key Databricks terms to give you an overview of the different points you’ll use when running Databricks jobs:

    • Workspace – This is the central place that will allow you to organize all the work that’s being done. You can think of it as a ‘folder’ structure where you can save Notebooks and Libraries that you want to operate on and manipulate data with, and then share them securely with other users. Workspace is not meant for storing data; data should be stored in the data storage.
    • Notebooks – This is a set of any number of cells that allow you to execute commands with a programming language, such as Scala, Python, R or SQL; you can specify the language when you open a cell at the top of the Notebook. Here you can also create a dashboard that allows the output of the code to be shared rather than the code itself, and they can be scheduled as jobs for running pipelines, updating models or dashboards.
    • Libraries – These are packages or modules that provide additional functionality for developing various models for different types of analysis. Like a traditional IDE environment like Visual Studio where you have libraries you can plug in and add.
    • Tables – This is where the structured data is stored that you and your team will use for analysis. They can live in cloud storage or in the cluster that’s being used or store them in memory for faster processing of the data.
    • Clusters – Essentially a group of compute resources being used for operations like executing the code from Notebooks or Libraries. You can also pull in data from raw sources like cloud or structured/semi structured data or the data in the tables I mentioned above. Clusters can be controlled via access policies using Active Directory integration.
    • Jobs – Jobs are a tool that’s used to schedule execution within a cluster. These can be scripts using Python or JAR assemblies and you can create manual triggers that will send the jobs off or run them through a REST API.
    • Apps – Think of these as the third-party components that can tap into your Databricks cluster. A good scenario is visualizing the data with apps like Tableau or Power BI. You can consume the modules that you built and the output of the Notebooks or script that you ran to visualize that data.

Azure Data Factory V2 in GA and New Features

Today I’m excited to talk about the general availability of Azure Data Factory V2, as well as some new features that have been added over the last couple months. If you don’t know, Azure Data Factory Version 2 added some new features that V1 didn’t have.

With ADF V2 you get a browser-based interface using drag and drop technology; V1 was primarily done in the Visual Studio IDE. It also added triggers for scheduling, so you can schedule your jobs when required and in additional ways (which I’ll discuss further in a bit).

Some other features of ADF V2 that came out as it became generally available:

  • Lift and Shift operations for your SSIS packages, so if you have SSIS packages local, you can now Lift and Shift those into compute with the integration runtime service in Data Factory.
  • This also allows for cloud to cloud, cloud to prem, prem to prem and some third-party tools are supported within that as well.
  • Control flow activities like link branching, looping, conditional execution and parameterization.
  • Integration with HD Spark and Databricks for big data workloads and data science.

Some features that have come out more recently:

  • Integration with Key Vault, which gives you the ability to encrypt keys and small secrets like passwords used for keys. You can create a Linked Service to a Key Vault and reference those needed passwords rather than having to store those in search or text files or a PowerShell script and have those open. So, you can use Key Vault to reference back and run workloads without having to expose those passwords.
  • The ability to monitor Data Factory using OMS, Microsoft’s cloud-based management solution that helps you manage and protect your on-prem and cloud infrastructure. This is quick and easy to set up and allows you to reach in to different types of applications in Azure and give you additional visibility and control for things like log analytics, automation, data protection and recovery, as well as security and compliance.
  • You can monitor the overall health of your Data Factories and be able to drill in, see the details and troubleshoot if you’re having problems. This is all enable through Azure Analytics, so you turn on your Azure Analytics and Data Factory, then hook those into your OMS suite and you can monitor it as that central management point.
  • Event based triggering with integration through Data Factory. Now you have event driven architecture where you have a common data integration pattern that involves production. Instead of having to schedule a timed trigger, you can monitor a blob creation or deletion, add that file into there and you can trigger your pipeline based on that.

Azure Data Factory V2 is a neat technology and I’m interested to see where it goes as I’m sure that more features will be coming. If you have questions about Azure Data Factory or any of the new Azure resources, we are the people to talk with. We’re doing a lot of work with our clients using Azure tools and we’d love to talk to you about how we can get you using Azure in your organization.

Business Continuity Strategies in Azure

Keeping businesses online and operational is a key concern, no matter the nature of your downtime. Most companies don’t focus on business continuity until it’s too late or have incomplete, untested barebones recovery plans. High Availability, Disaster Recovery and Backup are all critical to a complete business continuity solution. In a recent webinar, Senior Principal Architect Chris Seferlis discussed how leveraging Azure for disaster recovery and business continuity is the most effective way to ensure you’re protected.

If your business’s data is in the cloud, there is nothing is more pivotal than your cloud backup, recovery and migration procedures. Only 18% of decision makers feel fully prepared to recover their data center in the event of a site failure or disaster. The issues are out-of-date recovery plans and limited back-up and recovery testing.

Most disaster situations are caused by system failures, power failures, natural disasters and cyber-attacks. The challenges businesses face in disaster recovery are significant, including cost, complexity and reliability. To have a successful business continuity strategy, organizations must prioritize high availability, disaster recovery and data back-ups.

Disaster recovery is important; there is always a risk of failure with your data, including software bugs, hardware failure and human error. Important factors to consider are Recovery Time Objective (RTO); the targeted duration of time and a service level within which a business process must be restored after a disaster; and Recovery Point Objective (RPO), the maximum targeted period in which data might be lost from an IT service due to a major incident. Both RTO and RPO are business decisions.

Azure can protect against planned and unplanned events by distributing the placement of VMs across the infrastructure. Azure also helps with Disaster Recovery through consistent backup for Windows Azure VMs and file-system backup for Linux Azure VMs. Additionally, it provides efficient and reliable backups to the cloud with no infrastructure maintenance.

Watch the full webinar here to gain more insight into:

  • Azure Backup Server
  • VMWare VM Backup
  • Azure Site recovery
  • Disaster Recovery for Hyper-V
  • Azure Migration

Click here to view my slides from this presentation. If you’d like to learn more about business continuity using Azure or need help with any Azure project from discussions and planning to implementation, click the link below and talk to us today. We can help no matter where you are on your cloud journey.

Azure Blob Storage Lifecycle Management

When we talk about blob storage, we talk about the three different tiers – hot, cool or archive – for delegating the importance of data and how accessible it is. The challenge has been that when we picked the tier that was pretty much the end of story.

What we want is have our data accessible when and where we need it as it can take some time to pull from cool and archive tiers, as well as be costlier to retrieve. Also, with the more expensive hot tier, data can sit there unnecessarily, and we need a way to move it out after it becomes static or stale.

Here’s some good news! Microsoft recently introduced the public preview of Blob Storage Lifecycle Management. This now makes it easier to manage and automate that movement of data by offering a rule-based policy which you can use to transition your data to the best access tier, as well as expire data at the end of its lifecycle.

This great new toolset allows capability and flexibility to define rules for transitioning blobs to a cooler storage. You can also delete blobs by defining how long a blob should live there, define rules to be executed daily or apply rules to storage containers or subsets of blobs, thus allowing you to access certain blob containers and delete others that you specify based on how you’re moving that data around.

So, you can set up a scenario where data hasn’t been accessed in 3 months and it’s set to be transitioned from hot storage to cool, but then it sits there for 6 more months. You then want to be able to move that data off to archive. These are settings you can change based on the last modification date of the file.

You also can delete blob snapshots that have become stale after a defined period of time. Maybe you set it to delete after 120 days or maybe blobs that haven’t been accessed for a several year period—seven years being the magic number for audits and such.

Microsoft is great at listening to what users have to say and to keep evolving and adding more capability to the technology. If you love data, Azure and Azure Blob Storage as much as I do, let me know by sharing this video.

 

Overview of HDInsight R Server

Today I’ll wrap up my series on HDInsight with R Server. What R Server does is when you create an HDInsight cluster, you can select it as an option and it will provide data scientists, statisticians and R Programmers with on demand access to scalable and distributed methods of analytics on HDInsight.

Where it is open source, R allows you to leverage any of the 8,000+ open source packages. Because it falls in Microsoft’s big data analytics package, it includes the scale R routines. These routines provide things such as descriptive statistics, generalized linear models, logistic regression, classification and regression trees, as well as decision forests.

You can run an edge node outside of a cluster that provides a great place to connect on the cluster. You can also run your R scripts which gives the option of running parallel distributed functions. The models that are built can be downloaded for on prem use and can also be sent to Azure Machine Learning Studio for further processing and scoring.

So, why would you choose the Microsoft R Server over other options?

  • Microsoft is putting a lot behind AI and R Server and this big data offering as part of the HDInsight suite.
  • It provides an internally built set of algorithms and when you combine that with the open source community offerings, you create a bridge for cutting edge AI, machine and deep learning applications.
  • As with other Azure offerings, you’re getting a simplified, secure, highly scalable environment, so instead of wasting time building those clusters in-house, you can focus on the capabilities of the platform itself by quickly and easily spinning up a cluster.

Many of these topics have been discussed throughout this series about the capabilities of HDInsight and what each has to offer. Looking at R, some key features are:

    • R enabled for the R programming language with runtime infrastructure for script execution.
    • Also, Python enabled with runtime infrastructure for Python scripting.
    • Pre-trained models to help with visual analytics and text statement analysis that is ready to score the data you provide.
    • You can put the server into operations and deploy solutions as a web service very quickly; so you spin up your cluster, turn everything on, hook it into your domain, use your domain credentials and start training your models.
    • Remote web execution allows us to work from our work station and train models, rather than having to log directly into the server or use SSH or other means. It allows you to build your scripts locally and then execute them remotely, giving you more flexibility with the way you’re operating.

R Server fits within the Azure and HDInsight ecosystems, so you can use and easily integrate these technologies together, such as integrating with Azure Data Factory or Azure Data Bricks, etc.

Overview of HDInsight Interactive Query

Last week I began a series on HDInsight. Today I’m continuing that series with a focus on Interactive Query. Interactive Query leverages Hive which uses LLAP (Long Live and Process), also known as low latency analytical processing. This allows for interactivity with complex data warehouse-style queries on big data, that is stored in commodity storage, such as a blob or Data Lake Store.

This stand-alone cluster is separate from HDI Hadoop clusters; it only contains the Hive service. The LLAP replaces the direct interaction with the HDFS data node, allowing for caching, prefetching, some light query processing and access control. Heavier query processing workloads are still happening at the yarn container with text orchestration, and that helps with the overall execution.

Obviously, it’s much more efficient to be able to query the data interactively where the data is prepared, rather than needing to move the data from one storage location to another, as we normally would with data warehousing. It allows for faster insight and resiliency, as well as reduced effort and simplified architecture – less components meets more simplicity.

There are several ways to execute Hive queries from Interactive Query:

  • Power BI, so you can tap right into it with your Power BI reports
  • Zeppelin notebooks
  • Visual Studio
  • Ambari with Hive View
  • Beeline from head node or an empty edge node
  • ODBC

You can also leverage existing workloads, so if you’re running batch or ETL workloads using HDInsight, you can attach your Interactive Query cluster to an existing metastore and data storage without any additional overhead.

There may be a need to convert CSV or JSON files into ORC, Parquet or Avro field as they can be more efficient for Hadoop processing. But with Interactive Query, that need is either lessened or eliminated because they can load that data into memory. The queries now determine what is cached and what can just run quickly since it’s running in memory instead of running from a storage area.

It also uses the Enterprise Security Package and Azure Log Analytics. These two features get wrapped into more of a true enterprise offering and allows your users to use their simplified Active Directory domain log in. Users can connect using Interactive Query and run their workloads without having to have a separate set of credentials, plus you can monitor your nodes from the Log Analytics piece. This helps you bring that data into OMS for a top down view and an understanding of what the whole environment looks like.

Interactive Query offers some great opportunities to run things more efficiently and smaller workloads can be run very quickly.