As business data is increasingly produced and consumed outside traditional cloud and data center boundaries, organizations need to rethink how their data is processed in a distributed footprint that spans multiple hybrid and multi-cloud environments and edge locations.
Business is becoming increasingly decentralized. Data is now being produced, processed and consumed all over the world, from remote POS systems and smartphones to connected vehicles and factory floors. This trend, along with the rise of the Internet of Things (IoT), a steady increase in the computing power of edge devices, and better network connectivity, is driving the emergence of the edge computing paradigm.
IDC predicts that in 2023 more than 50% of new IT infrastructure will be rolled out at the edge. And Gartner has predicted that by 2025, 75% of company data will be processed outside of a traditional data center or cloud.
Processing data closer to where it is produced and potentially consumed offers clear benefits, such as saving network costs and reducing latency to deliver a seamless experience. But if it is not used effectively, edge computing can also cause problems such as unplanned downtime, the inability to scale fast enough to meet demand, and vulnerabilities that are exploited by cyberattacks.
Stateful edge applications that capture, store, and consume data require a new data architecture that takes into account the availability, scalability, latency, and security needs of the applications. Organizations that have a geographically distributed infrastructure footprint at the core and the edge need to be aware of several key data design principles, as well as how to address the issues that are likely to arise.
Map the lifecycle of data
Data-driven organizations need to start by understanding the story of their data: where it is produced, what should be done with it, and where it will ultimately be consumed. Is the data produced at the edge or in an application running in the cloud? Should the data be kept for the long term, or should it be stored and forwarded quickly? Do you need to run heavy analytics on the data to train machine learning (ML) models, or perform fast real-time processing on it?
Think first of data flows and data stores. Edge locations have less computing power than the cloud and so may not be ideally suited for long-running analytics and AI/ML. At the same time, moving data from multiple edge locations to the cloud for processing leads to higher latency and network costs.
Very often, data is replicated between the cloud and edge locations, or between different edge locations. Common deployment topologies include:
- Hub and spoke, where data is generated and stored at the edges, with a central cloud cluster collecting data from there. This is common in retail environments and IoT use cases.
- Configuration, where data is stored in the cloud and read replicas are produced at one or more edge locations. Device configuration settings are common examples.
- Edge-to-edge, a common pattern, where data is replicated or partitioned synchronously or asynchronously within a layer. Typical of this pattern are vehicles moving between edge locations, roaming mobile users, and users moving between countries conducting financial transactions.
By knowing in advance what should be done with the collected data, organizations can deploy an optimal data infrastructure as the basis for stateful applications. It is also important to choose a database with flexible, built-in data replication capabilities that can support these topologies.
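To make the three topologies concrete, the sketch below models each one as a set of directed replication links between sites. It is only an illustration: the names `Topology`, `Link` and `replication_links` are hypothetical, and treating edge-to-edge as a full mesh is an assumption, since real deployments may replicate only between neighboring sites.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Topology(Enum):
    HUB_AND_SPOKE = "hub_and_spoke"   # edges produce data, cloud collects it
    CONFIGURATION = "configuration"   # cloud owns data, edges hold read replicas
    EDGE_TO_EDGE = "edge_to_edge"     # edges replicate among themselves


@dataclass(frozen=True)
class Link:
    """A directed replication link: data flows from source to target."""
    source: str
    target: str


def replication_links(topology: Topology, cloud: str, edges: List[str]) -> List[Link]:
    """Return the replication links implied by a topology (hypothetical helper)."""
    if topology is Topology.HUB_AND_SPOKE:
        return [Link(edge, cloud) for edge in edges]
    if topology is Topology.CONFIGURATION:
        return [Link(cloud, edge) for edge in edges]
    if topology is Topology.EDGE_TO_EDGE:
        # Assumption: full mesh, every edge replicates to every other edge.
        return [Link(a, b) for a in edges for b in edges if a != b]
    raise ValueError(f"unknown topology: {topology}")
```

Calling `replication_links(Topology.HUB_AND_SPOKE, "cloud", ["store-1", "store-2"])` yields one link from each store to the cloud, while `Topology.CONFIGURATION` reverses the direction of those links.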
Identify application workloads
Hand in hand with the data lifecycle, it is important to look at the landscape of application workloads that produce, process or consume data. Workloads presented by stateful applications vary in terms of throughput, responsiveness, scale, and data aggregation requirements. For example, a service that analyzes transactional data from all of a retailer’s store locations requires data from the individual stores to be aggregated to the cloud.
These workloads can be classified into seven types.
- Stream data, such as device and user data, plus vehicle telemetry, location data, and other “stuff” in the IoT. Streaming data requires high throughput and fast queries, and may require cleaning before use.
- Analytics over streaming data, such as when real-time analytics are applied to streaming data to generate alerts. This should be supported either natively by the database or by using Spark or Presto.
- Event data, including events computed on raw streams and stored in the database with atomicity, consistency, isolation, and durability (ACID) guarantees for data validity.
- Smaller datasets with heavy read-only queries, including configuration and metadata workloads that are infrequently changed but must be read very quickly.
- Transactional, relational workloads, such as those in the areas of identity, access control, security and privacy.
- Full-fledged data analysis, when certain applications need to analyze aggregated data across locations (such as the retail example above).
- Workloads that require long-term data retention, including those used for historical comparisons or for audit and compliance reports.
Consider latency and throughput needs
Low latency and high throughput data processing are often high priorities for applications on the edge. An organization’s data architecture at the edge must consider factors such as how much data needs to be processed, whether it arrives as single data points or in bursts of activity, and how quickly the data needs to be available to users and applications.
For example, connected vehicle telemetry, credit card fraud detection and other real-time applications should not suffer the latency of sending data back to the cloud for analysis. They require real-time analytics to be applied right at the edge. Databases deployed at the edge must be able to deliver low latency and/or high data throughput.
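The placement reasoning can be written as simple arithmetic: processing in the cloud costs a network round trip plus cloud processing time, while processing at the edge costs only the (typically slower) edge processing time. The sketch below compares both against a latency budget. The function name, its parameters and the numbers in the usage note are hypothetical, and the model ignores queuing, bandwidth and cost.

```python
from typing import List


def feasible_placements(
    budget_ms: float,
    edge_processing_ms: float,
    cloud_processing_ms: float,
    cloud_round_trip_ms: float,
) -> List[str]:
    """Return where a request can be processed within its latency budget.

    Simplified model (an assumption): edge latency is processing time only,
    cloud latency is network round trip plus processing time.
    """
    edge_total = edge_processing_ms
    cloud_total = cloud_round_trip_ms + cloud_processing_ms

    placements = []
    if edge_total <= budget_ms:
        placements.append("edge")
    if cloud_total <= budget_ms:
        placements.append("cloud")
    return placements
```

With an illustrative 50 ms budget, 20 ms of edge processing, 5 ms of cloud processing and an 80 ms round trip, only the edge qualifies, even though the cloud computes faster once the data arrives.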
Prepare for network partitions
The likelihood of infrastructure outages and network partitions increases as you move from the cloud to the edge. So when designing an edge architecture, consider how ready your applications and databases are to handle network partitions. A network partition is a situation where your infrastructure footprint splits into two or more islands that cannot talk to each other. Partitions can exist in three basic modes between the cloud and the edge.
In largely connected environments, applications can usually, but not always, connect to remote locations to make an API call. Partitions in this scenario can last from several seconds to several hours.
When networks are semi-connected, extended partitions can last for hours, so applications must be able to identify changes that occur during the partition and synchronize their state with remote applications once connectivity is restored.
In a disconnected environment, which is the most common operating mode at the edge, applications run independently. In rare cases, they can connect to a server, but the vast majority of the time they don’t rely on any external site.
As a rule, applications and databases at the very edge should be ready to operate in disconnected or semi-connected modes. Near-edge applications should be designed for semi-connected or largely connected operations. The cloud itself largely operates in a connected mode, which is necessary for cloud operations, but this is also why a public cloud outage can have such a far-reaching and long-lasting impact.
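One common way to keep working through a partition and catch up afterward is a store-and-forward outbox: every local write is applied immediately and also recorded in an ordered queue that is drained once the link returns. The sketch below illustrates that idea only; `EdgeStore`, `Change` and the `send` callback are hypothetical names, and it leaves out conflict resolution, durability and retries with backoff.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Change:
    """One local write, numbered so the remote side can apply changes in order."""
    seq: int
    key: str
    value: str


@dataclass
class EdgeStore:
    data: Dict[str, str] = field(default_factory=dict)
    outbox: List[Change] = field(default_factory=list)
    next_seq: int = 1

    def write(self, key: str, value: str) -> None:
        """Apply the write locally and queue it for later synchronization."""
        self.data[key] = value
        self.outbox.append(Change(self.next_seq, key, value))
        self.next_seq += 1

    def sync(self, send: Callable[[Change], bool]) -> int:
        """Drain the outbox in order; stop at the first failed send.

        `send` is a hypothetical transport callback that returns True when
        the remote side acknowledged the change. Returns the number sent.
        """
        sent = 0
        while self.outbox:
            if not send(self.outbox[0]):
                break  # still partitioned: keep remaining changes queued
            self.outbox.pop(0)
            sent += 1
        return sent
```

While the site is cut off, `write` keeps succeeding and `sync` simply returns 0. Once `send` starts succeeding, the queued changes are delivered in their original order.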
Ensure agility of the software stack
Enterprises run suites of applications, so they must emphasize agility and design for rapid application iteration. Frameworks that increase developer productivity, such as Spring and GraphQL, support agile design, as do open-source databases such as PostgreSQL and YugabyteDB.
Prioritize security
Computing at the edge will inherently increase the attack surface, just as moving operations to the cloud does.
It is essential that organizations adopt security strategies based on identities rather than old-fashioned perimeter protections. Implementing least-privilege policies, a zero-trust architecture and zero-touch provisioning is essential for an organization’s services and network components.
You should also seriously consider encryption in transit and at rest, multi-tenant support at the database layer, and encryption for each tenant. Adding regional locality of data ensures compliance and makes it easy to apply all required geographic access controls.
The edge is increasingly where computing and transactions take place. By designing data applications that optimize speed, functionality, scalability and security, organizations can get the most out of that computing environment.
Karthik Ranganathan is founder and CTO of Yugabyte.