Smart Device Monitoring System: IoT Device System Design
Designing a Smart Monitoring System as a SaaS solution requires efficient, effective, and reliable operation. In this context, a Smart Monitoring System is an intelligent service that continuously monitors and analyses data points from a user's smart devices in real time, providing actionable insights to the end user.
A Smart Monitoring System should be capable of identifying issues in a user's smart home.
For example:
If the bedroom air conditioner is in cooling mode for over 30 minutes but the room temperature isn't decreasing, send an alert to the user saying, "The bedroom is not cooling despite the AC being on."
If an IoT device connected to the Wi-Fi network has a very low signal strength, inform the user to reposition the router closer to the device.
If a high-power consumption device, such as a geyser, remains on for an extended period, notify the user that the device is using excessive power.
Functional Requirements
The service should enable users to add new smart devices for monitoring purposes.
The service should let users create customizable rules that trigger notifications to the end-user when specified conditions are satisfied.
Users should have the ability to view alerts generated by the system.
Notifications should contain practical information that assists users in addressing problems. Users may interact with these notifications, such as turning off a geyser that has been running for too long.
The system should offer a dashboard where users can monitor any existing issues.
Non-functional Requirements
High availability: Target uptime of 99.99%
High reliability: Enable users to manage smart devices, proactively address issues, and resolve problems promptly
Scalability: Handle increasing loads effectively
Eventual consistency: Smart devices continuously stream telemetry and logs to the service in real time, so the system only needs to be eventually consistent with respect to device state.
Real-time alerts: Enable timely issue resolution
Low latency: Ensure smooth user experience and swift issue resolution
Fault tolerance: Maintain uninterrupted service despite failures or errors
The service should frequently monitor all connected devices and notify the user if a device becomes disconnected, is accidentally unplugged, or becomes unreachable by the service.
Capacity Estimation
Assuming a 500 million active user base.
Each user has an average of two devices (e.g., fridge and geyser).
Each user creates an average of two rules per device, resulting in a total of four rules per user.
Therefore, there will be a total of approximately 2 billion rules (500 million x 4).
Assuming each rule takes an average of 1 KB of storage.
The total storage required for the rules is therefore approximately 2 billion KB, or about 2 TB (see the quick check below).
This estimate excludes the storage required for the telemetry data collected from the smart devices.
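A quick back-of-the-envelope check of these figures, expressed as a small Python calculation (the user, device, and rule-size numbers are the assumptions listed above):

# Back-of-the-envelope check using the assumptions above.
users = 500_000_000            # active users
rules_per_user = 4             # 2 devices x 2 rules per device
rule_size_kb = 1               # average storage per rule

total_rules = users * rules_per_user                    # 2,000,000,000 rules
total_storage_tb = total_rules * rule_size_kb / 10**9   # 1 TB = 10^9 KB (decimal units)

print(f"{total_rules:,} rules, ~{total_storage_tb:.0f} TB of rule storage")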
API Design
Add a new smart device:
HTTP Method: POST
Endpoint: /API/devices
Request Body:
{ "name": "string", "device_type": "string", "manufacture": "string", "location": "string", "serial no": "string", "ip-address": "string", "mac-address": "string" }
Response: device-id (string)
Response code: 201 (Created)
Create a customizable rule:
HTTP Method: POST
Endpoint: /API/rules
Request Body:
{ "RuleType": "Wifi/Device", "RuleName": "Monitor Wifi well", "Frequency": "1", "FrequentUnit": "min", "Window": 30, "WindowUnit": "min", "DeviceIds": "deviceIds", "EvaluationCondition": "{> 20%}" }
Response: rule-id (string)
Response code: 201 (Created)
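For illustration, a client could call this endpoint as in the sketch below; it assumes Python's requests library, a placeholder host and auth token, and the field names from the request body above:

import requests

# Hypothetical host and token; the payload mirrors the request body above.
BASE_URL = "https://smart-monitoring.example.com"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

payload = {
    "ruleType": "Device",
    "ruleName": "Bedroom AC not cooling",
    "frequency": 1,
    "frequencyUnit": "min",
    "window": 30,
    "windowUnit": "min",
    "deviceIds": ["bedroom-ac-01"],
    "evaluationCondition": "> 20%",
}

response = requests.post(f"{BASE_URL}/API/rules", json=payload, headers=HEADERS, timeout=10)
response.raise_for_status()              # expect 201 (Created)
print("Created rule:", response.json())  # the service returns the new rule-id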
List all Smart Devices:
HTTP Method: GET
Endpoint: /API/devices
Response Body:
{ "status": "success", "message": "Devices fetched successfully.", "data": [ { "deviceId": "device123", "name": "Living Room Light", "type": "Light", "manufacturer": "SmartLights Inc.", "userId": "user001", "createdAt": "2023-01-15T08:00:00Z", "updatedAt": "2023-01-20T12:00:00Z" }, { "deviceId": "device456", "name": "Kitchen Thermostat", "type": "Thermostat", "manufacturer": "SmartClimate Inc.", "userId": "user001", "createdAt": "2023-01-16T09:00:00Z", "updatedAt": "2023-01-22T14:00:00Z" } ] }
Response code: 200 (OK)
List all Rules:
HTTP Method: GET
Endpoint: /API/rules
Response Body:
{ "status": "success", "message": "Rules fetched successfully.", "data": [ { "ruleId": "rule123", "ruleType": "Device", "ruleName": "Monitor Temperature", "frequency": 5, "frequencyUnit": "minutes", "window": 30, "windowUnit": "minutes", "deviceIds": ["device001", "device002"], "evaluationCondition": "> 20%", "status": "active", "createdAt": "2023-01-15T08:00:00Z", "updatedAt": "2023-01-20T12:00:00Z" }, { "ruleId": "rule456", "ruleType": "Wifi", "ruleName": "Monitor Wifi Usage", "frequency": 10, "frequencyUnit": "minutes", "window": 60, "windowUnit": "minutes", "deviceIds": ["device003", "device004"], "evaluationCondition": "< 50%", "status": "inactive", "createdAt": "2023-01-16T09:00:00Z", "updatedAt": "2023-01-22T14:00:00Z" } ] }
Response code: 200 (OK)
View alerts generated by the system:
HTTP Method: GET
Endpoint: /API/alerts
Response Body:
{ "alerts": [ { "alertId": "a1", "alertName": "High Temperature", "alertMessage": "The living room temperature is above the threshold.", "ruleId": "r1", "ruleName": "Temperature Monitoring", "actionByUser": "Increase AC", "actionStatus": "Completed", "createdAt": "2023-04-30T10:30:00Z", "updatedAt": "2023-04-30T10:35:00Z" }, { "alertId": "a2", "alertName": "Low Humidity", "alertMessage": "The bedroom humidity is below the threshold.", "ruleId": "r2", "ruleName": "Humidity Monitoring", "actionByUser": "Turn on humidifier", "actionStatus": "Pending", "createdAt": "2023-04-30T11:00:00Z", "updatedAt": "2023-04-30T11:00:00Z" } ] }
Response code: 200 (OK)
Interact with notifications (act on an alert, e.g., turn off a geyser that has been running too long):
HTTP Method: POST
Endpoint: /API/alerts/{alert-id}/action
Request Body:
{ "action": "string", "actionStatus": "string" }
Response code: 200 (OK)
Database Design
Device Table: Stores smart device information
Device id (partition key): Unique identifier assigned to each device
Name: Human-readable name of the device
Type: Category or classification of the device
Manufacturer: Company or entity that produced the device
UserId: Identifier linking the device to a specific user
createdAt: Timestamp when the device was added to the system
updatedAt: Timestamp of the most recent update made to the device entry
Rules Table: Stores rules created by the user
Rule Id (partition key): Unique identifier assigned to each rule
RuleType: Category of the rule (e.g., Wifi/Device)
RuleName: Descriptive name of the rule
Frequency: Numeric value representing rule evaluation frequency
FrequencyUnit: Unit of time for the frequency (e.g., minutes, hours)
Window: Numeric value representing the time window for evaluation
WindowUnit: Unit of time for the window (e.g., minutes, hours)
DeviceIds: List of device identifiers associated with the rule
EvaluationCondition: Condition to trigger the rule (e.g., > 20%)
Status: Current state of the rule, e.g., active or inactive
createdAt: Timestamp when the rule was created
updatedAt: Timestamp of the most recent update made to the rule entry
Alerts Table: Stores alerts created by the system based on rules.
Alert Id: Unique identifier assigned to each alert
Alert Name: Descriptive name of the alert
Alert Message: Brief explanation of the alert's cause or issue
Rule Id: Identifier linking the alert to a specific rule
Device Id: Unique identifier assigned to each device
Rule Name: Name of the rule that generated the alert
Status: Alert Status like pending, resolved
ActionByUser: User response to the alert, if applicable
ActionStatus: Current state of the user's response to the alert
CreatedAt: Timestamp when the alert was created
UpdatedAt: Timestamp of the most recent update made to the alert entry
UserDeviceAlertMapping Table: Holds the association between User Id, Device Id, and Alert Id, enabling queries like listing all devices/alerts for a specific user on the dashboard.
User Id(partition key): Unique identifier assigned to each user
Device Id: Unique identifier assigned to each device.
Alert Id: Unique identifier assigned to each alert.
CreatedAt: Timestamp when the user-device association was created
UpdatedAt: Timestamp of the most recent update made.
DeviceIDToRuleID Table: Holds the association between Device Id and Rule Id, enabling queries like listing all rules for a specific device on the dashboard.
Device Id (partition key): Unique identifier assigned to each device
Rule Id: Unique identifier associated with a specific rule
CreatedAt: Timestamp indicating when the entry was created
Considering the query access patterns and data modeling requirements, the following factors should be taken into account before selecting a database:
ACID properties support is not necessary.
A flexible table schema is needed to accommodate varying requirements.
Horizontal partitioning of data is essential.
Based on the table schema, using the partition key should enable efficient data retrieval in a single seek. LSM (Log-Structured Merge-tree) engine databases are more suitable in this regard, as they store partitioned data together. In contrast, B+ tree storage engines can spread data across disk sectors, making it inefficient to fetch all rows matching a given partition key.
A database with high read/write throughput capabilities is required.
Given these factors, a NoSQL key-value/wide-column database such as Apache Cassandra or Amazon DynamoDB would be well suited to the above schema. These databases are known for their high availability, scalability, and ability to sustain high read/write throughput over large volumes of data, making them a good fit for IoT workloads.
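As a minimal sketch of this storage choice, the Rules table from the schema above could be created in DynamoDB using boto3 roughly as follows (the region, billing mode, and example item are assumptions):

import boto3

# Region and billing mode are assumptions; key names follow the Rules table schema above.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

rules_table = dynamodb.create_table(
    TableName="Rules",
    KeySchema=[{"AttributeName": "ruleId", "KeyType": "HASH"}],               # partition key
    AttributeDefinitions=[{"AttributeName": "ruleId", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
rules_table.wait_until_exists()

# Non-key attributes are schemaless, so a rule row can carry the full definition.
rules_table.put_item(Item={
    "ruleId": "rule123",
    "ruleType": "Device",
    "ruleName": "Monitor Temperature",
    "frequency": 5,
    "frequencyUnit": "minutes",
    "window": 30,
    "windowUnit": "minutes",
    "deviceIds": ["device001"],
    "evaluationCondition": "> 20%",
    "status": "active",
})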
System Design
Smart API Service
The service in the above diagram is designed to expose API endpoints that enable users to perform various tasks, including onboarding new devices for monitoring, creating new rules, fetching and updating existing rules, and checking the status of ongoing alerts. As a stateless service, it can be horizontally scaled to handle varying loads, ensuring optimal performance and responsiveness.
Smart Database
This database cluster stores the rules created by the web service, using partition keys to map data to the appropriate database instance, likely through consistent hashing. All the tables mentioned above are stored in this key-value store, with data partitioned based on their keys.
Change Data Capture (CDC) Pattern
This system relies on capturing any new changes to the smart database store, such as rule creation or updates. Kafka Connect source connectors can monitor table changes and publish the change events to Kafka. Consumers in other services then read these events and call the Rule Scheduler Service APIs to create, update, or delete monitoring rules as needed.
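A minimal sketch of the consumer side of this CDC flow, assuming the kafka-python client, a topic named rules-cdc, a simple change-event shape, and a hypothetical Rule Scheduler HTTP endpoint:

import json
import requests
from kafka import KafkaConsumer

# Assumed topic, broker address, and Rule Scheduler endpoint.
consumer = KafkaConsumer(
    "rules-cdc",
    bootstrap_servers=["kafka:9092"],
    group_id="rule-scheduler-sync",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    change = message.value      # one change event per rule create/update/delete
    op = change.get("op")       # assumed values: "create", "update", "delete"
    rule = change.get("rule", {})
    if op == "delete":
        requests.delete(f"http://rule-scheduler/rules/{rule['ruleId']}", timeout=5)
    else:
        requests.put(f"http://rule-scheduler/rules/{rule['ruleId']}", json=rule, timeout=5)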
Rule Scheduler Service
The Rule Scheduler Service is a critical, stateful component of this design; its stateful nature makes recovery during failures more challenging. To set up rule monitoring, a scheduling framework such as Quartz (or a comparable Java task-scheduling library) can be used. The service generates events at the interval specified in each rule definition, and every generated event is sent to Kafka for further evaluation to determine whether the user needs to take action on their IoT devices.
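Quartz is a Java framework; as an illustrative Python analogue, the sketch below uses APScheduler to fire one Kafka evaluation event per rule at the rule's configured frequency (the broker address, topic name, and event shape are assumptions):

import json
from apscheduler.schedulers.blocking import BlockingScheduler
from kafka import KafkaProducer

# Assumed broker address and topic name.
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
scheduler = BlockingScheduler()

def emit_evaluation_event(rule):
    # One event per tick; the Rule Evaluation Service consumes these from Kafka.
    producer.send("rule-evaluation-events", {
        "ruleId": rule["ruleId"],
        "deviceIds": rule["deviceIds"],
        "window": rule["window"],
        "evaluationCondition": rule["evaluationCondition"],
    })

def schedule_rule(rule):
    # Fire the rule at its configured frequency (assumes the frequency is in minutes).
    scheduler.add_job(emit_evaluation_event, "interval",
                      minutes=rule["frequency"], args=[rule],
                      id=rule["ruleId"], replace_existing=True)

# Example: schedule one rule, then run the scheduler loop.
schedule_rule({"ruleId": "rule123", "deviceIds": ["device001"], "frequency": 1,
               "window": 30, "evaluationCondition": "> 20%"})
scheduler.start()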
Rule Evaluation Service
The Rule Evaluation service is a stateless component that processes each Kafka event. For every event, it calls the Telemetry API service to retrieve the necessary data points based on the user-defined window value in the rule. For instance, if there is one data point per minute stored in the Telemetry Timeseries Database, and the user wants to evaluate a 30-minute window, the service would fetch all 30 data points from the Telemetry API Service.
Once the data points are retrieved, the Rule Evaluation service assesses them based on the evaluation condition specified by the user. If the evaluation condition (e.g., the average of all values exceeding 20 percent) is met, the user will be alerted.
In this manner, the Rule Evaluation service efficiently processes incoming events, retrieves the relevant data points, and performs analysis to determine if user alerts are necessary. By doing so, it ensures that users are promptly notified of any issues and can take timely action to address them.
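A minimal sketch of this evaluation step, assuming a hypothetical Telemetry API endpoint that returns one data point per minute and an evaluation condition of the form "> 20%":

import statistics
import requests

def evaluate_rule(event):
    # Fetch one data point per minute for the rule's window from the Telemetry API
    # (the endpoint shape is assumed for illustration).
    resp = requests.get(
        "http://telemetry-api/datapoints",
        params={"deviceId": event["deviceIds"][0], "windowMinutes": event["window"]},
        timeout=5,
    )
    resp.raise_for_status()
    values = [p["value"] for p in resp.json()["datapoints"]]

    # Parse a condition like "> 20%" and compare it against the window average.
    operator, threshold = event["evaluationCondition"].split()
    threshold = float(threshold.rstrip("%"))
    average = statistics.mean(values)

    breached = average > threshold if operator == ">" else average < threshold
    if breached:
        publish_alert(event["ruleId"], average)

def publish_alert(rule_id, observed_value):
    # Placeholder: the real system would write to the Alerts table and notify the user.
    print(f"ALERT rule={rule_id} observed={observed_value:.1f}")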
Telemetry Pipeline
Device Agent
The Device Agent is a software component that can either come pre-installed or be installed on a user's smart device. Users should configure their devices to meet the monitoring system's requirements before onboarding. This configuration ensures seamless communication between the smart device and Telemetry Ingestion Service using the designated API endpoints.
It's crucial to ensure the device has internet connectivity, allowing it to send data to the Telemetry Ingestion Service for evaluation and potential user notifications about ongoing issues. To prevent overwhelming the upstream services, the Device Agent can use batching techniques for more efficient data transmission.
The telemetry data should be timestamped and include any relevant information required for aggregation.
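A minimal sketch of the batching behaviour described above, with an assumed ingestion endpoint, batch size, and sensor reading:

import time
import requests

INGEST_URL = "https://ingest.example.com/API/telemetry"   # assumed endpoint
BATCH_SIZE = 30                                           # assumed batch size

def read_sensor():
    # Placeholder for the device-specific sensor read (e.g., power draw in watts).
    return {"deviceId": "geyser-01", "metric": "power_watts",
            "value": 2000.0, "timestamp": int(time.time())}

buffer = []
while True:
    buffer.append(read_sensor())
    if len(buffer) >= BATCH_SIZE:
        try:
            # One batched request instead of one request per reading.
            requests.post(INGEST_URL, json={"datapoints": buffer}, timeout=10).raise_for_status()
            buffer.clear()
        except requests.RequestException:
            pass           # keep the batch and retry on the next tick
    time.sleep(60)         # one reading per minute, matching the 1-minute aggregation window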
Telemetry Ingestion Service
This service is responsible for receiving batch data from the Device Agent and sending it to Kafka. The service also handles authentication and authorization of the data before accepting and saving it into Kafka.
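A minimal sketch of such an ingestion endpoint, assuming a Flask HTTP handler, a placeholder bearer-token check, and the kafka-python producer:

import json
from flask import Flask, abort, jsonify, request
from kafka import KafkaProducer

app = Flask(__name__)
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def is_authorized(token):
    # Placeholder: validate the device token against the real auth service.
    return token is not None and token.startswith("Bearer ")

@app.route("/API/telemetry", methods=["POST"])
def ingest():
    if not is_authorized(request.headers.get("Authorization")):
        abort(401)
    batch = request.get_json(force=True).get("datapoints", [])
    for point in batch:
        producer.send("telemetry-raw", point)   # one Kafka record per data point
    producer.flush()
    return jsonify({"accepted": len(batch)}), 202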
Telemetry Aggregation Service
Leveraging Apache Spark's streaming framework, this service processes the ingested data. It groups the data by device and time intervals of 1 minute, applying an aggregation function such as average, sum, or count based on the specific use case and requirements. The function is applied to the data points within each 1-minute interval, generating aggregated results.
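A minimal sketch of this 1-minute windowed aggregation using PySpark Structured Streaming (it assumes telemetry arrives on a Kafka topic named telemetry-raw and that the Spark-Kafka connector is available; the console sink stands in for the real time-series store):

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("TelemetryAggregation").getOrCreate()

schema = StructType([
    StructField("deviceId", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
    StructField("timestamp", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")                                   # requires the Spark-Kafka connector
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "telemetry-raw")
       .load())

points = raw.select(from_json(col("value").cast("string"), schema).alias("p")).select("p.*")

aggregated = (points
              .withWatermark("timestamp", "2 minutes")
              .groupBy(window(col("timestamp"), "1 minute"), col("deviceId"), col("metric"))
              .agg(avg("value").alias("avg_value")))

# The console sink is a stand-in; in production this would write to the time-series store.
query = aggregated.writeStream.outputMode("update").format("console").start()
query.awaitTermination()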
Timeseries Database
The Telemetry Aggregation Service produces one aggregated data point per device per 1-minute window, and these points need to be stored in a database that handles time-series workloads well, such as Amazon DynamoDB or Apache Cassandra. The aggregated 1-minute data can then be efficiently queried and analyzed by the Smart Monitoring System, allowing users to gain insights and receive alerts based on their custom rules.
Telemetry API Service
This stateless API service is consumed by upstream services to fetch a device's data points within a given time frame for evaluation. This approach ensures seamless data retrieval and analysis, supporting a robust and efficient smart monitoring system.
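A minimal sketch of this read path, assuming the aggregated points live in a DynamoDB table named TelemetryAggregated with deviceId as the partition key and an epoch-seconds ts sort key:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("TelemetryAggregated")   # assumed table and key names

def get_datapoints(device_id, start_ts, end_ts):
    # Fetch all 1-minute aggregates for one device inside the requested window.
    response = table.query(
        KeyConditionExpression=Key("deviceId").eq(device_id) & Key("ts").between(start_ts, end_ts)
    )
    return response["Items"]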
Follow-up Questions Asked by Interviewers
How can we improve the resilience of raw telemetry data at the IoT device level?
To enhance the resilience of raw telemetry data on the IoT device side, the Device Agent can periodically upload batched data to the Data Ingestion Service while continuing to collect data locally. Data can be temporarily stored in the device's memory before being uploaded to the Data Ingestion Service. Once the data is uploaded, it can be deleted from the device, making room for new data. By preserving data at the device level for a short time before uploading, the system can prevent data loss in the event of connectivity issues with the Data Ingestion Service. Additionally, uploading data in batches helps reduce pressure on upstream services. During device configuration, users can be advised to select an upload frequency, providing the Smart Monitoring System with control over tuning the frequency of data uploads as well.
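A minimal sketch of this buffer-then-upload behaviour, using a local on-disk spool so readings survive connectivity loss (the file path, endpoint, and retry policy are assumptions):

import json
import os
import requests

SPOOL_FILE = "/var/lib/device-agent/spool.jsonl"          # assumed local spool location
INGEST_URL = "https://ingest.example.com/API/telemetry"   # assumed ingestion endpoint

def spool(datapoint):
    # Append each reading to a local file so it survives restarts and network loss.
    with open(SPOOL_FILE, "a") as f:
        f.write(json.dumps(datapoint) + "\n")

def flush_spool():
    # Upload everything collected so far; delete the local copy only after success.
    if not os.path.exists(SPOOL_FILE):
        return
    with open(SPOOL_FILE) as f:
        batch = [json.loads(line) for line in f if line.strip()]
    if not batch:
        return
    try:
        requests.post(INGEST_URL, json={"datapoints": batch}, timeout=10).raise_for_status()
        os.remove(SPOOL_FILE)   # free space only after the upload is acknowledged
    except requests.RequestException:
        pass                    # keep the spool and retry at the next upload interval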
How can we enhance the reusability of the Telemetry Pipeline?
The Telemetry Pipeline should be designed to be generic and flexible enough to address various use cases required by the smart monitoring system. Instead of building the pipeline for specific use cases, it should be developed with the overall domain in mind, ensuring it can accommodate future requirements and adapt to changing needs.
For example, the Telemetry Aggregation Service currently aggregates data at the device level with a granularity of 1 minute. By maintaining high-cardinality metrics with a 1-minute granularity, the pipeline can support future services and expansions without requiring significant modifications. This approach ensures the pipeline remains adaptable and reusable for a wide range of potential use cases.
How do you handle data privacy and security in the IoT monitoring system, particularly when transmitting data from devices to the cloud?
To ensure data privacy and security in the IoT monitoring system, data transmitted from devices to the cloud is encrypted using industry-standard protocols such as TLS/SSL. Additionally, robust authentication and authorization mechanisms are employed to prevent unauthorized access to the system.
What are the possible bottlenecks in the current architecture, and how can they be mitigated?
Possible bottlenecks in the current architecture could include the data ingestion service, telemetry aggregation service, or rule scheduling service. To mitigate these, consider load balancing, horizontal scaling, and optimizing the processing algorithms to reduce latency and improve overall system performance.
Can the system support predictive analytics or machine learning models for more advanced monitoring and alerting capabilities?
Yes, the system can support predictive analytics or machine learning models for more advanced monitoring and alerting capabilities. By leveraging the granular telemetry data collected and aggregated in the Timeseries Database, machine learning algorithms can be trained and integrated into the rule evaluation process. This approach enables more sophisticated, data-driven insights and predictions, enhancing the monitoring system's ability to alert users of potential issues before they become critical.