Context

Time series data is a collection of metrics or quantities gathered at even intervals in time and ordered chronologically. There is a potential for correlation between observations at different time periods. The time interval at which data is collected is called the time series frequency. Examples of devices generating time series data are IoT sensors and network routers.

Time series data for performance monitoring and anomaly detection can be collected from network routers using Simple Network Management Protocol (SNMP), Real-Time Performance Monitoring (RPM) probes and telemetry. RPM statistics are gathered by configuring active probes on the router to track and monitor traffic across the network and to investigate network problems. Probes collect packets per destination and per application, including ping Internet Control Message Protocol (ICMP) packets, User Datagram Protocol and Transmission Control Protocol (UDP/TCP) packets with user-configured ports, user-configured Differentiated Services code point (DSCP) type-of-service (ToS) packets, and Hypertext Transfer Protocol (HTTP) packets. The probes can monitor average round-trip time (RTT), jitter (the difference between minimum and maximum round-trip time), maximum round-trip time, minimum round-trip time and packet loss, enabling network operators to accurately measure the performance between two network endpoints. This requires the data collected from the probes to be processed and analysed in real time with very low latency.

InfluxDB and TimescaleDB are database products purpose-built for time series data, but this blog looks at the use of Couchbase as a time series database for processing and analysing RPM data from Juniper routers used in IP/MPLS networks within video contribution and primary distribution. The diagram below outlines the end-to-end (E2E) video delivery chain with video contribution and primary distribution.

Time series data collected from the RPM probes is used to calculate link-level performance metrics and to check for jitter or delay, ensuring there is no problem in video transmission. Any issue with video transmission in the contribution and distribution space will impact a very large customer base.

The important considerations in choosing Couchbase are its support for a flexible data model with performance, availability and reliability at scale, along with support for real-time analytics and eventing.

Modelling Data from Router RPM Probes

The diagram below shows two routers connected through a link. The RPM probe is configured on the Juniper router through a SLAX script. The probe monitors the interfaces on the router connected to the link carrying traffic from one router host to the other. The probe is configured to send RPM metrics every 15 seconds. The metrics land on a syslog server, and the data from the syslog server is written into Kafka by a Kafka producer. Data from Kafka is then moved into Couchbase using the Couchbase Kafka connector. The solution looks at a data model for Couchbase version 6.5.1 supporting key-value access for link-level performance monitoring.

Given below is the data model for the link connecting Router A and Router B, referred to as the Link Catalog document, stored in the Link Catalog bucket.

Key: <Link ID>

JSON document:

{
  "type": "LinkCatalog", /* Type of document */
  /* Host names of the routers this link connects */
  "a.uk.org": { /* Router A host name */
    "start": 1593561600000, /* Time at which the first performance metric was collected for this host */
    "interface": "et-1/1/0", /* Interface name */
    "durationSecs": 15 /* Polling interval of metric collection, in seconds */
  },
  "b.uk.org": { /* Router B host name */
    "start": 1593561600000,
    "interface": "et-0/1/4",
    "durationSecs": 15
  }
}

For the transmission metrics at Router A on interface et-1/1/0, for the first poll at the start time 1593561600000 specified in the Link Catalog document, the following document is stored in the Metrics bucket in Couchbase:

Key: <Link ID>|<Host Name>|0

{
  "type": "metric",
  "subtype": "raw",
  "ts": 1593561600000,
  "maxRtt": 56.5,
  "averageRtt": 4.54,
  "minRtt": 0.54,
  "jitter": 55.957,
  "loss": 0
}

For the next poll, 15 seconds later:

Key: <Link ID>|<Host Name>|1

{
  "type": "metric",
  "subtype": "raw",
  "ts": 1593561615000,
  "maxRtt": 56.5,
  "averageRtt": 4.54,
  "minRtt": 0.54,
  "jitter": 55.957,
  "loss": 0
}

Similar metrics documents are created for the reception metrics at Router B on interface et-0/1/4, for the first poll at the start time specified in the Link Catalog document and for each subsequent poll.

The logic for an application reading data from the router RPM probe (or polling through SNMP) is as follows. The application receives metrics every 15 seconds in a JSON payload such as:

{
  "maxRtt": 17.92,
  "hostName": "a.uk.org",
  "loss": 0,
  "jitter": 17.438,
  "averageRtt": 2.75,
  "interfaceId": "et-1/1/0",
  "minRtt": 0.48,
  "type": "RPM Probe",
  "LinkId": "Link1234",
  "timestamp": 1594549279000
}

The application constructs a key using the Link ID and checks whether this key already exists in Couchbase. If it does not exist, the application creates the Link Catalog document with the Link ID as key and places the host name along with the interface ID and the timestamp of the first poll. If the key exists, the application checks whether the host name in the Link Catalog document matches the host name in the metric; if it does not match, it adds the additional host name, interface ID and first-poll timestamp. The duration value under each host name is set to the polling interval.
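The catalog-maintenance step above can be sketched as a small pure function. This is a minimal illustration, not Couchbase SDK code: `update_link_catalog` is a hypothetical helper that takes the existing Link Catalog document (or None when the key was not found) plus the incoming probe payload, and returns the document to upsert.

```python
# Sketch of the Link Catalog update logic. `catalog` is the existing
# Link Catalog document (None if the key lookup missed); `metric` is
# the incoming RPM probe payload shown above. Names are illustrative.
def update_link_catalog(catalog, metric, polling_interval_secs=15):
    host = metric["hostName"]
    if catalog is None:
        # First poll ever seen for this link: create the catalog document.
        catalog = {"type": "LinkCatalog"}
    if host not in catalog:
        # First poll seen for this host: record its start timestamp,
        # interface and polling interval under the host name.
        catalog[host] = {
            "start": metric["timestamp"],
            "interface": metric["interfaceId"],
            "durationSecs": polling_interval_secs,
        }
    return catalog
```

In practice the application would do a key-value get on the Link ID, pass the result through this function, and upsert the returned document (ideally with CAS to guard against concurrent writers).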

With this data model an application can extract the metrics for a link (for example, Link ID Link1234) within a specified time period (13 July 2020 05:00:00 to 13 July 2020 05:30:00) to get Max(Packet Loss), Max(Jitter) and Max(Average Round Trip Time) using Couchbase key-value access.

Link1234 is connected to routers with host names a.uk.org and b.uk.org.

Link1234 RPM probe data for host names a.uk.org and b.uk.org started on 01 July 2020 at 01:00:00. The polling interval is 15 seconds.

Metric document key for Link1234 for host a.uk.org at the start of the period, 13 July 2020 05:00:00:

Link1234|a.uk.org|((epoch ms of 13 July 2020 05:00:00) - (epoch ms of 01 July 2020 01:00:00)) / 15000

Metric document key for Link1234 for host b.uk.org at the start of the same period:

Link1234|b.uk.org|((epoch ms of 13 July 2020 05:00:00) - (epoch ms of 01 July 2020 01:00:00)) / 15000

The total number of key-value fetches for the 30 minute interval for host a.uk.org will be 120 (30 minutes × 60 seconds / 15 second polling interval); similarly, 120 documents for b.uk.org.
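The key arithmetic above is easy to get wrong, so here is a minimal sketch of it in Python. `metric_key` and `epoch_ms` are illustrative helper names, not part of any SDK; only the difference between the two wall-clock times matters, so both are treated as naive datetimes on the same clock.

```python
from datetime import datetime

def epoch_ms(dt):
    # Milliseconds since the epoch; both times share the same clock,
    # so only their difference matters for the key index.
    return int(dt.timestamp() * 1000)

def metric_key(link_id, host, start_ms, poll_ms, duration_ms=15000):
    """Build the metric document key for the poll covering poll_ms."""
    index = (poll_ms - start_ms) // duration_ms
    return f"{link_id}|{host}|{index}"

start = epoch_ms(datetime(2020, 7, 1, 1, 0, 0))    # first poll: 01 July 2020 01:00:00
query = epoch_ms(datetime(2020, 7, 13, 5, 0, 0))   # window start: 13 July 2020 05:00:00

first_key = metric_key("Link1234", "a.uk.org", start, query)
# 12 days 4 hours at one sample per 15 s gives index 70080.

# The 120 keys for the 30 minute window are then consecutive indices:
window_keys = [metric_key("Link1234", "a.uk.org", start, query + i * 15000)
               for i in range(120)]
```

The same construction with host b.uk.org yields the 120 keys for the reception side.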

 

The application then calculates the transmission-side metrics for host a.uk.org as Max(Jitter), Max(Packet Loss) and Max(Average Round Trip Time) among all 120 metric documents retrieved above. For the reception-side metrics it uses the 120 metric documents retrieved for host name b.uk.org. The metrics data can be fetched and plotted on a UI built in Grafana to show maxRtt, minRtt, jitter and packet loss for each host connected to a link, over the 15 second polling interval or a larger interval of one minute or five minutes.
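The aggregation step itself is a straightforward fold over the fetched documents. A minimal sketch, assuming the raw metric documents have the shape shown earlier (`aggregate` is an illustrative name):

```python
# Given the metric documents fetched for one host over a window,
# compute the per-window aggregates described above.
def aggregate(metrics):
    return {
        "maxRtt": max(m["maxRtt"] for m in metrics),       # worst round trip
        "minRtt": min(m["minRtt"] for m in metrics),       # best round trip
        "averageRtt": max(m["averageRtt"] for m in metrics),
        "jitter": max(m["jitter"] for m in metrics),       # worst jitter
        "loss": max(m["loss"] for m in metrics),           # worst packet loss
    }
```

The same function serves both the transmission side (documents for a.uk.org) and the reception side (documents for b.uk.org).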

Use of Eventing to Create Aggregate Documents

To create aggregate documents for a link per hour, day and month, we can use a Couchbase Eventing function as described below. There is a limitation in Couchbase 6.5: a function that is invoked by a timer callback cannot reliably create a fresh timer. Creating hourly aggregates therefore requires a function that runs every hour to aggregate all the 15 second samples and create an aggregate view for that hour.

For every Link ID document in the Link Catalog bucket, create a Link Aggregate Scheduler document as below:

Key: <Link ID>|AGGS|H

{
  "type": "LAS", /* LAS = Link Aggregate Scheduler */
  "created": 1593561600000, /* Time at which this document was created */
  "updated": 1593561600000, /* Time at which the document was updated */
  "Id": "Link1234", /* Link ID */
  "AGGSTYPE": "Hour" /* Aggregate scheduler type */
}

The eventing function hourlyAggregateScheduler fires on update (OnUpdate) of documents in the Link Catalog bucket. The function filters documents, checking whether the document type is LAS; if it is, the function creates a timer to fire one hour from the current time.

Upon execution of the callback function associated with the timer, it performs the same logic as explained above for an application extracting metrics for a link within a specified time period. The time period is the difference between the current time and the updated time in the Link Aggregate Scheduler (LAS) document. Once all the metrics have been fetched, it calculates Max(Jitter), Max(Packet Loss), Max(Average Round Trip Time), Min(minRtt) and Max(maxRtt), and writes the data to the Metrics bucket in a new document as shown below.

Key: <Link ID>|<Host Name>|AGG|<Date>|<Hour Interval>

Example key: Link1234|a.uk.org|AGG|27-07-2020|5 (indicating the aggregate for the 5th hour of 27 July)
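The aggregate key can be derived directly from the epoch-millisecond timestamp of the hour being aggregated. A small sketch, assuming UTC and an illustrative helper name `aggregate_key`:

```python
from datetime import datetime, timezone

def aggregate_key(link_id, host, ts_ms):
    """Build the hourly aggregate key <Link ID>|<Host Name>|AGG|<Date>|<Hour>
    from an epoch-millisecond timestamp (UTC assumed here)."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return f"{link_id}|{host}|AGG|{dt.strftime('%d-%m-%Y')}|{dt.hour}"
```

For a timestamp falling in the 05:00 hour of 27 July 2020 UTC this produces the example key above.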

{
  "type": "metric",
  "subtype": "aggregate",
  "maxRtt": 56.5,
  "averageRtt": 4.54,
  "minRtt": 0.54,
  "jitter": 55.957,
  "loss": 0
}

Alerts can be created once metrics have been collected for a significant amount of time: identify baseline raw/aggregate metrics for the best-performing links and compare them against incoming metrics in real time.

Posted by Mritunjay

Senior Solutions Engineer at Couchbase
