Quickstart

Let's see how we can use TMLL modules quickly!

Initializing TMLL

Installing Trace Server

TMLL includes a built-in feature to automatically download and install Trace Server on your machine; however, this functionality is currently available only on Linux and Windows. If you're using one of these platforms, you can skip the rest of this section. Otherwise, you will need to download and run Trace Server manually.

As indicated, TMLL leverages the outputs derived by Trace Server, so you need to have Trace Server installed and running on your machine while using TMLL. Using this link, you can download the appropriate version of Trace Server for your machine. Once downloaded, run its executable.

You can check whether Trace Server is running by visiting http://localhost:8080/tsp/api/health (the default address).
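If you prefer to perform this check from a script, a minimal standard-library sketch might look like the following; it assumes (as is typical for health endpoints) that a healthy server answers this URL with HTTP 200:

```python
import urllib.request
import urllib.error

def is_trace_server_up(url: str = "http://localhost:8080/tsp/api/health",
                       timeout: float = 2.0) -> bool:
    """Return True if the Trace Server health endpoint responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, bad hostname, etc. -> server not reachable
        return False
```

Calling `is_trace_server_up()` before initializing the client gives a clearer error than a failed connection deep inside a module.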

Creating a TMLL Client

Before using any specific module, you must initialize a TMLL client. The client manages trace files, including importing them, creating experiments, fetching outputs, and performing other related tasks.

from tmll.tmll_client import TMLLClient

# This client object will be used in the subsequent steps
client = TMLLClient(verbose=True)

Importing Traces

The primary input for TMLL consists of trace files. Assuming you have already collected your trace files, importing them into the TMLL client is a simple process.

# We create an experiment from the trace files. This experiment will also be used in the subsequent steps.
experiment = client.create_experiment(traces=[
    {
        "path": "/path/to/the/first/trace", # Required
        "name": "your_custom_name" # Optional. If absent, a default name would be assigned
    },
    {
        "path": "/path/to/the/second/trace"
    }
], experiment_name="EXPERIMENT_NAME")

From this point forward, we will pass the client and experiment objects to the modules we want to use. To clarify, the client is responsible for communicating with the Trace Server, while the experiment handles various trace outputs (e.g., CPU and memory usage, flame charts, graphs, etc.).

Anomaly Detection

This group includes modules designed to identify abnormal data points (i.e., timestamps), with each module offering insights into anomalies from different perspectives.

Anomaly Detection

The goal is to identify data points or time periods that deviate significantly from the system's general behavior.

Initializing the Module

Finding Anomalies

Currently, TMLL supports the following methods to pinpoint anomalies in the trace data:

  • zscore: Identifies anomalies based on how many standard deviations a data point is from the mean.

  • iqr: Detects anomalies by finding data points that lie outside the interquartile range (IQR).

  • moving_average: Flags anomalies based on deviations from a smoothed average over a moving window.

  • combined: The combination of zscore, iqr, and moving_average. It provides more robust detection results, but may omit data points that are not flagged by every method.

  • iforest: Uses the Isolation Forest algorithm to isolate and identify anomalous data points.

  • seasonality: Detects anomalies by identifying deviations from expected seasonal patterns in the data. This is best suited to seasonal data, such as a system's periodic tasks.
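As a rough illustration of the simplest of these methods (a standalone sketch, not TMLL's internal implementation), z-score detection flags every point that lies more than a given number of standard deviations from the mean:

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=2.0):
    """Return indices of points more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # a constant series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# A flat series with one spike: only the spike is flagged
print(zscore_anomalies([10, 11, 10, 9, 10, 50, 10, 11]))  # → [5]
```

The other methods follow the same pattern with a different notion of "deviation" (quartile distance for iqr, distance from a windowed mean for moving_average).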

Plotting the Results

Assuming the anomaly detection module is initialized with CPU and Disk usage data, and the methodology is set to z-score (threshold = 2), anomalies can be visualized using the plot_anomalies method.

CPU usage over time. As visible, no significant anomalies were detected via zscore (threshold =2) method.
Disk usage over time. Two anomaly points were identified for this resource.
The combination of CPU and disk usage (analyzed using PCA) over time revealed three significant anomaly periods (consistent anomaly points within specific time intervals) and several isolated anomaly points.

Memory Leak Detection

Poor memory management strategies can lead to memory leaks, causing significant damage to the system as the program runs for extended periods. This module focuses on identifying and highlighting issues such as unfreed allocated memory pointers or steadily increasing memory usage for the user.
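The core idea behind the steadily-increasing-usage check can be sketched independently of TMLL: fit a least-squares line to the memory samples and inspect its slope. A persistently positive slope over a long run hints at a leak. This is a simplified illustration, not the module's actual implementation:

```python
def memory_trend_slope(samples):
    """Least-squares slope of memory usage over sample index."""
    n = len(samples)
    mean_x = (n - 1) / 2                      # mean of indices 0..n-1
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Steadily growing usage yields a positive slope
print(memory_trend_slope([100, 110, 120, 130, 140]))  # → 10.0
```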

Initializing the Module

Indicating the Memory Leaks

Plotting the Results

Memory usage over time is displayed alongside a trend line derived from a linear regression model.
Memory operations over time, illustrating the number of allocation and deallocation operations performed.
A bar plot displaying the lifetime of memory pointers over time.
The fragmentation score (in %) of the memory as the program executes.

Interpreting the Results

Interpreting the analysis results from this module can be challenging for users who are not experts in analyzing memory behaviors and indicators. To assist users, TMLL offers an interpretation method that provides detailed analysis results, recommendations, and insights.

Root Cause Analysis

In the event of unexpected behaviors, the modules in this group can assist in identifying their root causes. This allows you to understand the causality and correlation between system components and their behavior over time.

Correlation Analysis

System components often influence each other's behavior, creating a chain of interdependent actions. For example, a CPU spike at a specific timestamp might result from a particular disk activity operation. This module enables you to analyze and understand the correlations between system components.

Initializing the Module

Finding the Correlations

TMLL automatically selects an appropriate correlation methodology (e.g., Pearson, Spearman, or Kendall) based on the characteristics of the data distributions. This allows you to determine the correlation for each pair of system components according to their specific attributes.
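To illustrate why the choice of method matters (a standalone sketch, not TMLL's selection logic), Spearman correlation is simply Pearson correlation applied to the ranks of the data, which makes it sensitive to any monotonic relationship rather than only linear ones:

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman(x, y):
    """Spearman correlation: Pearson on ranks (no tie handling in this toy)."""
    rank = lambda s: [sorted(s).index(v) for v in s]
    return pearson(rank(x), rank(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]           # monotonic but non-linear
print(spearman(x, y))            # → 1.0 (perfect rank correlation)
print(round(pearson(x, y), 3))   # below 1.0: the linearity assumption is violated
```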

Plotting the Correlation Matrix

The correlation matrix for various metrics over the entire trace period shows that CPU and disk usage have a high positive correlation, memory usage and histogram exhibit a moderate positive correlation, and the remaining components have minimal impact on each other.

You can also generate a time-series plot comparison for the metrics to observe their behaviors over time.

Time-series plot for different system components over time.

Correlation Lag Analysis

The impact of different system components may occur with a delay, resembling a chain of actions where one component influences another step by step. For example, a spike in CPU usage might lead to an increase in memory usage after a short delay, rather than both events occurring simultaneously. TMLL provides an option to identify the lag between each component.

The lag analysis between the histogram (number of events at each timestamp) and memory usage indicates a lag of -1 (or +1 when comparing memory usage to the histogram). This suggests a one-unit timestamp difference in the impact of one component on the other.
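The idea behind lag detection can be sketched with a simple cross-correlation: try every shift within a window and keep the one that aligns the two series best. This is an illustrative toy, not TMLL's implementation:

```python
def best_lag(x, y, max_lag=5):
    """Lag (in samples) by which y trails x, chosen as the shift that
    maximizes a dot-product cross-correlation."""
    def score(lag):
        if lag >= 0:
            pairs = zip(x[:len(x) - lag], y[lag:])
        else:
            pairs = zip(x[-lag:], y[:len(y) + lag])
        return sum(a * b for a, b in pairs)
    return max(range(-max_lag, max_lag + 1), key=score)

# y is x delayed by one step, so the best alignment is lag = 1
x = [0, 0, 5, 0, 0, 0]
y = [0, 0, 0, 5, 0, 0]
print(best_lag(x, y))  # → 1
```

A negative result means the influence runs the other way, matching the -1/+1 symmetry described above.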

Performance Trend Analysis

Programs can experience performance shifts, behavioral changes, or unexpected regressions. Identifying these trends can be challenging, especially when dealing with numerous system components. The modules in this group are designed to uncover performance trends in trace data that traditional analyses might overlook.

Change Point Detection

This module detects moments when the statistical properties of a system metric change significantly, helping users identify sudden shifts in performance metrics such as CPU usage spikes or increased latency. Additionally, by aggregating different metrics (e.g., using PCA, Z-score, or voting), TMLL can more effectively pinpoint significant change points that may be difficult to identify through manual analysis of trace data.
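As a toy illustration of the underlying idea (not TMLL's algorithm), a change point can be located by sliding two adjacent windows over the series and finding where their means differ the most:

```python
def change_point(series, window=3):
    """Index where the absolute difference between the means of the
    preceding and following `window` samples is largest."""
    best_i, best_gap = None, -1.0
    for i in range(window, len(series) - window + 1):
        left = sum(series[i - window:i]) / window
        right = sum(series[i:i + window]) / window
        gap = abs(right - left)
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i

# Usage jumps from ~10 to ~50 at index 4
print(change_point([10, 11, 10, 9, 50, 51, 49, 50]))  # → 4
```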

Initializing the Module

Indicating the Change Points

You can pinpoint the significant change points based on different parameters.

Plotting the Change Points

Top-2 significant change points for CPU usage (individually).
Top-2 significant change points for Disk usage (individually).
Top-2 significant change points for Histogram (number of events) (individually).
Top-2 significant change points for Memory usage (individually).
Top-2 significant change points for the aggregation of metrics using Z-Score.
Top-2 significant change points for the aggregation of metrics using Voting.
Top-2 significant change points for the aggregation of metrics using PCA.

Predictive Maintenance

The future characteristics of a system often depend on its historical behavior. By leveraging this historical data, it becomes possible to predict various aspects of the system's future, helping to prevent unexpected issues. This group includes features such as forecasting upcoming resource usage, building performance models to detect regressions, creating alarm systems, and more.

Capacity Planning (Forecasting)

The performance characteristics of system resources (i.e., CPU, memory, and disk usage) are fundamental to the overall system performance. These characteristics must remain stable, as unexpected increases in resource usage can lead to performance regressions. This module is designed to forecast the future usage of various system resources based on their historical observations and report any violations to you.

Initializing the Module

Forecasting Resource Usage

To forecast different system components (e.g., CPU, memory, and disk usage), you can optionally specify custom threshold values for each resource. These thresholds define the maximum acceptable usage for each resource. If the forecast predicts usage exceeding the threshold, the module will flag it as a violation. If this violation persists beyond a defined time period (e.g., 10 ms, 15 minutes, 1 hour, etc.), it will be raised as an alarm. You can choose from various forecasting methods, including AutoRegressive Integrated Moving Average (ARIMA), Vector AutoRegressive (VAR), or Moving Average.
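The mechanics of a Moving Average forecast combined with a threshold-violation check can be sketched as follows. This is an illustrative simplification; the function and parameter names are hypothetical, not TMLL's API:

```python
def forecast_moving_average(history, steps, window=3):
    """Naive moving-average forecast: each new point is the mean of the
    last `window` values (including previously forecast ones)."""
    values = list(history)
    for _ in range(steps):
        values.append(sum(values[-window:]) / window)
    return values[len(history):]

def violation_periods(forecast, threshold, min_length):
    """Runs of consecutive forecast points above `threshold` lasting at
    least `min_length` samples, as (start, end) index pairs."""
    periods, start = [], None
    for i, v in enumerate(forecast + [threshold]):  # sentinel closes a trailing run
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            if i - start >= min_length:
                periods.append((start, i - 1))
            start = None
    return periods
```

For example, `violation_periods(forecast_moving_average([100, 120, 140], steps=4), threshold=125, min_length=2)` reports the sustained run of forecast points above the threshold; runs shorter than `min_length` are ignored, mirroring the alarm behavior described above.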

Plotting the Forecasts

Forecasting CPU usage using the Moving Average method reveals a time period where usage exceeds the threshold of 130% for more than 50 milliseconds.
The memory usage forecast does not indicate any problematic timestamps.
There is a time period where disk usage is forecasted to exceed the threshold of 275 MB/s for more than 50 milliseconds.

Interpreting the Results

Interpreting the forecast itself is relatively straightforward. However, conducting a more detailed and precise analysis can be challenging, as it requires considering multiple factors, such as the number of violations, the duration of each violation, and the recommended optimizations or actions to take. With this module's interpretation feature, you can access all this information for each metric (e.g., CPU or memory).

Resource Optimization

System resources such as CPU, memory, and disk I/O are fundamental components that directly impact overall system performance. Inefficient resource utilization can lead to various issues, including performance bottlenecks, increased latency, and unnecessary costs. The modules in this group help identify underutilized resources and provide optimization recommendations to improve system efficiency. This includes analyzing idle periods of different resources, detecting load imbalances, and more.

Idle Resource Detection

Each system resource may experience idle periods, which are generally normal. However, if these idle periods exceed a specific duration, it may indicate that the resources are underutilized and could require adjustments. This module analyzes the idle status of each system resource individually and provides a more detailed analysis of CPU scheduling.

Initializing the Module

Indicating Idle Resources

You can define specific thresholds for each resource (i.e., CPU, memory, and disk usage) as well as a threshold for idle time, which indicates an idle period when resource usage remains below the defined threshold for a specified duration.
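The underlying idle-period logic can be sketched independently of TMLL: scan the usage series for runs that stay below the usage threshold for long enough (an illustrative toy with hypothetical names, not the module's API):

```python
def idle_periods(usage, usage_threshold, min_samples):
    """(start, end) index pairs where usage stays below `usage_threshold`
    for at least `min_samples` consecutive samples."""
    periods, start = [], None
    for i, v in enumerate(usage + [usage_threshold]):  # sentinel flushes a trailing run
        if v < usage_threshold and start is None:
            start = i
        elif v >= usage_threshold and start is not None:
            if i - start >= min_samples:
                periods.append((start, i - 1))
            start = None
    return periods

# CPU usage (%): one idle stretch of four samples below 5%
cpu = [40, 35, 2, 1, 0, 3, 38, 42]
print(idle_periods(cpu, usage_threshold=5, min_samples=3))  # → [(2, 5)]
```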

Plotting the Idle Resource Results

The CPU utilization over time, highlighting one idle period lasting more than 750 milliseconds.
The memory usage over time, highlighting one idle period lasting more than 750 milliseconds.
The disk usage over time, highlighting one idle period lasting more than 750 milliseconds.

Analyzing CPU Scheduling

In addition to general idle resource analysis, this module provides a detailed analysis of CPU scheduling, offering insights into the characteristics of each CPU core. You can identify the most resource-intensive processes or tasks on each core, the number of context switches performed, how load balancing was managed among the cores, and more.
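One simple way to quantify how well load balancing worked (an illustrative metric, not necessarily the one TMLL reports) is the spread of per-core busy time relative to the mean:

```python
def load_imbalance(core_busy_times):
    """Relative imbalance: (max - min) / mean of per-core busy time.
    0 means perfectly balanced; larger values mean more skew."""
    mean = sum(core_busy_times) / len(core_busy_times)
    return (max(core_busy_times) - min(core_busy_times)) / mean

print(round(load_imbalance([50, 50, 50, 50]), 2))  # → 0.0
print(round(load_imbalance([90, 10, 50, 50]), 2))  # → 1.6
```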

Plotting the CPU Scheduling Results

The CPU utilization heatmap for each core over time illustrates how load balancing is managed across the cores.
The most resource-intensive tasks on the first CPU core over time.
The usage distribution of the top 25 tasks on the first CPU core, expressed as a percentage of total usage.
The most resource-intensive tasks on the second CPU core over time.
The usage distribution of the top 25 tasks on the second CPU core, expressed as a percentage of total usage.

Interpreting the Results

You can access detailed information about each resource, as well as CPU scheduling results, using the interpretation method. Based on these results, the method also provides various optimization recommendations to help you improve the efficiency of the system's resources.
