Quickstart
Let's see how we can use TMLL modules quickly!
Initializing TMLL
Installing Trace Server
As indicated, TMLL leverages the outputs produced by Trace Server. Hence, you need to have Trace Server installed and running on your machine while using TMLL. Using this link, you can download the appropriate version of Trace Server for your machine. Once downloaded, run its executable.
Creating a TMLL Client
Before using any specific module, you must initialize a TMLL client. The client manages trace files, including importing them, creating experiments, fetching outputs, and performing other related tasks.
from tmll.tmll_client import TMLLClient
# This client object will be used in the subsequent steps
client = TMLLClient(verbose=True)
Importing Traces
The primary input for TMLL consists of trace files. Assuming you have already collected your trace files, importing them into the TMLL client is a simple process.
# We create an experiment from the trace files. This experiment will also be used in the subsequent steps.
experiment = client.create_experiment(traces=[
    {
        "path": "/path/to/the/first/trace",  # Required
        "name": "your_custom_name"  # Optional. If absent, a default name will be assigned
    },
    {
        "path": "/path/to/the/second/trace"
    }
], experiment_name="EXPERIMENT_NAME")
From this point forward, we will pass the client and experiment objects to the modules we want to use. To clarify, the client is responsible for communicating with the Trace Server, while the experiment handles various trace outputs (e.g., CPU and memory usage, flame charts, graphs, etc.).
Anomaly Detection
This group includes modules designed to identify abnormal data points (i.e., timestamps), with each module offering insights into anomalies from different perspectives.
TMLL does not guarantee that the identified data points are actual anomalies within the system, as it lacks access to the system's contextual information and cannot distinguish between normal and abnormal data points due to the unlabeled nature of trace files. Instead, TMLL uses statistical procedures and calculations to highlight data points that behave significantly differently from the rest, making them strong candidates for further investigation.
Anomaly Detection
The goal is to identify data points or time periods that deviate significantly from the system's general behavior.
Initializing the Module
Finding Anomalies
Currently, TMLL supports the following methods to pinpoint anomalies in the trace data:
zscore: Identifies anomalies based on how many standard deviations a data point is from the mean.
iqr: Detects anomalies by finding data points that lie outside the interquartile range (IQR).
moving_average: Flags anomalies based on deviations from a smoothed average over a moving window.
combined: Combines zscore, iqr, and moving_average for more robust detection results, though it may discard data points that only some of the methods flag.
iforest: Uses the Isolation Forest algorithm to isolate and identify anomalous data points.
seasonality: Detects anomalies by identifying deviations from expected seasonal patterns in the data. This is best suited to seasonal data, such as a system's periodic tasks.
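To illustrate the idea behind the first two methods, here is a minimal sketch of z-score and IQR detection on synthetic data. This shows the underlying statistics only; it is not TMLL's implementation or API.

```python
import numpy as np

def zscore_anomalies(x: np.ndarray, threshold: float = 2.0) -> np.ndarray:
    """Indices of points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.flatnonzero(np.abs(z) > threshold)

def iqr_anomalies(x: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Indices of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return np.flatnonzero((x < q1 - k * iqr) | (x > q3 + k * iqr))

# A mostly flat CPU-usage-like series with one injected spike at index 50
rng = np.random.default_rng(0)
usage = rng.normal(30.0, 2.0, 100)
usage[50] = 95.0

print(zscore_anomalies(usage))  # the spike at index 50 stands out
print(iqr_anomalies(usage))
```

Both methods flag the injected spike; they differ in how sensitive they are to the rest of the distribution, which is why the combined mode cross-checks them.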
Plotting the Results
Assuming the anomaly detection module is initialized with CPU and Disk usage data, and the methodology is set to z-score (threshold = 2), anomalies can be visualized using the plot_anomalies method.
Memory Leak Detection
Poor memory management strategies can lead to memory leaks, causing significant damage to the system as the program runs for extended periods. This module identifies and highlights issues such as unfreed memory allocations or steadily increasing memory usage.
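The "steadily increasing memory usage" signal can be sketched with a simple trend test: fit a line to the usage series and flag it when the slope stays positive. This is an illustration of the concept on synthetic data, not TMLL's own detection logic.

```python
import numpy as np

def leak_suspected(memory_kb: np.ndarray, min_slope: float = 1.0) -> bool:
    """Flag a potential leak when memory usage grows steadily over time.

    Fits a least-squares line and returns True when the slope (kB per
    sample) exceeds `min_slope`, i.e. usage trends upward instead of
    returning to a baseline.
    """
    t = np.arange(len(memory_kb))
    slope = np.polyfit(t, memory_kb, 1)[0]
    return bool(slope > min_slope)

rng = np.random.default_rng(1)
steady = 5000 + rng.normal(0, 50, 200)       # hovers around a baseline
leaking = steady + 3.0 * np.arange(200)      # grows ~3 kB per sample

print(leak_suspected(steady), leak_suspected(leaking))
```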
Initializing the Module
Indicating the Memory Leaks
Plotting the Results
Interpreting the Results
Interpreting the analysis results from this module can be challenging for users who are not experts in analyzing memory behaviors and indicators. To assist users, TMLL offers an interpretation method that provides detailed analysis results, recommendations, and insights.
Root Cause Analysis
In the event of unexpected behaviors, the modules in this group can assist in identifying their root causes. This allows you to understand the causality and correlation between system components and their behavior over time.
Correlation Analysis
System components often influence each other's behavior, creating a chain of interdependent actions. For example, a CPU spike at a specific timestamp might result from a particular disk activity operation. This module enables you to analyze and understand the correlations between system components.
Initializing the Module
Finding the Correlations
TMLL automatically selects an appropriate correlation methodology (e.g., Pearson, Spearman, or Kendall) based on the characteristics of the data distributions. This allows you to determine the correlation for each pair of system components according to their specific attributes.
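As a rough sketch of such automatic selection, one common heuristic is to use Pearson for roughly normal data and fall back to the rank-based Spearman otherwise. The normality-test criterion below is an assumption for illustration; TMLL's actual selection logic may differ.

```python
import numpy as np
from scipy import stats

def correlate(x: np.ndarray, y: np.ndarray, alpha: float = 0.05):
    """Pick Pearson for roughly normal data, Spearman otherwise."""
    normal = stats.shapiro(x)[1] > alpha and stats.shapiro(y)[1] > alpha
    if normal:
        return "pearson", stats.pearsonr(x, y)[0]
    return "spearman", stats.spearmanr(x, y)[0]

rng = np.random.default_rng(2)
cpu = rng.normal(50, 5, 100)
disk = 0.8 * cpu + rng.normal(0, 2, 100)   # strongly tied to CPU usage
method, r = correlate(cpu, disk)
print(method, round(r, 2))   # a strong positive correlation
```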
Plotting the Correlation Matrix

You can also generate a time-series plot comparison for the metrics to observe their behaviors over time.

Correlation Lag Analysis
The impact of different system components may occur with a delay, resembling a chain of actions where one component influences another step by step. For example, a spike in CPU usage might lead to an increase in memory usage after a short delay, rather than both events occurring simultaneously. TMLL provides an option to identify the lag between each component.
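The lag between two metrics can be estimated with cross-correlation: shift one series against the other and keep the shift that correlates best. This is a minimal sketch on synthetic data, not TMLL's API.

```python
import numpy as np

def best_lag(leader: np.ndarray, follower: np.ndarray, max_lag: int = 20) -> int:
    """Lag (in samples) at which `follower` correlates best with `leader`.

    A result of k means `follower` reacts k samples after `leader`.
    """
    lags = range(1, max_lag + 1)
    scores = [np.corrcoef(leader[:-lag], follower[lag:])[0, 1] for lag in lags]
    return int(np.argmax(scores)) + 1

rng = np.random.default_rng(3)
cpu = rng.normal(40, 5, 300)
memory = np.roll(cpu, 5) + rng.normal(0, 1, 300)  # echoes CPU 5 samples later
print(best_lag(cpu, memory))  # recovers the 5-sample delay
```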

Performance Trend Analysis
Programs can experience performance shifts, behavioral changes, or unexpected regressions. Identifying these trends can be challenging, especially when dealing with numerous system components. The modules in this group are designed to uncover performance trends in trace data that traditional analyses might overlook.
Change Point Detection
This module detects moments when the statistical properties of a system metric change significantly, helping users identify sudden shifts in performance metrics such as CPU usage spikes or increased latency. Additionally, by aggregating different metrics (e.g., using PCA, Z-score, or voting), TMLL can more effectively pinpoint significant change points that may be difficult to identify through manual analysis of trace data.
Initializing the Module
Indicating the Change Points
You can pinpoint the significant change points based on different parameters.
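The core idea behind a single change point can be sketched as follows: score every split of the series by how much it reduces the within-segment variance (the building block of binary segmentation). This is an illustration on synthetic data, not TMLL's implementation.

```python
import numpy as np

def change_point(x: np.ndarray) -> int:
    """Index that best splits the series into two segments with different means.

    Scores every split by the drop in total within-segment variance;
    the split with the largest drop is the change point.
    """
    n = len(x)
    best_i, best_gain = 0, -np.inf
    total_cost = n * x.var()
    for i in range(2, n - 2):
        cost = i * x[:i].var() + (n - i) * x[i:].var()
        gain = total_cost - cost
        if gain > best_gain:
            best_i, best_gain = i, gain
    return best_i

rng = np.random.default_rng(4)
latency = np.concatenate([rng.normal(10, 1, 80), rng.normal(18, 1, 120)])
print(change_point(latency))   # near index 80, where latency jumps
```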
Plotting the Change Points
Predictive Maintenance
The future characteristics of a system often depend on its historical behavior. By leveraging this historical data, it becomes possible to predict various aspects of the system's future, helping to prevent unexpected issues. This group includes features such as forecasting upcoming resource usage, building performance models to detect regressions, creating alarm systems, and more.
Capacity Planning (Forecasting)
The performance characteristics of system resources (i.e., CPU, memory, and disk usage) are fundamental to the overall system performance. These characteristics must remain stable, as unexpected increases in resource usage can lead to performance regressions. This module is designed to forecast the future usage of various system resources based on their historical observations and report any violations to you.
Initializing the Module
Forecasting Resource Usage
To forecast different system components (e.g., CPU, memory, and disk usage), you can optionally specify custom threshold values for each resource. These thresholds define the maximum acceptable usage for each resource. If the forecast predicts usage exceeding the threshold, the module will flag it as a violation. If this violation persists beyond a defined time period (e.g., 10 ms, 15 minutes, 1 hour, etc.), it will be raised as an alarm. You can choose from various forecasting methods, including AutoRegressive Integrated Moving Average (ARIMA), Vector AutoRegressive (VAR), or Moving Average.
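The threshold-violation idea can be sketched with a very simple forecaster, here a least-squares trend extrapolation rather than the ARIMA/VAR models TMLL offers. The threshold value and data are synthetic, for illustration only.

```python
import numpy as np

def forecast_trend(history: np.ndarray, steps: int) -> np.ndarray:
    """Least-squares trend extrapolation of the observed history."""
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, history, 1)
    future_t = np.arange(len(history), len(history) + steps)
    return slope * future_t + intercept

def violations(forecast: np.ndarray, threshold: float) -> np.ndarray:
    """Forecast steps whose predicted usage exceeds the acceptable maximum."""
    return np.flatnonzero(forecast > threshold)

history = np.linspace(60, 78, 40)   # CPU usage climbing toward its limit
future = forecast_trend(history, steps=15)
print(violations(future, threshold=80.0))  # steps where the 80% limit is breached
```

An alarm would then be raised only if such violations persist for longer than the configured time period, rather than for a single predicted spike.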
Plotting the Forecasts
Interpreting the Results
Interpreting the forecast itself is relatively straightforward. However, conducting a more detailed and precise analysis can be challenging, as it requires considering multiple factors, such as the number of violations, the duration of each violation, and the recommended optimizations or actions to take. With this module's interpretation feature, you can access all this information for each metric (e.g., CPU or memory).
Resource Optimization
System resources such as CPU, memory, and disk I/O are fundamental components that directly impact overall system performance. Inefficient resource utilization can lead to various issues, including performance bottlenecks, increased latency, and unnecessary costs. The modules in this group help identify underutilized resources and provide optimization recommendations to improve system efficiency. This includes analyzing idle periods of different resources, detecting load imbalances, and more.
Idle Resource Detection
Each system resource may experience idle periods, which are generally normal. However, if these idle periods exceed a specific duration, it may indicate that the resources are underutilized and could require adjustments. This module analyzes the idle status of each system resource individually and provides a more detailed analysis of CPU scheduling.
Initializing the Module
Indicating Idle Resources
You can define specific thresholds for each resource (i.e., CPU, memory, and disk usage) as well as a threshold for idle time, which indicates an idle period when resource usage remains below the defined threshold for a specified duration.
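The two-threshold rule described above (a usage threshold plus a minimum duration) can be sketched like this. The sample data and thresholds are illustrative, not TMLL defaults.

```python
import numpy as np

def idle_periods(usage: np.ndarray, usage_threshold: float, min_samples: int):
    """(start, end) index pairs where usage stays below `usage_threshold`
    for at least `min_samples` consecutive samples."""
    below = usage < usage_threshold
    periods, start = [], None
    for i, b in enumerate(below):
        if b and start is None:
            start = i
        elif not b and start is not None:
            if i - start >= min_samples:
                periods.append((start, i))
            start = None
    if start is not None and len(below) - start >= min_samples:
        periods.append((start, len(below)))
    return periods

cpu = np.array([40, 35, 3, 2, 1, 2, 38, 45, 4, 50])
print(idle_periods(cpu, usage_threshold=5.0, min_samples=3))  # [(2, 6)]
```

The short dip at index 8 is ignored because it does not last the required number of samples.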
Plotting the Idle Resources Results
Analyzing CPU Scheduling
In addition to general idle resource analysis, this module provides a detailed analysis of CPU scheduling, offering insights into the characteristics of each CPU core. You can identify the most resource-intensive processes or tasks on each core, the number of context switches performed, how load balancing was managed among the cores, and more.
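As one example of a load-balancing indicator, per-core busy times can be compared against their mean; this is a simple illustrative metric, not the exact measure TMLL computes.

```python
import numpy as np

def load_imbalance(busy_time_per_core: np.ndarray) -> float:
    """Imbalance as the max deviation from the mean busy time, relative to the mean.

    0.0 means perfectly balanced cores; larger values mean some cores do
    disproportionately more work.
    """
    mean = busy_time_per_core.mean()
    return float(np.max(np.abs(busy_time_per_core - mean)) / mean)

balanced = np.array([95.0, 100.0, 105.0, 100.0])   # busy seconds per core
skewed = np.array([180.0, 60.0, 80.0, 80.0])       # one core dominates
print(round(load_imbalance(balanced), 2), round(load_imbalance(skewed), 2))  # 0.05 0.8
```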
Plotting the CPU Scheduling Results
Interpreting the Results
You can access detailed information about each resource, as well as CPU scheduling results, using the interpretation method. Based on these results, the method also provides various optimization recommendations to help you improve the efficiency of the system's resources.