SharkFest 2014 - Packet Analysis and Visualization with SteelScript

This presentation was delivered at SharkFest on June 17, 2014 by Christopher J. White. The source file SharkFest2014.ipynb is an IPython Notebook.

Overview

  • Visualizing with SteelScript Application Framework
  • Tools in my toolbox
  • Python Pandas
  • PCAP Analysis with SteelScript

SteelScript Application Framework

PCAP File: /ws/sharkfest2014/oneday.pcap

  • ip.len field over time
  • 95th and 80th percentile
  • Exponential Weighted Moving Average (EWMA); a short pandas sketch of these computations follows below
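
These statistics are straightforward to compute with pandas, which is introduced later in this talk. A minimal sketch, using random data as a hypothetical stand-in for the ip.len field:

import numpy, pandas

# Hypothetical stand-in for ip.len: one value per packet
iplen = pandas.Series(numpy.random.randint(40, 1500, 1000))

print iplen.quantile(0.95)          # 95th percentile
print iplen.quantile(0.80)          # 80th percentile
ewma = pandas.ewma(iplen, span=20)  # EWMA (newer pandas: iplen.ewm(span=20).mean())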

Tools: IPython

  • Powerful interactive shells (terminal and Qt-based).
  • A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
  • Support for interactive data visualization and use of GUI toolkits.
  • Flexible, embeddable interpreters to load into your own projects.
  • Easy to use, high performance tools for parallel computing.

Installation

> pip install ipython
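
Then start the browser-based notebook server with:

> ipython notebook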

Tools: pandas - Python Data Analysis Library

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

  • Series - array of data with an optional index
  • DataFrame - 2D array of data with hierarchical row and column indexing

Installation

On Linux / Mac with dev tools installed:

> pip install pandas

Otherwise see pandas.pydata.org

Tools: matplotlib - Python Plotting

matplotlib hooks into the IPython notebook to provide in-browser graphs.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2)
y = np.sin(4 * np.pi * x) * np.exp(-5 * x)

plt.fill(x, y, 'r')
plt.grid(True)
plt.show()

Tools: SteelScript Application Framework

  • Web front end for simple user interface
  • Design custom reports
    • mix and match widgets and data
    • define custom criteria
  • Custom analysis via Python hooks and Python Pandas
    • compute statistics, pivot tables, merge, sort, resample timeseries
  • Plugin architecture makes it easy to share modules

pandas

Primary object types:

  • Series - 1-dimensional array of elements
  • DataFrame - 2-dimensional array of elements

Data is stored in a compact binary form for efficient operation.

Supports both row and column indexing by name.

pandas: Series

A Series is similar to a standard Python list, but much more efficient in both memory and computation.

In [2]:
import pandas, numpy
s = pandas.Series([10, 23, 19, 15, 56, 15, 41])
print type(s)
<class 'pandas.core.series.Series'>

In [3]:
s
Out[3]:
0    10
1    23
2    19
3    15
4    56
5    15
6    41
dtype: int64
In [4]:
s.sum(), s.min(), s.max(), s.mean()
Out[4]:
(179, 10, 56, 25.571428571428573)

Consider processing 1,000,000 random entries in a standard list:

In [5]:
%%time
s = list(numpy.random.randn(1000000))
CPU times: user 147 ms, sys: 19.1 ms, total: 166 ms
Wall time: 166 ms

In [6]:
%%time
print min(s), max(s), sum(s)
-5.15466016548 4.83339730438 1070.4119682
CPU times: user 384 ms, sys: 966 µs, total: 385 ms
Wall time: 384 ms

Now, consider processing 1,000,000 random entries in a pandas Series:

In [7]:
%%time
s = pandas.Series(numpy.random.randn(1000000))
len(s)
CPU times: user 153 ms, sys: 12.6 ms, total: 166 ms
Wall time: 166 ms

In [8]:
%%time
print s.min(), s.max(), s.sum()
-4.88164961037 4.76370359345 1151.10605833
CPU times: user 19.6 ms, sys: 5.19 ms, total: 24.8 ms
Wall time: 24.1 ms

In [9]:
del s

pandas: DataFrame

A DataFrame is typically loaded from a file, but may be created from a list of lists:

In [10]:
import pandas
df = pandas.DataFrame(
    [['Boston', '10.1.1.1', 10, 2356, 0.100],
     ['Boston', '10.1.1.2', 23, 16600, 0.112],
     ['Boston', '10.1.1.15', 19, 22600, 0.085],
     ['SanFran', '10.38.5.1', 15, 10550, 0.030],
     ['SanFran', '10.38.8.2', 56, 35000, 0.020],
     ['London', '192.168.4.6', 15, 3400, 0.130],
     ['London', '192.168.5.72', 41, 55000, 0.120]],
     columns = ['location', 'ip', 'pkts', 'bytes', 'rtt'])

Each column's data type is automatically detected based on contents.

In [11]:
df.dtypes
Out[11]:
location     object
ip           object
pkts          int64
bytes         int64
rtt         float64
dtype: object

IPython has special support for displaying DataFrames

In [12]:
df
Out[12]:
location ip pkts bytes rtt
0 Boston 10.1.1.1 10 2356 0.100
1 Boston 10.1.1.2 23 16600 0.112
2 Boston 10.1.1.15 19 22600 0.085
3 SanFran 10.38.5.1 15 10550 0.030
4 SanFran 10.38.8.2 56 35000 0.020
5 London 192.168.4.6 15 3400 0.130
6 London 192.168.5.72 41 55000 0.120

7 rows × 5 columns

pandas: DataFrame operations

The DataFrame supports a number of operations directly

In [13]:
df.mean()
Out[13]:
pkts        25.571429
bytes    20786.571429
rtt          0.085286
dtype: float64

Notice that location and ip are string columns; there is no 'mean' operation for such columns, so they are omitted from the result.

In [14]:
df.sum()
Out[14]:
location         BostonBostonBostonSanFranSanFranLondonLondon
ip          10.1.1.110.1.1.210.1.1.1510.38.5.110.38.8.2192...
pkts                                                      179
bytes                                                  145506
rtt                                                     0.597
dtype: object

In the case of sum(), it is possible to add strings, so location and ip are included in the results. (Be careful, on large data sets this can backfire!)
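
To restrict an aggregation to the numeric columns, one option (a minimal sketch using the df defined above) is to select them explicitly, or pass numeric_only:

df[['pkts', 'bytes', 'rtt']].sum()   # select the numeric columns explicitly
df.sum(numeric_only=True)            # or have pandas skip non-numeric columns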

pandas: Selecting a Column

Each column of data in a DataFrame can be extracted by the standard indexing operator:

df[colname]

The result is a Series.

In [15]:
df['location']
Out[15]:
0     Boston
1     Boston
2     Boston
3    SanFran
4    SanFran
5     London
6     London
Name: location, dtype: object
In [16]:
df['pkts']
Out[16]:
0    10
1    23
2    19
3    15
4    56
5    15
6    41
Name: pkts, dtype: int64

pandas: Selecting Rows

Selecting a subset of rows by index works just like array slicing:

df[start:end]

Note that the : is always required, while start and end are optional. This returns all rows with an index greater than or equal to start, up to but not including end.

In [17]:
df[:2]
Out[17]:
location ip pkts bytes rtt
0 Boston 10.1.1.1 10 2356 0.100
1 Boston 10.1.1.2 23 16600 0.112

2 rows × 5 columns

In [18]:
df[2:4]
Out[18]:
location ip pkts bytes rtt
2 Boston 10.1.1.15 19 22600 0.085
3 SanFran 10.38.5.1 15 10550 0.030

2 rows × 5 columns

pandas: Filtering Rows

DataFrame rows can be filtered using boolean expressions

df[boolean_expression]
In [19]:
df[df['location'] == 'Boston']
Out[19]:
location ip pkts bytes rtt
0 Boston 10.1.1.1 10 2356 0.100
1 Boston 10.1.1.2 23 16600 0.112
2 Boston 10.1.1.15 19 22600 0.085

3 rows × 5 columns

In [20]:
df[df['pkts'] < 20]
Out[20]:
location ip pkts bytes rtt
0 Boston 10.1.1.1 10 2356 0.100
2 Boston 10.1.1.15 19 22600 0.085
3 SanFran 10.38.5.1 15 10550 0.030
5 London 192.168.4.6 15 3400 0.130

4 rows × 5 columns

Expressions can be combined to provide more complex filtering:

In [21]:
df[(df['location'] == 'Boston') & (df['pkts'] < 20)]
Out[21]:
location ip pkts bytes rtt
0 Boston 10.1.1.1 10 2356 0.100
2 Boston 10.1.1.15 19 22600 0.085

2 rows × 5 columns

The boolean expression is actually a Series of True/False values with the same length as the number of rows, so it can be assigned to a variable like any other Series:

In [22]:
bos = df['location'] == 'Boston'
print type(bos)
bos
<class 'pandas.core.series.Series'>

Out[22]:
0     True
1     True
2     True
3    False
4    False
5    False
6    False
Name: location, dtype: bool
In [23]:
pkts_lt_20 = (df['pkts'] < 20)
df[bos & pkts_lt_20]
Out[23]:
location ip pkts bytes rtt
0 Boston 10.1.1.1 10 2356 0.100
2 Boston 10.1.1.15 19 22600 0.085

2 rows × 5 columns

Use single "&" and "|" operators together with parentheses to construct arbitrary boolean expressions, and "~" for negation, as sketched below.
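
For instance, a quick sketch using the bos and pkts_lt_20 Series defined above:

df[~bos]                # all rows outside Boston
df[bos | pkts_lt_20]    # Boston rows, plus any row with pkts < 20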

The result of a filtering operation is a new DataFrame object

In [24]:
df2 = df[bos & pkts_lt_20]
print 'Length:', len(df2)
df2
Length: 2

Out[24]:
location ip pkts bytes rtt
0 Boston 10.1.1.1 10 2356 0.100
2 Boston 10.1.1.15 19 22600 0.085

2 rows × 5 columns

pandas: Selecting Multiple Columns

A new DataFrame can be constructed from a subset of another DataFrame using the ix[] operator.

DataFrame.ix[row_expr, column_list]
In [25]:
locrtt = df.ix[:,['location', 'rtt']]
locrtt
Out[25]:
location rtt
0 Boston 0.100
1 Boston 0.112
2 Boston 0.085
3 SanFran 0.030
4 SanFran 0.020
5 London 0.130
6 London 0.120

7 rows × 2 columns

The first argument to the ix[] indexer can also be a boolean Series, which makes it possible to filter rows and select columns in a single operation. The ":" used above is shorthand for all rows.

In [26]:
boslocrtt = df.ix[bos,['location', 'rtt']]
boslocrtt
Out[26]:
location rtt
0 Boston 0.100
1 Boston 0.112
2 Boston 0.085

3 rows × 2 columns

pandas: Adding Columns

New columns (Series) can be added to a DataFrame in one of three ways:

  • A constant value for all rows
  • Supplying a list matching the number of rows
  • Expression based on constants and Series objects of the same size
In [27]:
df['co'] = 'RVBD'
df['proto'] = ['tcp', 'tcp', 'udp', 'udp', 'tcp', 'tcp', 'udp']
df['Bpp'] = df['bytes'] / df['pkts']   # Bytes per packet
df['bpp'] = 8 * df['bytes'] / df['pkts']
df['slow'] = df['rtt'] >= 0.1
df.ix[:,['ip', 'proto', 'Bpp', 'bpp', 'slow', 'delay', 'co']]
Out[27]:
ip proto Bpp bpp slow delay co
0 10.1.1.1 tcp 235.600000 1884.800000 True NaN RVBD
1 10.1.1.2 tcp 721.739130 5773.913043 True NaN RVBD
2 10.1.1.15 udp 1189.473684 9515.789474 False NaN RVBD
3 10.38.5.1 udp 703.333333 5626.666667 False NaN RVBD
4 10.38.8.2 tcp 625.000000 5000.000000 False NaN RVBD
5 192.168.4.6 tcp 226.666667 1813.333333 True NaN RVBD
6 192.168.5.72 udp 1341.463415 10731.707317 True NaN RVBD

7 rows × 7 columns

pandas: Grouping and Aggregating

A common operation is to group data by a key column or a set of key columns. This groups rows that share the same value for all key columns, producing one row per unique key combination:

df.groupby(key_columns)

For example, to group by the location column:

In [28]:
gb = df.groupby('location')
print type(gb)
gb.indices
<class 'pandas.core.groupby.DataFrameGroupBy'>

Out[28]:
{'Boston': array([0, 1, 2]), 'London': array([5, 6]), 'SanFran': array([3, 4])}

The result of a grouping operation is a GroupBy object. The indices attribute above shows the row numbers for the grouped data. The data in a GroupBy object cannot be inspected directly until it is aggregated.

The groupby() operation is usually followed by an aggregate() call to combine the values of all related rows in each column.

In one form, the aggregate() call takes a dictionary indicating the operation to perform on each column to combine values:

GroupBy.aggregate({<colname>: <operation>,
                   <colname>: <operation>,
                   ... })
                   

Only columns listed in the dictionary will be returned in the resulting DataFrame.

In [29]:
gb.aggregate({'pkts': 'sum',
              'rtt': 'mean'})
Out[29]:
pkts rtt
location
Boston 52 0.099
London 56 0.125
SanFran 71 0.025

3 rows × 2 columns

By copying columns, it is possible to compute alternate operations on the same data, producing different aggregated results:

In [30]:
df2 = df.ix[:,['location', 'pkts', 'rtt']]
df2['peak_rtt'] = df['rtt']
df2['min_rtt'] = df['rtt']
df2
Out[30]:
location pkts rtt peak_rtt min_rtt
0 Boston 10 0.100 0.100 0.100
1 Boston 23 0.112 0.112 0.112
2 Boston 19 0.085 0.085 0.085
3 SanFran 15 0.030 0.030 0.030
4 SanFran 56 0.020 0.020 0.020
5 London 15 0.130 0.130 0.130
6 London 41 0.120 0.120 0.120

7 rows × 5 columns

In [31]:
agg = (df2.groupby('location')
          .aggregate({'pkts': 'sum',
                      'rtt': 'mean',
                      'peak_rtt': 'max',
                      'min_rtt': 'min'}))
agg
Out[31]:
pkts rtt min_rtt peak_rtt
location
Boston 52 0.099 0.085 0.112
London 56 0.125 0.120 0.130
SanFran 71 0.025 0.020 0.030

3 rows × 4 columns

pandas: Indexing

Notice that the output of the previous groupby/aggregate operation looks a bit different from earlier DataFrames; the first column is shown in bold:

In [32]:
agg
Out[32]:
pkts rtt min_rtt peak_rtt
location
Boston 52 0.099 0.085 0.112
London 56 0.125 0.120 0.130
SanFran 71 0.025 0.020 0.030

3 rows × 4 columns

The bold rows/columns indicate that the DataFrame is indexed.

Let's look at the DataFrame without the index by calling reset_index():

In [33]:
agg.reset_index()
Out[33]:
location pkts rtt min_rtt peak_rtt
0 Boston 52 0.099 0.085 0.112
1 London 56 0.125 0.120 0.130
2 SanFran 71 0.025 0.020 0.030

3 rows × 5 columns

Indexing a DataFrame turns the data in the indexed column into metadata associated with the object. The index is carried over to extracted Series objects:

In [34]:
agg['pkts']
Out[34]:
location
Boston      52
London      56
SanFran     71
Name: pkts, dtype: int64

The resulting Series still acts like an array, but may be indexed by either a numeric row position or a location name:

In [35]:
print "Boston:", agg['pkts']['Boston']
print "Item 0:", agg['pkts'][0]
Boston: 52
Item 0: 52

In [36]:
df = df.ix[:,['location', 'ip', 'pkts', 'bytes', 'rtt', 'slow']]

pandas: Modifying a Subset of Rows

Often it is useful to assign values to a subset of rows in a column. This is done by assigning to the result of the ix[] indexer:

DataFrame.ix[row_expr, column_list] = new_value

When used in this form, the DataFrame object indexed is modified in place.

Let's use this method to assign a value of 'slow', 'normal', or 'fast' to a new 'delay' column based on rtt:

In [37]:
df['delay'] = ''
df
Out[37]:
location ip pkts bytes rtt slow delay
0 Boston 10.1.1.1 10 2356 0.100 True
1 Boston 10.1.1.2 23 16600 0.112 True
2 Boston 10.1.1.15 19 22600 0.085 False
3 SanFran 10.38.5.1 15 10550 0.030 False
4 SanFran 10.38.8.2 56 35000 0.020 False
5 London 192.168.4.6 15 3400 0.130 True
6 London 192.168.5.72 41 55000 0.120 True

7 rows × 7 columns

Compute boolean Series for each range of rtt:

In [38]:
slow_rows = (df['rtt'] > 0.110)
normal_rows = ((df['rtt'] > 0.050) & (~slow_rows))
fast_rows = ((~slow_rows) & (~normal_rows))
In [39]:
df.ix[slow_rows,'delay'] = 'slow'
df.ix[normal_rows, 'delay'] = 'normal'
df.ix[fast_rows, 'delay'] = 'fast'
df
Out[39]:
location ip pkts bytes rtt slow delay
0 Boston 10.1.1.1 10 2356 0.100 True normal
1 Boston 10.1.1.2 23 16600 0.112 True slow
2 Boston 10.1.1.15 19 22600 0.085 False normal
3 SanFran 10.38.5.1 15 10550 0.030 False fast
4 SanFran 10.38.8.2 56 35000 0.020 False fast
5 London 192.168.4.6 15 3400 0.130 True slow
6 London 192.168.5.72 41 55000 0.120 True slow

7 rows × 7 columns

pandas: Unstacking Data

Unstacking data is about pivoting data based on row values. For example, let's say we want to compute the total bytes for each delay category by location:

             slow   normal   fast
  Boston       ?       ?       ?
  SanFran      ?       ?       ?
  London       ?       ?       ?

Where the ? in each cell is the total bytes for that combination.

First we groupby both location and delay, aggregating the bytes column:

In [40]:
loc_delay_bytes = (df.groupby(['location', 'delay'])
                     .aggregate({'bytes': 'sum'}))
loc_delay_bytes
Out[40]:
                 bytes
location delay
Boston   normal  24956
         slow    16600
London   slow    58400
SanFran  fast    45550

4 rows × 1 columns

The resulting DataFrame has two index columns, location and delay.
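
One way to confirm this is to inspect the index names:

loc_delay_bytes.index.names   # -> ['location', 'delay']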

Now the unstack() function is used to create a column for each unique value of delay:

In [41]:
loc_delay_bytes = loc_delay_bytes.unstack('delay')

Let's fill in zeros for the combinations that were not present:

In [42]:
loc_delay_bytes = loc_delay_bytes.fillna(0)
loc_delay_bytes
Out[42]:
           bytes
delay       fast  normal   slow
location
Boston         0   24956  16600
London         0       0  58400
SanFran    45550       0      0

3 rows × 3 columns

SteelScript for Python

SteelScript is a collection of Python packages that provide libraries for collecting, processing, and visualizing data from a variety of sources. Most packages focus on network performance.

Getting SteelScript and the Wireshark extension

Core SteelScript is available on PyPI:

> pip install steelscript
> pip install steelscript.wireshark

The bleeding edge is available on GitHub:

  • http://github.com/riverbed/steelscript/steelscript
  • http://github.com/riverbed/steelscript/steelscript.wireshark

SteelScript: Reading Packet Capture Files

In [43]:
from steelscript.wireshark.core.pcap import PcapFile

pcap = PcapFile('/ws/traces/net-2009-11-18-17_35.pcap')
pcap.info()
print pcap.starttime
print pcap.endtime
print pcap.numpackets
2009-11-18 17:35:54-08:00
2009-11-19 09:47:05-08:00
302030

Perform a query on the PCAP file

In [44]:
pdf = pcap.query(['frame.time_epoch', 'ip.src', 'ip.dst', 'ip.len', 'ip.proto'],
                starttime = pcap.starttime,
                duration='1min',
                as_dataframe=True)
pdf = pdf[~(pdf['ip.len'].isnull())]
print len(pdf), "packets loaded"
5529 packets loaded

In [45]:
pdf[:10]
Out[45]:
frame.time_epoch ip.src ip.dst ip.len ip.proto
0 2009-11-18 17:35:54.369201-08:00 137.226.34.227 192.168.1.105 1420 6
1 2009-11-18 17:35:54.369434-08:00 137.226.34.227 192.168.1.105 1420 6
2 2009-11-18 17:35:54.369655-08:00 192.168.1.105 137.226.34.227 40 6
3 2009-11-18 17:35:54.369702-08:00 137.226.34.227 192.168.1.105 1420 6
4 2009-11-18 17:35:54.369713-08:00 137.226.34.227 192.168.1.105 1420 6
5 2009-11-18 17:35:54.369723-08:00 137.226.34.227 192.168.1.105 1332 6
6 2009-11-18 17:35:54.369733-08:00 137.226.34.227 192.168.1.105 1420 6
7 2009-11-18 17:35:54.369896-08:00 192.168.1.105 137.226.34.227 40 6
8 2009-11-18 17:35:54.369947-08:00 137.226.34.227 192.168.1.105 1420 6
9 2009-11-18 17:35:54.369986-08:00 137.226.34.227 192.168.1.105 1420 6

10 rows × 5 columns

Examine some characteristics of the data:

In [46]:
pdf['ip.proto'].unique()
Out[46]:
array([  6.,  17.])
In [47]:
tcpdf = pdf[pdf['ip.proto'] == 6]
len(tcpdf)
Out[47]:
5527

Unique source IPs, or dest IPs:

In [48]:
pdf['ip.src'].unique()
Out[48]:
array(['137.226.34.227', '192.168.1.105', '192.168.1.1'], dtype=object)
In [49]:
pdf['ip.dst'].unique()
Out[49]:
array(['192.168.1.105', '137.226.34.227', '224.0.0.1'], dtype=object)

Examine the frame.time_epoch column:

In [50]:
s = pdf['frame.time_epoch']
In [51]:
s.describe()
Out[51]:
count                                 5529
unique                                5529
top       2009-11-18 17:36:49.131404-08:00
freq                                     1
Name: frame.time_epoch, dtype: object

SteelScript: Resampling

Pandas supports a number of functions specifically designed for time-series data. Resampling is particularly useful.

The current data set pdf contains one row per packet received over one minute. Let's compute the data rate in bits/sec at 1-second granularity.

Step 1 - Time Index

Resampling requires that the DataFrame have a datetime column as its index. The pcap.query() call above automatically converted the frame.time_epoch column into datetimes; now set it as the index:

In [52]:
print "frame.time_epoch:", pdf['frame.time_epoch'].dtype
pdf_indexed = pdf.set_index('frame.time_epoch')
frame.time_epoch: object

Step 2 - Resample

Resample at 1-second granularity, summing all ip.len values.

In [53]:
pdf_1sec = pdf_indexed.resample('1s', {'ip.len': 'sum'})
In [54]:
pdf_1sec.plot()
Out[54]:
<matplotlib.axes.AxesSubplot at 0x119728550>

Step 3 - Compute bps

In [55]:
pdf_1sec['bps'] = pdf_1sec['ip.len'] * 8
In [56]:
pdf_1sec.plot(y='bps')
Out[56]:
<matplotlib.axes.AxesSubplot at 0x119772f50>

Defining Helper Functions

In [57]:
from steelscript.common.timeutils import parse_timedelta, timedelta_total_seconds

def query(pcap, starttime, duration):
    """Run a query to collect frame time and ip.len, filter for IP."""
    _df = pcap.query(['frame.time_epoch', 'ip.len'],
                     starttime=starttime, duration=duration,
                     as_dataframe=True)
    _df = _df[~(_df['ip.len'].isnull())]
    return _df
In [58]:
def plot_bps(_df, start, duration, resolution):
    """Plot bps for a dataframe over the given range and resolution."""
    # Filter the df to the requested time range
    end = start + parse_timedelta(duration)
    _df = _df[((_df['frame.time_epoch'] >= start) &
               (_df['frame.time_epoch'] < end))]

    # set the index
    _df = _df.set_index('frame.time_epoch')

    # convert a string resolution like '10s' into numeric seconds    
    resolution = (timedelta_total_seconds(parse_timedelta(resolution)))

    # Resample
    _df = _df.resample('%ds' % resolution, {'ip.len': 'sum'})

    # Compute BPS
    _df['bps'] = _df['ip.len'] * 8 / float(resolution)

    # Plot the result
    _df.plot(y='bps')
In [59]:
%time pdf = query(pcap, starttime=pcap.starttime, duration='6h')
print len(pdf), "packets"
CPU times: user 8.7 s, sys: 3.06 s, total: 11.8 s
Wall time: 31.6 s
272901 packets

In [60]:
%time plot_bps(pdf, pcap.starttime, '6h', '15m')
CPU times: user 796 ms, sys: 9.4 ms, total: 805 ms
Wall time: 804 ms

In [61]:
%time plot_bps(pdf, pcap.starttime, '6h', '1m')
CPU times: user 815 ms, sys: 4.82 ms, total: 820 ms
Wall time: 818 ms

In [62]:
%time plot_bps(pdf, pcap.starttime, '1h', '1m')
CPU times: user 776 ms, sys: 4.8 ms, total: 781 ms
Wall time: 779 ms

In [63]:
%time plot_bps(pdf, pcap.starttime, '1h', '1s')
CPU times: user 955 ms, sys: 5.42 ms, total: 961 ms
Wall time: 959 ms

In [64]:
import datetime
from dateutil.parser import parse
import steelscript.wireshark.core.pcap
reload(steelscript.wireshark.core.pcap)
from steelscript.wireshark.core.pcap import *

SteelScript: Computing Client/Server Metrics

Now let's take a more complex example of rearranging the data.

  • Incoming data is unidirectional, keyed by src/dst
  • Determine cli/srv based on lower port number
  • Compute server-to-client (s2c) and client-to-server (c2s) bytes
  • Rollup aggregate metrics
  • Graph top 3 conversations

First, let's find the right field for port number. The TSharkFields class supports a find() method to look for fields by protocol, name, or description.

In [65]:
from steelscript.wireshark.core.pcap import TSharkFields, PcapFile

pcap = PcapFile('/ws/traces/net-2009-11-18-17_35.pcap')

tf = TSharkFields.instance()
tf.find(protocol='tcp', name_re='port')
Out[65]:
[<TSharkField tcp.port, FT_UINT16>,
 <TSharkField tcp.options.rvbd.trpy.src.port, FT_UINT16>,
 <TSharkField tcp.analysis.reused_ports, FT_NONE>,
 <TSharkField tcp.options.rvbd.trpy.client.port, FT_UINT16>,
 <TSharkField tcp.dstport, FT_UINT16>,
 <TSharkField tcp.srcport, FT_UINT16>,
 <TSharkField tcp.options.mptcp.port, FT_UINT16>,
 <TSharkField tcp.options.rvbd.probe.proxy.port, FT_UINT16>,
 <TSharkField tcp.options.rvbd.trpy.dst.port, FT_UINT16>]

Query the PCAP file for the necessary raw data, then compute cli/srv/c2s/s2c:

In [66]:
%%time
pdf = pcap.query(['frame.time_epoch', 'ip.src', 'ip.dst', 'ip.len',
                  'tcp.srcport', 'tcp.dstport'],
                 starttime=pcap.starttime, duration='1m',
                 as_dataframe=True)
CPU times: user 14.2 s, sys: 7.88 s, total: 22.1 s
Wall time: 37.4 s

In [67]:
%%time
# Limit to TCP Traffic
istcp = ~(pdf['tcp.srcport'].isnull())
pdf = pdf[istcp]

# Assume lower port is the client
srccli = pdf['tcp.srcport'] > pdf['tcp.dstport']
CPU times: user 48.7 ms, sys: 11.2 ms, total: 59.8 ms
Wall time: 58.6 ms

In [68]:
%%time
# Initialize columns assuming server->client 
pdf['ip.cli'] = pdf['ip.dst']
pdf['ip.srv'] = pdf['ip.src']
pdf['tcp.srvport'] = pdf['tcp.srcport']
pdf['c2s'] = 0
pdf['s2c'] = pdf['ip.len']
CPU times: user 29.3 ms, sys: 4.29 ms, total: 33.5 ms
Wall time: 32.2 ms

In [69]:
%%time
# Then override for client->server
pdf.ix[srccli, 'ip.cli']      = pdf.ix[srccli, 'ip.src']
pdf.ix[srccli, 'ip.srv']      = pdf.ix[srccli, 'ip.dst']
pdf.ix[srccli, 'tcp.srvport'] = pdf['tcp.dstport']
pdf.ix[srccli, 'c2s']         = pdf.ix[srccli, 'ip.len']
pdf.ix[srccli, 's2c']         = 0
CPU times: user 282 ms, sys: 31.3 ms, total: 314 ms
Wall time: 312 ms

Strip away the src/dst columns in favor of the cli/srv columns

In [70]:
pdf = pdf.ix[:, ['frame.time_epoch', 'ip.cli', 'ip.srv', 'tcp.srvport',
                 'ip.len', 'c2s', 's2c']]
pdf[:5]
Out[70]:
frame.time_epoch ip.cli ip.srv tcp.srvport ip.len c2s s2c
0 2009-11-18 17:35:54.369201-08:00 192.168.1.105 137.226.34.227 80 1420 0 1420
1 2009-11-18 17:35:54.369434-08:00 192.168.1.105 137.226.34.227 80 1420 0 1420
2 2009-11-18 17:35:54.369655-08:00 192.168.1.105 137.226.34.227 80 40 40 0
3 2009-11-18 17:35:54.369702-08:00 192.168.1.105 137.226.34.227 80 1420 0 1420
4 2009-11-18 17:35:54.369713-08:00 192.168.1.105 137.226.34.227 80 1420 0 1420

5 rows × 7 columns

Now, we can compute metrics for each unique host-pair:

In [71]:
cs = (pdf.groupby(['ip.cli', 'ip.srv', 'tcp.srvport'])
         .aggregate({'c2s': 'sum',
                     's2c': 'sum',
                     'ip.len': 'sum'}))
In [72]:
cs.sort('ip.len', ascending=False)[:10]
Out[72]:
                                              c2s     ip.len        s2c
ip.cli        ip.srv          tcp.srvport
192.168.1.105 137.226.34.227  80          1888380  127030608  125142228
              65.54.95.7      80           647104   79579509   78932405
              65.54.95.201    80           194283   27386115   27191832
              68.142.123.21   80           165452   14803514   14638062
              65.54.95.209    80            73449    9929531    9856082
              208.111.129.62  80            72190    5422145    5349955
              68.142.123.31   80            64617    5085691    5021074
192.168.1.102 207.171.185.129 80            25447    3322205    3296758
192.168.1.104 87.106.1.47     80            30364    1468823    1438459
192.168.1.105 65.54.95.185    80            12475     948700     936225

10 rows × 3 columns

Pick the top 3 conversations; these will be used below for filtering:

In [73]:
top = cs.sort('ip.len', ascending=False)[:3]
top
Out[73]:
                                             c2s     ip.len        s2c
ip.cli        ip.srv         tcp.srvport
192.168.1.105 137.226.34.227 80          1888380  127030608  125142228
              65.54.95.7     80           647104   79579509   78932405
              65.54.95.201   80           194283   27386115   27191832

3 rows × 3 columns

Now, filter the original DataFrame, choosing only rows in the top 3:

In [74]:
cst = pdf.set_index(['ip.cli', 'ip.srv', 'tcp.srvport'])
In [75]:
cst_top = (cst[cst.index.isin(top.index)]
            .ix[:,['frame.time_epoch', 'ip.len']])
cst_top[:10]
Out[75]:
                                         frame.time_epoch                  ip.len
ip.cli        ip.srv         tcp.srvport
192.168.1.105 137.226.34.227 80          2009-11-18 17:35:54.369201-08:00    1420
                             80          2009-11-18 17:35:54.369434-08:00    1420
                             80          2009-11-18 17:35:54.369655-08:00      40
                             80          2009-11-18 17:35:54.369702-08:00    1420
                             80          2009-11-18 17:35:54.369713-08:00    1420
                             80          2009-11-18 17:35:54.369723-08:00    1332
                             80          2009-11-18 17:35:54.369733-08:00    1420
                             80          2009-11-18 17:35:54.369896-08:00      40
                             80          2009-11-18 17:35:54.369947-08:00    1420
                             80          2009-11-18 17:35:54.369986-08:00    1420

10 rows × 2 columns

Roll up the results into 1-second intervals:

In [76]:
cst_top_time = (cst_top.reset_index()
                       .set_index(['frame.time_epoch', 'ip.cli', 'ip.srv', 'tcp.srvport'])
                       .unstack(['ip.cli', 'ip.srv', 'tcp.srvport'])
                       .resample('1s','sum')
                       .fillna(0))
In [77]:
cst_top_time.plot()
Out[77]:
<matplotlib.axes.AxesSubplot at 0x121281190>

Questions?

Presentation available online:

  • https://support.riverbed.com/apis/steelscript/SharkFest2014.slides.html

SteelScript for Python:

  • https://support.riverbed.com/apis/steelscript/index.html