SharkFest 2014 - Packet Analysis and Visualization with SteelScript¶
This presentation was delivered at SharkFest on June 17, 2014 by
Christopher J. White. The source file
SharkFest2014.ipynb
is an IPython Notebook.
Overview
- Visualizing with SteelScript Application Framework
- Tools in my toolbox
- Python Pandas
- PCAP Analysis with SteelScript
SteelScript Application Framework
PCAP File: /ws/sharkfest2014/oneday.pcap
- ip.len field over time
- 95th and 80th percentiles
- Exponential Weighted Moving Average (EWMA)
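These overlays are straightforward to compute with pandas. A minimal sketch, assuming a hypothetical df_1sec DataFrame of per-second ip.len totals (using the 2014-era pandas.ewma API):
import pandas
p95 = df_1sec['ip.len'].quantile(0.95)              # 95th percentile level
p80 = df_1sec['ip.len'].quantile(0.80)              # 80th percentile level
smoothed = pandas.ewma(df_1sec['ip.len'], span=60)  # EWMA trend line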
Tools: IPython
- Powerful interactive shells (terminal and Qt-based).
- A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
- Support for interactive data visualization and use of GUI toolkits.
- Flexible, embeddable interpreters to load into your own projects.
- Easy to use, high performance tools for parallel computing.
Installation
> pip install ipython
Tools: pandas - Python Data Analysis Library
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- Series - array of data with an optional index
- DataFrame - 2D array of data with hierarchical row and column indexing
Installation
Linux / Mac with developer tools installed:
> pip install pandas
Otherwise see pandas.pydata.org
Tools: matplotlib - Python Plotting
matplotlib hooks into the IPython notebook to provide in-browser graphs.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 2)
y = np.sin(4 * np.pi * x) * np.exp(-5 * x)
plt.fill(x, y, 'r')
plt.grid(True)
plt.show()
Tools: SteelScript Application Framework
- Web front end for simple user interface
- Design custom reports
- mix and match widgets and data
- define custom criteria
- Custom analysis via Python hooks and Python Pandas
- compute statistics, pivot tables, merge, sort, resample timeseries
- Plugin architecture makes it easy to share modules
pandas
Primary object types:
- Series - 1-dimensional array of elements
- DataFrames - 2-dimensional array of elements
Data is stored in compact binary form for efficiency of operation.
Supports both row and column indexing by name.
pandas: Series
A Series is similar to a standard Python list, but much more efficient in memory and computation.
import pandas, numpy
s = pandas.Series([10, 23, 19, 15, 56, 15, 41])
print type(s)
s
s.sum(), s.min(), s.max(), s.mean()
Consider processing 1,000,000 random entries in a standard list:
%%time
s = list(numpy.random.randn(1000000))
%%time
print min(s), max(s), sum(s)
Now, consider processing 1,000,000 random entries in a pandas Series:
%%time
s = pandas.Series(numpy.random.randn(1000000))
len(s)
%%time
print s.min(), s.max(), s.sum()
del s
pandas: DataFrame
A DataFrame is typically loaded from a file, but may be created from a list of lists:
import pandas
df = pandas.DataFrame(
[['Boston', '10.1.1.1', 10, 2356, 0.100],
['Boston', '10.1.1.2', 23, 16600, 0.112],
['Boston', '10.1.1.15', 19, 22600, 0.085],
['SanFran', '10.38.5.1', 15, 10550, 0.030],
['SanFran', '10.38.8.2', 56, 35000, 0.020],
['London', '192.168.4.6', 15, 3400, 0.130],
['London', '192.168.5.72', 41, 55000, 0.120]],
columns = ['location', 'ip', 'pkts', 'bytes', 'rtt'])
Each column's data type is automatically detected based on contents.
df.dtypes
IPython has special support for displaying DataFrames:
df
pandas: DataFrame operations
The DataFrame supports a number of operations directly:
df.mean()
Notice that location and ip are string columns, so there is no 'mean' operation for those columns.
df.sum()
In the case of sum(), it is possible to add strings, so location and ip are included in the results. (Be careful: on large data sets, summing a string column concatenates every value and can be very slow!)
pandas: Selecting a Column
Each column of data in a DataFrame can be extracted by the standard indexing operator:
df[<colname>]
The result is a Series.
df['location']
df['pkts']
pandas: Selecting Rows
Selecting a subset of rows by index works just like array slicing:
df[<start>:<end>]
Note that the : is always required, but <start> and <end> are optional. This returns all rows with an index greater than or equal to <start>, up to but not including <end>.
df[:2]
df[2:4]
pandas: Filtering Rows
DataFrame rows can be filtered using boolean expressions
df[<boolean_expression>]
df[df['location'] == 'Boston']
df[df['pkts'] < 20]
Expressions can be combined to provide more complex filtering:
df[(df['location'] == 'Boston') & (df['pkts'] < 20)]
The boolean expression is actually a Series of True/False values of the same length as the number of rows, so it can be assigned to a variable and used like any other Series:
bos = df['location'] == 'Boston'
print type(bos)
bos
pkts_lt_20 = (df['pkts'] < 20)
df[bos & pkts_lt_20]
Use single "&
" and "|
" and parenthesis for constructing arbitrary boolean expressions. Use "~
" for negation.
The result of a filtering operation is a new DataFrame object:
df2 = df[bos & pkts_lt_20]
print 'Length:', len(df2)
df2
pandas: Selecting Multiple Columns
A new DataFrame can be constructed from a subset of another DataFrame using the ix[] operator:
DataFrame.ix[<row_expr>, <column_list>]
locrtt = df.ix[:,['location', 'rtt']]
locrtt
The first argument to the ix[] indexer is actually expected to be a boolean Series. This makes it possible to select rows and columns in a single operation. The use of ":" above is shorthand for all rows.
boslocrtt = df.ix[bos,['location', 'rtt']]
boslocrtt
pandas: Adding Columns
New columns (Series) can be added to a DataFrame in one of three ways:
- A constant value for all rows
- Supplying a list matching the number of rows
- An expression based on constants and Series objects of the same size
df['co'] = 'RVBD'
df['proto'] = ['tcp', 'tcp', 'udp', 'udp', 'tcp', 'tcp', 'udp']
df['Bpp'] = df['bytes'] / df['pkts'] # Bytes per packet
df['bpp'] = 8 * df['bytes'] / df['pkts']
df['slow'] = df['rtt'] >= 0.1
df.ix[:,['ip', 'proto', 'Bpp', 'bpp', 'slow', 'co']]
pandas: Grouping and Aggregating
A common operation is to group data by a key column or a set of key columns. This groups rows that share the same value for all key columns, producing one row per unique key combination:
df.groupby(<key_columns>)
For example, to group by the location column:
gb = df.groupby('location')
print type(gb)
gb.indices
The result of a grouping operation is a GroupBy object. The indices attribute above shows the row numbers for the grouped data. The data of a GroupBy object cannot be inspected directly until it is aggregated.
The groupby() operation is usually followed by an aggregate() call to combine the values of all related rows in each column.
In one form, the aggregate() call takes a dictionary indicating the operation to perform on each column to combine values:
GroupBy.aggregate({<colname>: <operation>,
<colname>: <operation>,
... })
Only columns listed in the dictionary will be returned in the resulting DataFrame.
gb.aggregate({'pkts': 'sum',
'rtt': 'mean'})
By copying columns, it is possible to compute alternate operations on the same data, producing different aggregated results:
df2 = df.ix[:,['location', 'pkts', 'rtt']]
df2['peak_rtt'] = df['rtt']
df2['min_rtt'] = df['rtt']
df2
agg = (df2.groupby('location')
.aggregate({'pkts': 'sum',
'rtt': 'mean',
'peak_rtt': 'max',
'min_rtt': 'min'}))
agg
pandas: Indexing
Notice that the output of the previous groupby/aggregate operation looks a bit different from earlier DataFrames; the first column is shown in bold:
agg
The bold rows/columns indicate that the DataFrame is indexed.
Let's look at the DataFrame without the index by calling reset_index():
agg.reset_index()
Indexing a DataFrame turns the data in the indexed column into metadata associated with the object. This index is carried over to Series objects extracted from the DataFrame:
agg['pkts']
The resulting Series still acts like an array, but may be indexed by either a numeric row number or a location name:
print "Boston:", agg['pkts']['Boston']
print "Item 0:", agg['pkts'][0]
# Drop the columns added above (keeping 'slow') before the next section
df = df.ix[:,['location', 'ip', 'pkts', 'bytes', 'rtt', 'slow']]
pandas: Modifying a Subset of Rows
Often it is useful to assign a value to a subset of rows in a column. This is possible by assigning to the result of the ix[] indexer:
DataFrame.ix[<row_expr>, <column_list>] = <new_value>
When used in this form, the indexed DataFrame object is modified in place.
Let's use this method to assign a value of 'slow', 'normal', or 'fast' to a new 'delay' column based on rtt:
df['delay'] = ''
df
Compute boolean Series for each range of rtt:
slow_rows = (df['rtt'] > 0.110)
normal_rows = ((df['rtt'] > 0.050) & (~slow_rows))
fast_rows = ((~slow_rows) & (~normal_rows))
df.ix[slow_rows,'delay'] = 'slow'
df.ix[normal_rows, 'delay'] = 'normal'
df.ix[fast_rows, 'delay'] = 'fast'
df
pandas: Unstacking Data
Unstacking data is about pivoting data based on row values. For example, let's say we want to compute the total bytes for each delay category by location:
          slow  normal  fast
Boston       ?       ?     ?
SanFran      ?       ?     ?
London       ?       ?     ?
where the ? in each cell is the total bytes for that combination.
First we group by both location and delay, aggregating the bytes column:
loc_delay_bytes = (df.groupby(['location', 'delay'])
.aggregate({'bytes': 'sum'}))
loc_delay_bytes
The resulting DataFrame has two index columns, location and delay. Now the unstack() function is used to create a column for each unique value of delay:
loc_delay_bytes = loc_delay_bytes.unstack('delay')
Let's fill in zeros for the combinations that were not present:
loc_delay_bytes = loc_delay_bytes.fillna(0)
loc_delay_bytes
SteelScript for Python
SteelScript is a collection of Python packages that provide libraries for collecting, processing, and visualizing data from a variety of sources. Most packages focus on network performance.
Getting SteelScript and the Wireshark extension
Core SteelScript is available on PyPI:
> pip install steelscript
> pip install steelscript.wireshark
The bleeding edge is available on github:
- https://github.com/riverbed/steelscript
- https://github.com/riverbed/steelscript-wireshark
SteelScript: Reading Packet Capture Files
from steelscript.wireshark.core.pcap import PcapFile
pcap = PcapFile('/ws/traces/net-2009-11-18-17_35.pcap')
pcap.info()
print pcap.starttime
print pcap.endtime
print pcap.numpackets
Perform a query on the PCAP file
pdf = pcap.query(['frame.time_epoch', 'ip.src', 'ip.dst', 'ip.len', 'ip.proto'],
starttime = pcap.starttime,
duration='1min',
as_dataframe=True)
pdf = pdf[~(pdf['ip.len'].isnull())]
print len(pdf), "packets loaded"
pdf[:10]
Examine some characteristics of the data:
pdf['ip.proto'].unique()
tcpdf = pdf[pdf['ip.proto'] == 6]
len(tcpdf)
Unique source IPs, or dest IPs:
pdf['ip.src'].unique()
pdf['ip.dst'].unique()
Examine the frame.time_epoch column:
s = pdf['frame.time_epoch']
s.describe()
SteelScript: Resampling
Pandas supports a number of functions specifically designed for time-series data. Resampling is particularly useful.
The current data set pdf contains one row per packet received over one minute. Let's compute the data rate in bits/sec at 1-second granularity.
Step 1 - Time Index
Resampling requires that the DataFrame have a datetime column as its index. The pcap.query() call above automatically converted the frame.time_epoch column to datetime values; now set it as the index:
print "frame.time_epoch:", pdf['frame.time_epoch'].dtype
pdf_indexed = pdf.set_index('frame.time_epoch')
Step 2 - Resample
Resample at 1-second granularity, summing all ip.len values:
pdf_1sec = pdf_indexed.resample('1s', {'ip.len': 'sum'})
pdf_1sec.plot()
Step 3 - Compute bps
Each bin spans one second, so multiplying the total ip.len bytes per bin by 8 yields bits per second:
pdf_1sec['bps'] = pdf_1sec['ip.len'] * 8
pdf_1sec.plot(y='bps')
Defining Helper Functions
from steelscript.common.timeutils import parse_timedelta, timedelta_total_seconds
def query(pcap, starttime, duration):
    """Run a query to collect frame time and ip.len, filter for IP."""
    _df = pcap.query(['frame.time_epoch', 'ip.len'],
                     starttime=starttime, duration=duration,
                     as_dataframe=True)
    _df = _df[~(_df['ip.len'].isnull())]
    return _df
def plot_bps(_df, start, duration, resolution):
    """Plot bps for a dataframe over the given range and resolution."""
    # Filter the df to the requested time range
    end = start + parse_timedelta(duration)
    _df = _df[((_df['frame.time_epoch'] >= start) &
               (_df['frame.time_epoch'] < end))]
    # Set the index
    _df = _df.set_index('frame.time_epoch')
    # Convert a string resolution like '10s' into numeric seconds
    resolution = timedelta_total_seconds(parse_timedelta(resolution))
    # Resample
    _df = _df.resample('%ds' % resolution, {'ip.len': 'sum'})
    # Compute bps: bytes per bin times 8, divided by the bin width in seconds
    _df['bps'] = _df['ip.len'] * 8 / float(resolution)
    # Plot the result
    _df.plot(y='bps')
%time pdf = query(pcap, starttime=pcap.starttime, duration='6h')
print len(pdf), "packets"
%time plot_bps(pdf, pcap.starttime, '6h', '15m')
%time plot_bps(pdf, pcap.starttime, '6h', '1m')
%time plot_bps(pdf, pcap.starttime, '1h', '1m')
%time plot_bps(pdf, pcap.starttime, '1h', '1s')
SteelScript: Computing Client/Server Metrics
Now let's take a more complex example of rearranging the data.
- Incoming data is unidirectional, based on src/dst
- Determine cli/srv based on the lower port number
- Compute server-to-client (s2c) and client-to-server (c2s) bytes
- Rollup aggregate metrics
- Graph top 3 conversations
First, let's find the right field for port number. The TSharkFields class supports a find() method to look for fields by protocol, name, or description.
from steelscript.wireshark.core.pcap import TSharkFields, PcapFile
pcap = PcapFile('/ws/traces/net-2009-11-18-17_35.pcap')
tf = TSharkFields.instance()
tf.find(protocol='tcp', name_re='port')
Query the PCAP file for the necessary raw data, compute cli/srv/c2s/s2c:
%%time
pdf = pcap.query(['frame.time_epoch', 'ip.src', 'ip.dst', 'ip.len',
'tcp.srcport', 'tcp.dstport'],
starttime=pcap.starttime, duration='1m',
as_dataframe=True)
%%time
# Limit to TCP Traffic
istcp = ~(pdf['tcp.srcport'].isnull())
pdf = pdf[istcp]
# Assume the lower port is the server, so the source is the client when srcport is higher
srccli = pdf['tcp.srcport'] > pdf['tcp.dstport']
%%time
# Initialize columns assuming server->client
pdf['ip.cli'] = pdf['ip.dst']
pdf['ip.srv'] = pdf['ip.src']
pdf['tcp.srvport'] = pdf['tcp.srcport']
pdf['c2s'] = 0
pdf['s2c'] = pdf['ip.len']
%%time
# Then override for client->server
pdf.ix[srccli, 'ip.cli'] = pdf.ix[srccli, 'ip.src']
pdf.ix[srccli, 'ip.srv'] = pdf.ix[srccli, 'ip.dst']
pdf.ix[srccli, 'tcp.srvport'] = pdf.ix[srccli, 'tcp.dstport']
pdf.ix[srccli, 'c2s'] = pdf.ix[srccli, 'ip.len']
pdf.ix[srccli, 's2c'] = 0
Strip away the src/dst columns in favor of the cli/srv columns:
pdf = pdf.ix[:, ['frame.time_epoch', 'ip.cli', 'ip.srv', 'tcp.srvport',
'ip.len', 'c2s', 's2c']]
pdf[:5]
Now, we can compute metrics for each unique host-pair:
cs = (pdf.groupby(['ip.cli', 'ip.srv', 'tcp.srvport'])
.aggregate({'c2s': 'sum',
's2c': 'sum',
'ip.len': 'sum'}))
cs.sort('ip.len', ascending=False)[:10]
Pick the top 3 conversations; these will be used later for filtering:
top = cs.sort('ip.len', ascending=False)[:3]
top
Now, filter the original DataFrame, choosing only rows in the top 3:
cst = pdf.set_index(['ip.cli', 'ip.srv', 'tcp.srvport'])
cst_top = (cst[cst.index.isin(top.index)]
.ix[:,['frame.time_epoch', 'ip.len']])
cst_top[:10]
Rollup the results into 1-second intervals:
cst_top_time = (cst_top.reset_index()
.set_index(['frame.time_epoch', 'ip.cli', 'ip.srv', 'tcp.srvport'])
.unstack(['ip.cli', 'ip.srv', 'tcp.srvport'])
.resample('1s','sum')
.fillna(0))
cst_top_time.plot()
Questions?
Presentation available online:
- https://support.riverbed.com/apis/steelscript/SharkFest2014.slides.html
SteelScript for Python:
- https://support.riverbed.com/apis/steelscript/index.html