Workbench is a client/server architecture. The ‘scalability’ of the architecture is determined by the put/get performance of the data storage backend (currently MongoDB), so the Workbench framework focuses on bringing the work to the data: all of the heavy lifting happens on the server side, with workers streaming over the data. Important: no data is copied or moved. A sample is pulled from the data store once, every worker in the current worker chain operates on that sample, and afterward the sample is released from memory.
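The single-pull, worker-chain pattern described above can be sketched with plain Python generators. This is an illustrative sketch, not Workbench's actual API; the worker names and sample fields are made up:

```python
import hashlib

def pull_samples(data_store):
    """Pull each sample from the data store exactly once."""
    for sample in data_store:
        yield sample  # no copies; the same dict flows through the chain

def size_worker(samples):
    """First worker in the chain: annotates each sample in place."""
    for sample in samples:
        sample['size'] = len(sample['raw_bytes'])
        yield sample

def hash_worker(samples):
    """Second worker: operates on the very same sample object."""
    for sample in samples:
        sample['md5'] = hashlib.md5(sample['raw_bytes']).hexdigest()
        yield sample

# Hypothetical in-memory 'data store' with two samples
data_store = [{'raw_bytes': b'MZ\x90\x00'}, {'raw_bytes': b'%PDF-1.4'}]

# Chain the workers; each sample streams through once and is then released
chain = hash_worker(size_worker(pull_samples(data_store)))
results = [(s['size'], s['md5']) for s in chain]
```

Because each stage is a generator, only one sample is in flight at a time, which is what keeps memory flat no matter how large the datastore gets.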
Workbench can scale up with the datastore, but during development and testing we use it on ‘medium’ data. The developers of Workbench feel that medium data is a sweet spot: large enough to be meaningful for model generation, statistics, and predictive performance, yet small enough to allow low latency, fast interaction, and streaming ‘hyperslabs’ from server to client.
Many of our examples (notebooks) illustrate the streaming generator chains that allow a client (Python script, IPython notebook, Node.js, CLI) to efficiently stream a subset of data from the server to the client.
Once you populate a Pandas DataFrame efficiently (streaming with zero-copy), you have access to a very large set of statistics, analysis, and machine learning Python modules (statsmodels, Pandas, scikit-learn).
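For instance, a streamed subset of rows can land directly in a DataFrame. This is a minimal sketch; the generator and field names are hypothetical stand-ins for a real client-side streaming chain:

```python
import pandas as pd

def stream_rows():
    """Stand-in for a generator chain streaming rows from the server."""
    for i in range(5):
        yield {'sample_id': i, 'size': 100 + i, 'type': 'pe'}

# from_records consumes the generator row by row, no intermediate copy
df = pd.DataFrame.from_records(stream_rows())
print(df['size'].mean())  # the full Pandas/statsmodels toolbox is now available
```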
The Workbench server runs great on a laptop, but when you’re working with a group of researchers the most effective model is a shared group server. A beefy Dell server with 192 GB of memory and a 100 TB disk array will allow the Workbench server to effectively process on the order of a million samples (PE files, PDFs, PCAPs, SWFs, etc.).
As you’ve noticed from many of the documents and notebooks, Workbench often defaults to using a local server. There are several reasons for this approach:
All clients have a -s, --server argument:
$ python pcap_bro_indexer.py              # Hit local server
$ python pcap_bro_indexer.py -s my_server # Hit remote server
If you always hit a remote server, simply change the config.ini in the clients directory to point to the group server:
server_uri = localhost  ; change this to your group server
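A client only needs to read server_uri back out of config.ini; here is a minimal sketch using the standard library (the [workbench] section name is an assumption for illustration, check the real config.ini in your checkout for the actual layout):

```python
import configparser

# Example config.ini contents; the section name here is a guess
config_text = """
[workbench]
server_uri = my_group_server
"""

config = configparser.ConfigParser()
config.read_string(config_text)

# Every client that reads this file now points at the group server
server_uri = config.get('workbench', 'server_uri')
print(server_uri)
```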
In general, Workbench should be treated like any other Python module, and it shouldn’t add any complexity to existing development/QA/deployment models. One suggestion (to be taken with a grain of salt) is simply to use git branches:
$ git checkout develop   # on develop server
$ git checkout master    # on prod server
The Workbench project takes the workbench metaphor seriously. It’s a platform that allows you to do work; it provides a flat work surface that supports your ability to combine tools (Python modules) together. In general, a workbench never constrains you (“oh no! you can’t use those 3 tools together!”); on the flip side, it doesn’t hold your hand either. Using the Workbench software is a bit like using a Lego set: you can put the pieces together however you want, and adding your own pieces is super easy.