The developers of Workbench feel that Medium-Data is a sweet spot: large enough to be meaningful for model generation, statistics, and predictive performance, but small enough to allow low latency, fast interaction, and streaming ‘hyperslabs’ from server to client.
Many of our examples (notebooks) illustrate the streaming generator chains that let a client (Python script, IPython notebook, Node.js, CLI) pull a filtered subset of the data from the server.
Once you efficiently (streaming with zero-copy) populate a Pandas DataFrame, you have access to a very large set of statistics, analysis, and machine learning Python modules (statsmodels, Pandas, Scikit-Learn).
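As a rough illustration of what a streaming generator chain looks like, here is a minimal sketch in plain Python. It does not use the Workbench client API: the function names (read_records, filter_records, select_fields) and the sample records are hypothetical, and a list of JSON lines stands in for the server stream.

```python
import json

def read_records(lines):
    """Yield one parsed record per JSON line (stand-in for a server stream)."""
    for line in lines:
        yield json.loads(line)

def filter_records(records, predicate):
    """Pass through only the records that satisfy the predicate."""
    return (rec for rec in records if predicate(rec))

def select_fields(records, fields):
    """Keep only the requested fields from each record."""
    for rec in records:
        yield {name: rec.get(name) for name in fields}

# Hypothetical sample data; a real client would iterate over a server response.
raw_lines = [
    '{"md5": "a1", "type": "pe", "size": 1024}',
    '{"md5": "b2", "type": "pdf", "size": 2048}',
    '{"md5": "c3", "type": "pe", "size": 4096}',
]

# Nothing is read or parsed until the chain is consumed.
chain = select_fields(
    filter_records(read_records(raw_lines), lambda rec: rec["type"] == "pe"),
    ["md5", "size"],
)
rows = list(chain)
print(rows)
```

Because each stage is a generator, only one record is in flight at a time until the final list is built. A list of dicts like rows can be handed to pandas.DataFrame(rows) for analysis; this sketch does not attempt the zero-copy transfer mentioned above.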
Workbench server will run great on a laptop, but when you’re working with a group of researchers the most effective model is a shared group server. A beefy Dell server with 192 GB of memory and a 100 TB disk array will allow the Workbench server to effectively process in the neighborhood of a million samples (PE files, PDFs, PCAPs, SWFs, etc.).
As you’ve noticed from many of the documents and notebooks, Workbench often defaults to using a local server. There are several reasons for this approach:
All clients have a -s, --server argument:
$ python pcap_bro_indexer.py # Hit local server
$ python pcap_bro_indexer.py -s my_server # Hit remote server
If you always hit a remote server, simply change the config.ini in the clients directory to point to the group server:
server_uri = localhost (change this to whatever)
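To make the relationship between the config file and the command-line flag concrete, here is a minimal sketch of how a client could combine them, using only the Python standard library. This is not the Workbench client code: the resolve_server helper and the [workbench] section name are assumptions made for illustration, and the config text is inlined rather than read from disk.

```python
import argparse
import configparser

# Hypothetical config contents; the [workbench] section name is an assumption.
CONFIG_TEXT = """
[workbench]
server_uri = localhost
"""

def resolve_server(argv, config_text=CONFIG_TEXT):
    """Return the server to contact: -s/--server if given, else the config default."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    default_server = config.get("workbench", "server_uri")

    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--server", default=default_server,
                        help="hostname of the Workbench server")
    args = parser.parse_args(argv)
    return args.server

print(resolve_server([]))                   # no flag: falls back to config.ini
print(resolve_server(["-s", "my_server"]))  # flag overrides the config default
```

With this arrangement the config file only supplies the default, so an explicit -s on the command line always wins.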
Okay, I’ve changed my config.ini file, and now it shows up when I do a ‘$ git status’. How do I have git ignore it?
git update-index --assume-unchanged workbench/clients/config.ini
git update-index --assume-unchanged workbench/server/config.ini
In general, Workbench should be treated like any other Python module, and it shouldn’t add any complexity to existing development/QA/deployment models. One suggestion (to be taken with a grain of salt) is simply to use git branches:
$ git checkout develop (on develop server)
$ git checkout master (on prod server)