Faust (https://faust.readthedocs.io/en/1.0/index.html) is a Python library for stream data processing. It allows you to asynchronously process data and pass the results on to other channels and topics. I've recently been experimenting with Faust as a replacement for Celery (+Redis), since Kafka has the advantage of guaranteed message ordering and delivery for each consumer that processes your data stream. This is especially important for use cases where you want to be sure all messages are processed: fraud prevention, phishing detection, risk scoring, etc.
If you've looked into phishing detection, chances are you've also seen the CertStream project @ https://certstream.calidog.io. Certstream is an open-source tool and free-to-use service that shows newly requested certificates in (near) real time. Certstream monitors the Certificate Transparency logs and sends the latest changes over a WebSocket to whoever connects to it.
I've been using Certstream for a while now to determine whether a domain/certificate will be used for phishing. I would open a socket with Certstream, push the certificates to a Kafka topic and perform processing afterwards. Unfortunately, when a message is lost during my Certstream session I'm unable to 'replay' the missed certificates. Messages can get lost due to network disconnects, processing delays or exceptions occurring in your code.
For experimentation purposes I attempted to build a similar certificate monitoring tool that guarantees certificate processing, using Faust and Kafka to keep track of the downloaded certificate counts and processing state. You can find my example Faust project @ https://github.com/d3vzer0/faust-transparency. I'm a Faust newbie, but getting started with the framework was fairly trivial. The creators of Faust did a great job creating an easy-to-understand abstraction layer for streams that make use of Kafka.
When running my Faust sample project, an agent is executed every 10 seconds to retrieve the up-to-date transparency sources from Google's CT log list. These sources are pushed into the ct-sources topic. The get_tree_size agent consumes all messages from this topic. When a new source message is seen, the agent requests the tree size from the individual transparency source. In short, the tree size is the total count of certificates available in that source's transparency log.
Using a Faust table (https://faust.readthedocs.io/en/latest/userguide/tables.html) I keep track of the total tree size per individual source. When the tree size differs from the last run, I send the corresponding source to the ct-treesize-changed topic.
The proc_sources agent processes all sources that have updated entries in their transparency log. The agent performs an API call to the CT source and downloads all new entries. The entries are base64-encoded and two fields are available:
- leaf_input: <base64 string>
- extra_data: <base64 string>
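Decoding those fields is straightforward with the standard library. In this sketch the entry is dummy data standing in for a real get-entries response:

```python
import base64

# Fake entry: a real one comes from a CT log's get-entries endpoint
entry = {
    'leaf_input': base64.b64encode(b'\x00\x00leaf-bytes').decode(),
    'extra_data': base64.b64encode(b'chain-bytes').decode(),
}

# Both fields decode to raw binary structures, not readable text
leaf = base64.b64decode(entry['leaf_input'])
extra = base64.b64decode(entry['extra_data'])
# The decoded bytes still need binary parsing (MerkleTreeLeaf / cert chain)
```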
The decoded data consists of binary fields with the certificate chain included. Luckily the inventors of Certstream wrote a Medium article (https://medium.com/cali-dog-security/parsing-certificate-transparency-lists-like-a-boss-981716dc506) with Python snippets that can be used to properly decode the fields and extract the relevant certificates.
Once the certificates have been decoded they are pushed to the ct-certs topic. Simultaneously, the state and count of each individual source are updated. This makes sure no duplicate certificates are downloaded, and when your process crashes, it continues where it left off without missing a certificate. At least, that's the idea ;) I still need to fine-tune my experimental Faust script.
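The resume logic boils down to comparing a stored per-source counter against the latest tree size. A pure-Python sketch (the function name and batch size are my own; the start/end indices map onto CT's get-entries parameters):

```python
def pending_range(last_processed: int, tree_size: int, batch: int = 256):
    """Return (start, end) for the next get-entries call, or None if caught up.

    last_processed is the number of entries already handled, so entries
    0..last_processed-1 are done and we resume at index last_processed.
    """
    if last_processed >= tree_size:
        return None  # nothing new since the last run
    start = last_processed
    end = min(last_processed + batch, tree_size) - 1  # end index is inclusive
    return start, end

print(pending_range(100, 150))  # (100, 149)
print(pending_range(150, 150))  # None
```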
The faust-transparency repo also includes APIs for PassiveTotal (RiskIQ) and Anomali (ThreatStream). The main advantage of using Faust is that you can directly ingest and forward your transparency messages to other relevant topics to perform asynchronous enrichment as well. This gives you the ability to perform real-time risk scoring for the certificates (and domains) that you process.
What comes afterwards is entirely up to you :) You could perform regex pattern matching or calculate the Levenshtein distance against common keywords to detect phishing campaigns. Hopefully this gets you started with Faust for other use cases. Be sure to check out their work and start experimenting.
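As a taste of that last idea, here's a self-contained Levenshtein heuristic. The keyword list, threshold and domain are made up for the example; in practice you'd use a tuned keyword set and a library implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Wagner-Fischer edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

# Flag domain labels within edit distance 1 of a brand keyword
keywords = ['paypal', 'google', 'microsoft']
domain = 'paypa1-login'
label = domain.split('-')[0]
hits = [kw for kw in keywords if levenshtein(label, kw) <= 1]
print(hits)  # ['paypal']
```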
Below are some cool projects that can help you out: