The project combines blockchain data, machine learning, and cloud-managed services into a final product: a web app. Ethereum data is available as a GCP public dataset.
The web app predicts the investor profile (cluster) of an Ethereum address that is entered in the user interface. This is based on three features that are extracted from the data source: Current balance, unique transfers, and unique tokens held.
In finance lingo, an investor profile describes an individual's preferences in investing decisions: risk-averse vs. risk-tolerant, diversity of asset classes and individual assets, investment in growth stocks or value stocks, etc. In this project, it refers to any kind of investing behavior that can be quantified and used to differentiate between Ethereum addresses.
The data is available on BigQuery and ready to use. The challenge lies in modeling the data and feeding it into the ML models. The data is not labeled, which is why the problem is formulated as an unsupervised machine learning task. Since pricing data is both more accessible and more easily monetizable, most crypto machine learning projects focus on predicting prices.
This project, on the other hand, focuses on the behavioral footprints of individual addresses (investors), clustering them based on the available transactional data.
AWS cloud services were used to serve the model to the web UI: AWS SageMaker handles all the machine learning steps and exposes the model as a prediction endpoint. That endpoint is invoked by a Lambda function, which is in turn exposed through an API deployed with API Gateway. This API is what the web UI calls to predict the Ethereum investor profile.
While the data on public blockchains is open, there are not many machine learning applications yet (apart from price prediction) that leverage it. One reason is that the underlying data structures are designed for OLTP workloads, which are difficult to analyze (as opposed to OLAP).
Every transaction executed on a public blockchain leaves a trace that we can analyze. Unlike with other asset classes, we have data on the behavior of individual investors. The aim of the project is to answer the question:
Can we use the principle of BYOD (bring your own data) and cluster an Ethereum investor only with the provided address?
Considering the complexity of the problem, I didn't expect the results to necessarily be insightful. There are currently almost 100 million unique Ethereum addresses, and they are highly diverse. The goal was to build a workable solution, and for that, some assumptions needed to be made.
One core (over)simplifying assumption is that one address equals one investor. This is hard to defend, as the Ethereum ecosystem contains many subgroups such as miners, exchanges, and ICO wallets. Differentiating between those, however, is beyond the scope of this project, which focuses on building a proof-of-concept pipeline.
Given a number of clusters, each model partitions all the sampled Ethereum addresses into separate clusters. The evaluation metric used in this project is the Silhouette score. The coefficient is calculated from the mean intra-cluster distance and the mean nearest-cluster distance for each sample. This is ideal for the use case, as we don't have ground-truth labels available.
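To make the metric concrete, here is a small sketch (on a toy two-cluster dataset, not the project's data) that computes the coefficient s = (b − a) / max(a, b) by hand for one sample and compares it against scikit-learn's `silhouette_samples`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two obvious one-dimensional clusters of three points each.
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

# For sample i: a = mean intra-cluster distance,
#               b = mean distance to the nearest other cluster,
#               s = (b - a) / max(a, b).
i = 0
a = np.mean([abs(X[j, 0] - X[i, 0]) for j in range(6) if labels[j] == 0 and j != i])
b = np.mean([abs(X[j, 0] - X[i, 0]) for j in range(6) if labels[j] == 1])
manual_s = (b - a) / max(a, b)

sklearn_s = silhouette_samples(X, labels)[i]
```

With well-separated clusters like these, both values come out close to 1; averaging `silhouette_samples` over all points gives the Silhouette score used throughout the project.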
A sample of 10,000 Ethereum addresses is extracted. As some of them don't satisfy the conditions in the specified query, about 8,000 remain in the data frame. As the table below shows, the minimum threshold for ether_balance is 1 Ether. This boundary was set arbitrarily, as a lot of Ethereum addresses have a zero balance.
The mean is 138 Ether (about 25k euros as of May 15, 2020), which, compared with the 75th-percentile value, indicates the skewness of the distribution: while the 75th percentile is less than 13 Ether, the average is 138. There are clearly some strong outliers in the sample, such as the largest balance of 300,000 Ether.
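A mean far above the 75th percentile is the classic signature of a heavy-tailed distribution. A quick sketch with synthetic lognormal data (a stand-in, not the actual balance sample) reproduces the same pattern:

```python
import numpy as np

# Synthetic heavy-tailed "balances": a minority of very large values
# pulls the mean far above the 75th percentile, just like ether_balance.
rng = np.random.default_rng(0)
balances = rng.lognormal(mean=0.0, sigma=2.5, size=10_000)

mean_balance = balances.mean()
p75_balance = np.percentile(balances, 75)
```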
The unique_tokens feature is not as dispersed as the Ether balance: the typical value ranges from 2 to 8 unique tokens held.
The unique_transfers feature varies widely. While the mean number of transfers is 26, the largest value is 28,560. The top 25% of addresses generate most of the transfers, as addresses at or below the 75th percentile have no more than 1 transfer.
Below is a scatter matrix for all constructed features. As the visualization makes clear, a power-law-like distribution is present in both the eth_balance and unique_transfers features: the vast majority of the cumulative value of each is generated by a minority of addresses.
Meanwhile, there is not much correlation visible between different features.
Below is a correlation heatmap for all available features. As indicated above, there is little correlation between them. That's a positive sign, as we want different features to explore different areas of the feature space. If there were a strong correlation, we could be explaining the same phenomenon with multiple features.
Algorithms and techniques
The techniques used to predict clusters are all unsupervised machine learning algorithms. The chosen ones are K-means, Gaussian Mixture Models, and Hierarchical clustering.
Since K-means cluster boundaries are always linear, the other two algorithms are included to handle more complicated boundaries. They also open up the possibility of extending the analysis by measuring the probability or uncertainty of cluster assignments. This is not done in the current analysis, but it's a possible extension.
The benchmark model for the project is the K-means algorithm's Silhouette score, against which the more sophisticated algorithms are compared. As a rule of thumb, a Silhouette score of 0.5 is defined as the threshold for the solution to be considered reasonably performant.
Three features are extracted:
- Current Ethereum balance
- Unique tokens that have historically been held
- Unique transfers by the Ethereum address
Quite a few Python modules are used in the project, but these are the core ones for the implementation: the pandas_gbq library was used to query the BigQuery Ethereum data source, google-cloud-bigquery to authenticate with the GCP service account, and scikit-learn for all the models.
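As a sketch, querying the public dataset with pandas_gbq could look like the snippet below. The table name and the 1-Ether filter follow the setup described in this write-up, but the project ID is a placeholder and the call itself needs service-account credentials, so it is shown commented out:

```python
# Hypothetical sampling query against the public Ethereum dataset.
# Balances are stored in wei, so 1 Ether = 10**18 wei.
QUERY = """
SELECT address, eth_balance
FROM `bigquery-public-data.crypto_ethereum.balances`
WHERE eth_balance >= 1e18
LIMIT 10000
"""

# Requires `pip install pandas-gbq` and GCP credentials:
# import pandas_gbq
# df = pandas_gbq.read_gbq(QUERY, project_id="your-gcp-project")
```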
AWS services used in the project:
- S3 for storing the training data, the pre-processing transformer object, and GCP credentials
- SageMaker to leverage notebook instances and connect the project end-to-end
- Lambda function to trigger the prediction script
- API Gateway to create an API that is used in the web UI that the user interacts with
- CloudWatch to log and debug possible issues that were encountered in the process of building the solution
Two tables from the Ethereum public data source are used:
The Awesome BigQuery Views GitHub page was used to construct the SQL queries and to get a better understanding of the Ethereum data model.
Below are some examples of the constructed features for different Ethereum addresses.
After that, scikit-learn's Pipeline functionality is used to preprocess the features. We normalize them to make sure they are useful inputs for the models, using a power transform, standard scaling, and a PCA transformation. PCA might be redundant here, as only three features are mapped to three components, but it can be useful if the feature set is expanded and the dimensionality needs to be reduced relative to the sample size.
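A minimal sketch of such a preprocessing pipeline, with synthetic skewed data standing in for the real feature frame:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for the three heavy-tailed features.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 3))

preprocess = Pipeline([
    ("power", PowerTransformer()),   # tame the heavy tails
    ("scale", StandardScaler()),     # zero mean, unit variance
    ("pca", PCA(n_components=3)),    # near-identity here; useful with more features
])
X_prep = preprocess.fit_transform(X)
```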
With the final set of features, all three clustering alternatives are trained: a K-means model, a GMM, and a Hierarchical clustering model.
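The three alternatives can be trained and compared on the Silhouette score in a few lines; the snippet below uses random stand-in data rather than the preprocessed address features:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))  # stand-in for the preprocessed features

labels = {
    "K-means": KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X),
    "GMM": GaussianMixture(n_components=4, random_state=0).fit_predict(X),
    "Hierarchical": AgglomerativeClustering(n_clusters=4).fit_predict(X),
}
scores = {name: silhouette_score(X, lab) for name, lab in labels.items()}
```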
The visualization below plots the number of instances in each of the 4 clusters.
The chosen number of clusters is 4, selected by trial and error over different values. As the alternative algorithms don't improve the Silhouette coefficient considerably, K-means is used in the deployed model.
After training with the training script, the trained model's metadata is combined with the prediction script to predict the cluster value of individual Ethereum addresses.
The prediction script performs on-the-fly calculation of feature values of the provided Ethereum address. It queries the Ethereum data source and returns the normalized feature values. These are then used as input into the trained model which returns the predicted cluster value.
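A sketch of that prediction step: a hypothetical `predict_cluster` helper normalizes the three raw feature values and returns the cluster. The transformer and model here are fitted on synthetic data; the real script loads the artifacts trained above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import PowerTransformer

def predict_cluster(raw_features, transformer, model):
    """Normalize [balance, unique_tokens, unique_transfers] and predict."""
    x = transformer.transform(np.asarray(raw_features, float).reshape(1, -1))
    return int(model.predict(x)[0])

# Synthetic stand-ins for the fitted preprocessing and K-means artifacts.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(500, 3))
transformer = PowerTransformer().fit(X)
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(transformer.transform(X))

cluster = predict_cluster([138.0, 5.0, 26.0], transformer, model)
```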
The final solution is a prediction endpoint triggered by a Lambda function through API Gateway. The Lambda function is a serverless cloud function, exposed via API Gateway as an API that the web app's user interface can call.
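A minimal Lambda handler along these lines might look as follows. The endpoint name is hypothetical, and the boto3 call is only reachable inside AWS, so the event-parsing helper is kept separate:

```python
import json

def extract_address(event):
    """Pull the Ethereum address out of the API Gateway event body."""
    body = json.loads(event.get("body") or "{}")
    return body.get("address", "")

def lambda_handler(event, context):
    # boto3 ships with the Lambda runtime; imported here so the helper
    # above stays testable without AWS credentials.
    import boto3
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName="eth-investor-profile",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({"address": extract_address(event)}),
    )
    cluster = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps({"cluster": cluster})}
```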
Whenever a user enters an Ethereum address and presses the Submit button, the predicted cluster value is shown in the user interface.
Model Evaluation and Validation
The final result is a web app that returns an investor profile based on the user’s provided Ethereum address. The user doesn’t need to provide any of their own personal data as all data that is needed can be extracted from the publicly-available Ethereum data source.
The models are evaluated based on the Silhouette score. As there are currently more than 100 million Ethereum addresses, it is hard to reliably sample them without making some explicit rules, such as filtering out addresses with trivial balance amounts.
The model is evaluated empirically with the chosen metric: K-means's Silhouette score is 0.38. As no ground-truth values are available, there is no way to test this objectively in a scalable manner. On the other hand, the chosen metric quantifies well how clearly the clusters are separated in the feature space.
The final solution is tested in a web app where the predicted cluster value is extracted for a sample Ethereum address.
Considering the performance, it's clear that there is not enough variance in the extracted data to create interpretable clusters. When inspecting the clusters, it was not possible to find meaningful names for them either.
The 0.5 threshold for the Silhouette coefficient is not achieved, which means that by this metric we can't claim the models have a strong grasp of the underlying data. That said, the main goal of the project was to build a minimum viable solution that can then be improved.
The biggest performance bottleneck is the set of features fed into the models. With a better understanding of the underlying data model, we could construct a better feature set.
Besides that, the preprocessing could be improved, and the number of clusters was chosen by trial and error with the same values across all algorithms. This could be sped up with a more sophisticated approach such as Silhouette analysis.
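As a sketch of that extension, a simple Silhouette analysis sweeps candidate cluster counts and keeps the best-scoring one (again on synthetic stand-in data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))  # stand-in for the preprocessed features

# Score each candidate cluster count and keep the best one.
scores = {
    k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
    for k in range(2, 9)
}
best_k = max(scores, key=scores.get)
```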
This project shows how it's possible to relatively quickly set up a machine learning pipeline and provide it as a data product in the web UI.
Considering the maturation of both data science and crypto industries, we will slowly see a shift where it will become more common to bring these technologies together in a way that was not possible before.
With cloud-managed services, a lot of complexity is abstracted away which helps data scientists who are often not trained in software engineering and DevOps/MLOps. Besides that, high-level APIs can help engineers build data products that were previously available only to specialized data professionals. With the progress of these domains, we can see a great deal of learning from each other which will enable a new paradigm of software products.
Some other projects working on crypto data solutions: