The availability of transaction traces (also known as internal transactions) in the public Ethereum dataset on BigQuery has opened new possibilities for blockchain data analysis. This enriched dataset allows you to query the balances of all Ethereum addresses efficiently. Whether you're a developer, researcher, or analyst, understanding how to leverage this data can provide deep insights into the Ethereum ecosystem.
This guide walks you through the process of querying Ethereum address balances, explains the underlying methodology, and addresses common challenges and limitations you might encounter.
Querying Top Ethereum Balances
To retrieve the top Ethereum addresses by balance, you can use a straightforward SQL query in BigQuery. The balances table, which is updated daily, contains the current Ether holdings for every address.
Here's a sample query to get the top 10 addresses with the highest balances:
#standardSQL
select *
from `bigquery-public-data.crypto_ethereum.balances`
order by eth_balance desc
limit 10
This query returns the addresses and their corresponding ETH balances, sorted from highest to lowest.
For those requiring real-time balance data, a more advanced query is necessary.
How Ethereum Balances Are Calculated
The process of generating the balances table involves several complex steps to ensure accuracy and completeness. Here's a high-level overview of the workflow:
- Data Export: All transaction traces are exported from a Parity Ethereum client node into BigQuery using the open-source Ethereum ETL tool.
- Trace Enrichment: The raw traces are enriched with a status field. This field accounts for failures in parent traces that might affect nested calls.
- Adding Special Traces: Critical, non-standard state changes that are not captured by standard traces (specifically from the Genesis block and the DAO fork event) are manually added to the dataset.
- SQL Calculation: A final SQL query aggregates the flow of Ether from all transactions and traces to compute the final balance for every address.
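The aggregation step can be pictured as a double-entry ledger: every trace credits its recipient and debits its sender. The following is a minimal Python sketch of that idea, not the actual SQL used to build the table. The row layout, the addresses, and the boolean `succeeded` flag (standing in for the status field) are assumptions made for illustration, and the sketch ignores gas fees, which a full calculation would also need to account for.

```python
from collections import defaultdict

# Hypothetical trace rows; values are in wei.
# `succeeded` is an illustrative stand-in for the dataset's status field.
TRACES = [
    {"from": None, "to": "0xaaa", "value": 100, "succeeded": True},     # genesis-style allocation
    {"from": "0xaaa", "to": "0xbbb", "value": 30, "succeeded": True},   # ordinary transfer
    {"from": "0xbbb", "to": "0xccc", "value": 10, "succeeded": False},  # reverted, must be ignored
]


def compute_balances(traces):
    """Sum incoming and outgoing Ether per address, skipping reverted traces."""
    balances = defaultdict(int)
    for trace in traces:
        if not trace["succeeded"]:
            continue  # a reverted trace moves no Ether
        if trace["to"] is not None:
            balances[trace["to"]] += trace["value"]    # credit the recipient
        if trace["from"] is not None:
            balances[trace["from"]] -= trace["value"]  # debit the sender
    return dict(balances)


balances = compute_balances(TRACES)
print(balances)
```

With this toy data, 0xaaa ends with 70 wei and 0xbbb with 30 wei, while 0xccc never appears because the only transfer to it was reverted.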
Accounting for Trace Statuses
A significant challenge in calculating accurate balances is handling transaction reversals. A nested call within a transaction might appear successful in the trace data, but if the top-level transaction fails (e.g., due to an out-of-gas error), all state changes from that nested call are reverted.
The solution involves reconstructing the call tree from the flat list of traces. By first identifying all failed top-level traces and then marking all their child traces as failed, we can ensure that balance changes from reverted transactions are correctly excluded. The status field in the final dataset is set to 1 for success and 0 for failure, mirroring the status field in Ethereum transaction receipts.
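To make the propagation concrete, here is a small Python sketch for the traces of a single transaction. It assumes each trace carries a `trace_address` path (its position in the call tree) and an `error` value, as Parity-style traces do. The function name, the sample rows, and the `failed` output flag are hypothetical. It illustrates the idea and is not the code used in the ETL pipeline.

```python
def mark_reverted(traces):
    """Return copies of the traces with an effective `failed` flag.

    A trace counts as failed if it, or any ancestor in the call tree,
    reported an error. All traces must belong to the same transaction.
    """
    # Call-tree paths of every trace that failed on its own.
    failed_paths = [tuple(t["trace_address"]) for t in traces if t["error"]]
    marked = []
    for t in traces:
        path = tuple(t["trace_address"])
        # A failed trace is an ancestor (or the trace itself) when its
        # path is a prefix of this trace's path.
        failed = any(path[:len(fp)] == fp for fp in failed_paths)
        marked.append({**t, "failed": failed})
    return marked


# Top-level call [] with children [0] and [1]; [1] fails and has a child [1, 0].
TX_TRACES = [
    {"trace_address": [], "error": None},
    {"trace_address": [0], "error": None},
    {"trace_address": [1], "error": "Out of gas"},
    {"trace_address": [1, 0], "error": None},
]

marked = mark_reverted(TX_TRACES)
for t in marked:
    print(t["trace_address"], t["failed"])
```

Because the empty path is a prefix of every other path, an error on the top-level trace marks the whole transaction as failed, which matches the out-of-gas scenario described above.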
Incorporating the DAO Fork State Changes
A historical event known as the DAO fork created irregular state changes that are not reflected in standard Parity traces. If ignored, the balances for addresses like WhiteHatDAO and TheDarkDAO would be incorrect.
To address this, the specific state changes outlined in Ethereum Improvement Proposal (EIP) 779 were manually queried and hardcoded into the Ethereum ETL tool. These are ingested into BigQuery with a trace_type of daofork. Similarly, the initial Ether allocations from the Genesis block are added with a trace_type of genesis.
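As a rough illustration of what such synthetic rows might look like, the Python sketch below turns a genesis allocation and a set of drained accounts into trace-like records. Only the `trace_type` values `genesis` and `daofork` come from the dataset. The helper names, the row layout, the addresses, and the amounts are placeholders, not the real allocation or fork data.

```python
def genesis_traces(allocations):
    """Turn {address: wei} genesis allocations into synthetic trace rows."""
    return [
        {"from": None, "to": address, "value": wei, "trace_type": "genesis"}
        for address, wei in allocations.items()
    ]


def daofork_traces(drained, refund_address):
    """Turn {address: wei} drained balances into rows that move the Ether to refund_address."""
    return [
        {"from": address, "to": refund_address, "value": wei, "trace_type": "daofork"}
        for address, wei in drained.items()
    ]


# Placeholder data for illustration only.
special = genesis_traces({"0xaaa": 100}) + daofork_traces({"0xdao": 40}, "0xrefund")
for row in special:
    print(row)
```

Once rows like these sit alongside ordinary traces, the same aggregation logic can treat them like any other transfer.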
Known Limitations and Issues
While the dataset is extensive, it's important to be aware of its current limitations. A comparison with balances reported directly by Ethereum nodes helped identify a small set of addresses where the query results show discrepancies.
The addresses known to be affected are:
- 0x4509008d923ef571fc1d29fd66d3135fa02f0b64
- 0xe5449e9a4f31c38d926b76f76571e5d0b143ef5d
- 0x0000000000000000000000000000000000000001
- 0x1f78775c8260df084f9a0e5fbdf06487b875ac4d
- 0x0000000000000000000000000000000000000003
The root cause is believed to be a bug in the Parity client that resulted in missing traces for calls to precompiled contracts. A future update to the dataset will involve resyncing the Parity nodes and rerunning the ETL process to resolve these inaccuracies.
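A comparison of this kind can be sketched in a few lines of Python. The example assumes you already have two mappings from address to balance in wei: one computed from the dataset and one reported by a node (for instance via the `eth_getBalance` JSON-RPC method). The function name and the sample values are hypothetical.

```python
def find_discrepancies(computed, reported):
    """Return {address: (computed, reported)} for every address whose balances differ."""
    diffs = {}
    for address in sorted(set(computed) | set(reported)):
        c = computed.get(address, 0)  # a missing address is treated as a zero balance
        r = reported.get(address, 0)
        if c != r:
            diffs[address] = (c, r)
    return diffs


# Placeholder balances in wei, for illustration only.
computed = {"0xaaa": 70, "0xbbb": 30, "0x01": 0}
reported = {"0xaaa": 70, "0xbbb": 30, "0x01": 5}

diffs = find_discrepancies(computed, reported)
print(diffs)
```

Here only the placeholder address 0x01 is flagged, because its computed balance is lower than the one the node reports, which is the pattern missing traces would produce.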
Frequently Asked Questions
How often is the balances table updated?
The bigquery-public-data.crypto_ethereum.balances table is updated daily, so it provides a daily snapshot of all Ethereum address balances.
Can I get real-time balance data from BigQuery?
While the public table is updated daily, you can write more complex SQL queries that join the balances table with real-time transaction data to approximate live balances.
Why are there missing traces for some addresses?
The known discrepancies are due to a technical issue in the Parity client software used to generate the traces. This issue affects a very small number of addresses, primarily precompiled contracts, and is scheduled to be fixed in a future dataset update.
What are 'genesis' and 'daofork' traces?
These are special types of traces added manually. 'Genesis' traces represent the initial Ether distribution when the Ethereum network launched. 'Daofork' traces represent the irregular state changes executed during the DAO hard fork to move Ether from compromised contracts.
Is this dataset available for other blockchains?
The public BigQuery dataset primarily features Ethereum data. However, the same ETL methodology is also being applied to other blockchain networks to create similar public datasets.
How can I track balance changes over time?
You can write temporal queries that join the traces table with block time information. This allows you to calculate the balance of any address at any specific block height or date in the past.
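The idea behind such a temporal query can be sketched in Python: keep only the Ether flows that happened at or before the chosen moment, then sum them. The sketch assumes the traces have already been filtered to successful ones and that each row carries the timestamp of its block in a `block_timestamp` field. The function name, addresses, and values are hypothetical, and gas fees are again left out.

```python
from datetime import datetime, timezone


def utc(year, month, day):
    """Small helper to build timezone-aware UTC timestamps."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical successful traces; values are in wei.
TRACES = [
    {"from": None, "to": "0xaaa", "value": 100, "block_timestamp": utc(2015, 7, 30)},
    {"from": "0xaaa", "to": "0xbbb", "value": 30, "block_timestamp": utc(2016, 1, 1)},
    {"from": "0xaaa", "to": "0xbbb", "value": 20, "block_timestamp": utc(2017, 1, 1)},
]


def balance_at(address, traces, as_of):
    """Balance of `address` counting only traces in blocks at or before `as_of`."""
    balance = 0
    for trace in traces:
        if trace["block_timestamp"] > as_of:
            continue  # this flow happened after the requested moment
        if trace["to"] == address:
            balance += trace["value"]
        if trace["from"] == address:
            balance -= trace["value"]
    return balance


print(balance_at("0xaaa", TRACES, utc(2016, 6, 1)))
print(balance_at("0xbbb", TRACES, utc(2018, 1, 1)))
```

Filtering by block number instead of timestamp works the same way and gives the balance at a specific block height.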