Stylometric Analysis: Satoshi Nakamoto

Michael Chon
Towards Data Science
Dec 26, 2017

Abstract:

Natural Language Processing tools were applied to Satoshi Nakamoto’s Bitcoin paper, comparing it with numerous cryptocurrency-related papers in an attempt to identify the unknown person behind the name. The paper has two parts. The first is a stylometric analysis of the linguistic features and n-grams generated from each document in a corpus of the relevant literature listed on the Satoshi Nakamoto Institute; machine learning models trained on the linguistic features are used to predict the author or authors of Satoshi Nakamoto’s Bitcoin paper and his personal email texts. The second is a semantic similarity analysis, in which the content of each document in the corpus is compared by a semantic similarity score computed with the built-in functions in spaCy and gensim. The results of the two parts suggest which authors in the corpus are linguistically and semantically similar to Satoshi Nakamoto.

1 Problem Statement

Bitcoin is a long-lasting peer-to-peer digital cryptocurrency for people who are skeptical of the current monetary system, which is heavily controlled by third parties such as central and commercial banks. Bitcoin has come to the world’s attention not just because of the cryptocurrency itself, but also because of the technology behind it, which is called blockchain. The real identity of Satoshi Nakamoto, who is known as the creator of Bitcoin and blockchain, has been an intensely debated topic in the Bitcoin community. Since Satoshi Nakamoto and the people involved in the early stage of the Bitcoin project interacted only via email, nobody has met him in person; therefore, his identity is still unknown. Satoshi Nakamoto, who refused to reveal himself to the public because of privacy concerns, left a few write-ups. One is the paper “Bitcoin: A Peer-to-Peer Electronic Cash System”, which describes how Bitcoin works using blockchain; another is a set of email exchanges between Satoshi and the people who were involved in the early stage of Bitcoin.

This paper seeks to answer one question about Satoshi Nakamoto: “Who is/are linguistically and semantically similar to Satoshi Nakamoto?” It applies stylometric and semantic similarity analyses to the relevant literature listed on the Satoshi Nakamoto Institute, the Bitcoin paper, and Satoshi’s email exchanges to find out which authors are linguistically and semantically similar to Satoshi Nakamoto. Stylometric analysis is an analysis of linguistic style used to suggest whether a text belongs to a certain author based on linguistic features. Semantic similarity analysis indicates whether the content or meaning of one text is similar to that of another.

The true identity of Satoshi Nakamoto is important to the Bitcoin community. Satoshi is believed to hold approximately 1 million Bitcoins, or 7 percent of the total Bitcoin supply. He has the strongest influence on the economy of Bitcoin; if he decided to sell some of his Bitcoins, the market could respond by devaluing all the existing Bitcoins. Furthermore, revealing the true identity of Satoshi Nakamoto could lead to many upgrades to blockchain and to new applications of blockchain in fields other than finance.

2 Data

The data were gathered using a Python module called Article. Only the literature available in HTML format on the Satoshi Nakamoto Institute was collected with the module. The authors of the collected literature are known in the Bitcoin community as possible candidates for Satoshi Nakamoto: Hal Finney, Ian Grigg, Nick Szabo, Timothy C. May, and Wei Dai. A total of 29 documents were collected: 6 Hal Finney texts, 2 Ian Grigg texts, 16 Nick Szabo texts, 2 Timothy C. May texts, 1 Wei Dai text, and 2 Satoshi Nakamoto texts, one of which is the Bitcoin paper and the other his email exchanges with others. For every author except Satoshi Nakamoto, that author’s texts were combined into a single text file. The training corpus therefore contains one combined text per author other than Satoshi Nakamoto, and the test corpus contains the two individual Satoshi Nakamoto texts.

3 Methodology

The stylometric analysis has three components: linguistic features, classification algorithms, and n-grams. A total of 10 linguistic features were generated and compared from author to author across the corpora. These features were generated using the sent_tokenize function and the stopwords list built into the nltk module. The descriptions of the features are provided in Table 1. Classification algorithms (Support Vector Machine, Random Forest, and Gaussian Naive Bayes) were used to classify Satoshi Nakamoto’s Bitcoin paper and his email exchanges as the work of one of the authors in the training corpus. The algorithms were trained on the features of every author except Satoshi Nakamoto and were all implemented in Python using the scikit-learn module.
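As a rough illustration of what this kind of feature generation can look like, the sketch below computes five simple style features for a single text. The five features, the regex-based sentence splitter, the ten-word stopword set, and the sample sentence are all assumptions made for the example; they stand in for the ten features of Table 1, nltk's sent_tokenize, and nltk's stopword list, none of which are reproduced here.

```python
import re

# Tiny illustrative stopword set (an assumption); the study used nltk's list.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that"}


def stylometric_features(text):
    """Return a dict of simple, hypothetical style features for one document."""
    # Naive split after ., ! or ? followed by whitespace (stand-in for sent_tokenize).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "stopword_ratio": sum(w in STOPWORDS for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
        "commas_per_sentence": text.count(",") / len(sentences),
    }


# Made-up sample text, not taken from the corpus.
sample = (
    "The network timestamps transactions. "
    "It hashes them into a chain, and the chain is the proof."
)
features = stylometric_features(sample)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Each document then becomes one row of numbers, which is the shape a scikit-learn classifier expects as input.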

In addition, n-grams of each document in the corpora, with n from 1 to 4, were produced using the nltk module. The tokens in each document were lemmatized using nltk.WordNetLemmatizer so that singular and plural forms of the same word would not be counted as different words. First, unigrams (1-grams) were generated with and without stopwords. Afterwards, bigrams, trigrams, and quadgrams were created, compared, and analyzed to see whether Satoshi Nakamoto repeats certain ordered patterns of words and whether other authors use the same patterns in their writing.
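The n-gram comparison can be sketched with the standard library alone. The version below is a simplification under stated assumptions: it tokenizes with a regular expression instead of nltk, skips lemmatization, and runs on two made-up sentences rather than the corpus. The helper names ngram_counts and shared_ngrams are hypothetical.

```python
import re
from collections import Counter


def ngram_counts(text, n, stopwords=frozenset()):
    """Count n-grams (as tuples of tokens) in a text, optionally dropping stopwords."""
    # Keep hyphenated terms such as "proof-of-work" as a single token.
    tokens = [
        t for t in re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
        if t not in stopwords
    ]
    # zip over n shifted copies of the token list yields consecutive n-tuples.
    return Counter(zip(*(tokens[i:] for i in range(n))))


def shared_ngrams(counts_a, counts_b):
    """Map each n-gram found in both documents to its (count_a, count_b) pair."""
    return {g: (counts_a[g], counts_b[g]) for g in counts_a.keys() & counts_b.keys()}


# Made-up example sentences, not taken from the corpus.
doc_a = "the proof-of-work chain is the longest chain"
doc_b = "a proof-of-work chain is costly"

unigrams_a = ngram_counts(doc_a, 1)
common_bigrams = shared_ngrams(ngram_counts(doc_a, 2), ngram_counts(doc_b, 2))
print(unigrams_a.most_common(2))
print(sorted(common_bigrams))
```

Passing a stopword set to ngram_counts gives the "without stopwords" variant; leaving it empty gives the "with stopwords" one.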

The semantic similarity analysis was done in Python using the built-in semantic similarity functions in spaCy and gensim. In spaCy, the .similarity() method was used to compare the content of one document with that of another and express the similarity as a number between 0 and 1, where 0 means the two documents are not related to each other and 1 means that the contents of the two documents are identical. The model loaded with spacy.load(‘en_core_web_lg’) was used; it provides 300-dimensional GloVe word vectors trained on Common Crawl, with 1.1m keys and 1.1m unique vectors. In gensim, similarities.MatrixSimilarity() was used to compute the cosine similarity, a number between -1 and 1, where the closer the number is to 1, the more similar two documents are to each other in terms of content.
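To make the cosine similarity score concrete, here is a minimal pure-Python sketch on word-count vectors. It is not the spaCy or gensim implementation; the whitespace tokenizer, the helper names, and the three short example texts are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * counts_b.get(term, 0) for term, count in counts_a.items())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an empty document
    return dot / (norm_a * norm_b)


# Made-up example texts, not taken from the corpus.
a = bag_of_words("digital cash needs no bank")
b = bag_of_words("digital cash needs no mint")
c = bag_of_words("contracts bind parties")

print(round(cosine_similarity(a, a), 3))  # same document
print(round(cosine_similarity(a, b), 3))  # four of five words shared
print(round(cosine_similarity(a, c), 3))  # no words shared
```

With non-negative weights such as word counts, the score cannot drop below 0; negative values are only possible when the vectors themselves have negative components, as dense word embeddings can.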

4 Results

All of the classification algorithms in Table 3 predicted that Nick Szabo is linguistically similar to the Satoshi who wrote the Bitcoin paper and that Ian Grigg is linguistically similar to the Satoshi who exchanged the emails. Table 4 lists two unigrams from Satoshi’s email exchanges: (‘would’, 31) and (‘one’, 29). The word ‘would’ is used by Hal Finney 28 times and the word ‘one’ is used by Nick Szabo 199 times. One unigram, the word ‘contract’, is commonly used by both Ian Grigg and Nick Szabo.

With spaCy (Table 5), Wei Dai has the highest similarity score to the Bitcoin paper and Hal Finney has the highest similarity score to Satoshi’s email exchanges. With gensim (Table 6), Timothy C. May has the highest similarity score to the Bitcoin paper and Ian Grigg has the highest similarity score to Satoshi’s email exchanges. One unusual result is that Ian Grigg has a similarity score of 0.99996 to Satoshi’s email exchanges (rounded up to 1.0 in the table).

5 Conclusion

Based on the results, the Satoshi who wrote the Bitcoin paper may not be the same Satoshi who exchanged the emails. Satoshi Nakamoto may be more than one person; the name may be a pseudonym for a team of computer scientists and cryptographers who were involved in creating Bitcoin and blockchain. Nick Szabo and Ian Grigg are the two authors who are linguistically similar to Satoshi Nakamoto in the Bitcoin paper and in his email texts, respectively. In addition, Wei Dai and Timothy C. May are two potential candidates for the Bitcoin paper in terms of semantic similarity. Hal Finney and Ian Grigg are two possible candidates for Satoshi’s email exchanges. Since it is a known fact that Hal Finney interacted with Satoshi Nakamoto via email, Hal Finney should not be included in the list of possible candidates for the Satoshi who exchanged emails; that leaves Ian Grigg, who is both linguistically and semantically similar to Satoshi Nakamoto. Therefore, the possible candidates for Satoshi Nakamoto are Nick Szabo, Ian Grigg, Wei Dai, and Timothy C. May.

6 Discussion

Satoshi Nakamoto used the phrase “proof-of-work” repeatedly throughout the Bitcoin paper, and Nick Szabo is the only author in the training corpus who used the exact same phrase, in his blog post called Bit gold. This supports the theory that Nick Szabo is very close to Satoshi in terms of linguistic style. The document distances between the authors in the corpus were visualized in two dimensions using multidimensional scaling (MDS) from sklearn. In Pic 1, the distance between Ian Grigg and Nick Szabo is the shortest, suggesting that Ian Grigg and Nick Szabo are closely related to each other, which might not be a coincidence. Wei Dai and Timothy C. May are far from each other and from both Nick Szabo and Ian Grigg, possibly suggesting that Wei Dai and Timothy C. May are not strong candidates for Satoshi Nakamoto compared to Nick Szabo and Ian Grigg.
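For readers unfamiliar with MDS, the sketch below shows the idea of turning a matrix of pairwise document distances into 2-D coordinates. It swaps in classical (eigendecomposition-based) MDS written with NumPy rather than the iterative MDS in sklearn that produced Pic 1, and the three-document distance matrix is made up for the example, so the output says nothing about the actual authors.

```python
import numpy as np


def classical_mds(distances, n_components=2):
    """Embed points in n_components dimensions from a symmetric distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip tiny negatives caused by rounding.
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Hypothetical distances between three documents (not results from the study).
labels = ["doc_a", "doc_b", "doc_c"]
distances = np.array([
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
])

coords = classical_mds(distances)
# Pairwise distances between the embedded points, for comparison with the input.
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
for label, (x, y) in zip(labels, coords):
    print(f"{label}: ({x:.3f}, {y:.3f})")
```

The coordinates are only determined up to rotation and reflection, so only the distances between points in such a plot are meaningful, not their absolute positions.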

7 Future Work

Training the classification algorithms on only 5 data points, each consisting of 10 features, is a concern. Ideally, the machine learning models would be trained on a larger training sample and evaluated with cross-validation. Along with the absolute frequency of n-grams, the relative frequency of n-grams could be added to the study. In addition, Craig Steven Wright, who has claimed to be Satoshi Nakamoto, could be added, if possible, as one of the authors in the training corpus, because the algorithms and the semantic similarity measures allow him to be compared with the authors in the current training corpus. It would be interesting to see whether he would come out ahead of Nick Szabo and Ian Grigg, who are the two strongest candidates for Satoshi Nakamoto.
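The point about relative frequency matters because the combined texts differ greatly in length (16 Nick Szabo documents versus 1 Wei Dai document), so raw counts favor the longer text. A minimal sketch, using made-up counts and a hypothetical helper name:

```python
from collections import Counter


def relative_frequencies(counts):
    """Convert absolute n-gram counts to shares of the document's total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Made-up counts: a short text of 10 tokens and a long text of 100 tokens.
short_text = Counter({"one": 2, "other": 8})
long_text = Counter({"one": 5, "other": 95})

rel_short = relative_frequencies(short_text)
rel_long = relative_frequencies(long_text)

# The long text uses "one" more often in absolute terms but less often per token.
print(short_text["one"], long_text["one"])
print(rel_short["one"], rel_long["one"])
```

Comparing the two printed lines shows how the ranking of the two texts flips once document length is taken into account.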

P.S. I just realized that I cannot insert tables here, so I shared the paper on Google Drive instead.
