What is WhoIsWho?
WhoIsWho offers:
The world's largest manually labeled name disambiguation benchmark, with over 1,000,000 papers built through an interactive annotation process;
A regular leaderboard with comprehensive tasks, i.e., From-scratch Name Disambiguation, Real-time Name Disambiguation, and Incorrect Assignment Detection (the historical WhoIsWho contests have already attracted more than 3,000 researchers);
An easy-to-use toolkit that encapsulates the entire pipeline, along with powerful features and baseline models for tackling the tasks.
Please refer to the WhoIsWho paper for more details.
Get Started
It is easy to get started with WhoIsWho. All the data are available through Data Download on the WhoIsWho page (a minimal loading sketch follows the split description below).
The regular leaderboard shown on this page is based on the na-v3 version of the data; specifically:
Training Set
na-v1, na-v2, and the training part of na-v3;
Validation Set
the validation part of na-v3;
Test Set
the test part of na-v3.
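After downloading, you can sanity-check the files by loading them directly. Below is a minimal sketch in Python, assuming the nested schema described in the data format descriptions (train_author.json mapping author name to author ids to paper ids, and train_pub.json mapping paper id to metadata):

import json

# Load one split (file names as in the Data section below).
with open("train_author.json", encoding="utf-8") as f:
    authors = json.load(f)  # {name: {author_id: [paper_id, ...]}} (assumed schema)
with open("train_pub.json", encoding="utf-8") as f:
    pubs = json.load(f)     # {paper_id: {"title": ..., "authors": [...], ...}} (assumed schema)

print(len(authors), "ambiguous names,", len(pubs), "papers")
for name, profiles in list(authors.items())[:3]:
    n_papers = sum(len(pids) for pids in profiles.values())
    print(f"{name}: {len(profiles)} distinct authors, {n_papers} papers")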
To preserve the integrity of test results, we do not release the test set labels to the public. Instead, we require you to submit your predicted results to the corresponding contest so that we can evaluate them.
We will automatically publish your model's performance on the leaderboard. You can also provide detailed information via the WhoIsWho email, such as the model name, organization, and paper/code links, which will be shown on the leaderboard.
Have Questions?
Ask us questions at oagwhoiswho@gmail.com.
Citation
If you find WhoIsWho helpful, please cite the following papers:
@article{chen2023web,
   title={Web-Scale Academic Name Disambiguation: the WhoIsWho Benchmark, Leaderboard, and Toolkit},
   author={Chen, Bo and Zhang, Jing and Zhang, Fanjin and Han, Tianyi and Cheng, Yuqing and Li, Xiaoyan and Dong, Yuxiao and Tang, Jie},
   journal={arXiv preprint arXiv:2302.11848},
   year={2023}
}

@inproceedings{tang2008arnetminer,
   title={Arnetminer: extraction and mining of academic social networks},
   author={Tang, Jie and Zhang, Jing and Yao, Limin and Li, Juanzi and Zhang, Li and Su, Zhong},
   booktitle={Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining},
   pages={990--998},
   year={2008}
}
Leaderboard for From-scratch Name Disambiguation
Columns: Rank, Method, Organization, References, Metric (P-F1)
Leaderboard for Real-time Name Disambiguation
Columns: Rank, Method, Organization, References, Metric (W-F1)
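For reference, P-F1 on the from-scratch task denotes pairwise F1: precision and recall are computed over pairs of papers placed in the same cluster (see the WhoIsWho paper for the exact definitions). A minimal sketch of the computation, as a toy illustration rather than the official evaluation script:

from itertools import combinations

def pairwise_f1(pred_clusters, true_clusters):
    # Each argument is a list of clusters (lists of paper ids) for one name.
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(c, 2)}
    pred, true = pairs(pred_clusters), pairs(true_clusters)
    if not pred or not true:
        return 0.0
    precision = len(pred & true) / len(pred)
    recall = len(pred & true) / len(true)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: one name split into two predicted clusters vs. a single true cluster.
print(pairwise_f1([["p1", "p2"], ["p3"]], [["p1", "p2", "p3"]]))  # 0.5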
Data
Benchmark
na-v3
The na-v3 competition data split.
Train.
1.train_author.json [Download]
2.train_pub.json [Download]
Data Format Descriptions [Download]
Validation.
I. Name Disambiguation from Scratch
1.sna_valid_author_raw.json [Download]
2.sna_valid_pub.json [Download]
3.sna_valid_example_evaluation_scratch.json [Download]
4.sna_valid_author_ground_truth.json [Download]
Data Format Descriptions [Download]
II. Incremental Name Disambiguation (a toy assignment sketch follows this na-v3 listing)
Existing Author Profiles.
1.whole_author_profile.json [Download]
2.whole_author_profile_pub.json [Download]
New (Unassigned) Papers.
1.cna_valid_unass_competition.json [Download]
2.cna_valid_pub.json [Download]
3.cna_valid_example_evaluation_continuous.json [Download]
4.cna_valid_author_ground_truth.json [Download]
Data Format Descriptions [Download]
Test.
I. Name Disambiguation from Scratch
1.sna_test_author_raw.json [Download]: the same format as sna_valid_author_raw.json;
2.sna_test_pub.json [Download]: contains the paper information for the papers in sna_test_author_raw.json (the same format as train_pub.json).
II. Incremental Name Disambiguation
1.cna_test_unass_competition.json [Download]: the same format as cna_valid_unass_competition.json;
2.cna_test_pub.json [Download]: contains the paper information for the papers in cna_test_unass_competition.json (the same format as train_pub.json).
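To make the incremental task concrete: given the existing author profiles and the new unassigned papers listed above, a system must assign each paper to the matching profile. Below is a toy coauthor-overlap baseline; the paper/profile schemas and ids are illustrative assumptions (see the data format descriptions for the actual formats), not the official baseline:

def coauthor_overlap(paper, profile_paper_ids, pubs):
    # Number of author-name matches between `paper` and a profile's papers.
    names = {a["name"].lower() for a in paper.get("authors", [])}
    return sum(a["name"].lower() in names
               for pid in profile_paper_ids
               for a in pubs.get(pid, {}).get("authors", []))

def assign(paper, candidates, pubs):
    # `candidates` maps author_id -> [paper_id, ...] for same-name profiles.
    return max(candidates, key=lambda aid: coauthor_overlap(paper, candidates[aid], pubs))

# Toy data in the assumed format.
pubs = {"p1": {"authors": [{"name": "Jing Zhang"}, {"name": "Jie Tang"}]},
        "p2": {"authors": [{"name": "Jing Zhang"}, {"name": "Bo Chen"}]}}
new_paper = {"authors": [{"name": "Jing Zhang"}, {"name": "Jie Tang"}]}
print(assign(new_paper, {"a1": ["p1"], "a2": ["p2"]}, pubs))  # a1

A competitive system would combine far richer signals (organizations, venues, keywords, semantic similarity); the WhoIsWho toolkit packages such features and baseline models.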
na-v1
Summary.
The whole dataset of the na-v1 version.
1.na_v1_author.json [Download]
2.na_v1_pub.json [Download]
Data Format Descriptions [Download]
Competition Format.
The na-v1 competition data split.
Train.
1.train_author.json [Download]
2.train_pub.json [Download]
Data Format Descriptions [Download]
Validation.
I. Name Disambiguation from Scratch
1.sna_valid_author_raw.json [Download]
2.sna_valid_pub.json [Download]
3.sna_valid_example_evaluation_scratch.json [Download]
4.sna_valid_author_ground_truth.json [Download]
Data Format Descriptions [Download]
II. Continuous Name Disambiguation
Existing Author Profiles.
1.whole_author_profile.json [Download]
2.whole_author_profile_pub.json [Download]
New (Unassigned) Papers.
1.cna_valid_unass_competition.json [Download]
2.cna_valid_pub.json [Download]
3.cna_valid_example_evaluation_continuous.json [Download]
4.cna_valid_author_ground_truth.json [Download]
Data Format Descriptions [Download]
Test.
I. Name Disambiguation from Scratch
1.sna_test_author_raw.json [Download]: the same format as sna_valid_author_raw.json;
2.sna_test_pub.json [Download]: contains the paper information for the papers in sna_test_author_raw.json (the same format as train_pub.json);
3.sna_test_author_ground_truth.json [Download]: the ground-truth clustering results for sna_test_author_raw.json.
II. Continuous Name Disambiguation
1.cna_test_unass_competition.json [Download]: the same format as cna_valid_unass_competition.json;
2.cna_test_pub.json [Download]: contains the paper information for the papers in cna_test_unass_competition.json (the same format as train_pub.json);
3.cna_test_author_ground_truth.json [Download]: the ground-truth assignment results for cna_test_unass_competition.json.
na-v2
Summary.
The whole dataset of the na-v2 version.
1.na_v2_author.json [Download]
2.na_v2_pub.json [Download]
Data Format Descriptions [Download]
Competition Format.
The na-v2 competition data split.
Train.
1.train_author.json [Download]
2.train_pub.json [Download]
Data Format Descriptions [Download]
Validation.
I. Name Disambiguation from Scratch
1.sna_valid_author_raw.json [Download]
2.sna_valid_pub.json [Download]
3.sna_valid_example_evaluation_scratch.json [Download]
4.sna_valid_author_ground_truth.json [Download]
Data Format Descriptions [Download]
II. Continuous Name Disambiguation
Existing Author Profiles.
1.whole_author_profile.json [Download]
2.whole_author_profile_pub.json [Download]
New (Unassigned) Papers.
1.cna_valid_unass_competition.json [Download]
2.cna_valid_pub.json [Download]
3.cna_valid_example_evaluation_continuous.json [Download]
4.cna_valid_author_ground_truth.json [Download]
Data Format Descriptions [Download]
Test.
I. Name Disambiguation from Scratch
1.sna_test_author_raw.json [Download]: the same format as sna_valid_author_raw.json;
2.sna_test_pub.json [Download]: contains the paper information for the papers in sna_test_author_raw.json (the same format as train_pub.json);
3.sna_test_author_ground_truth.json [Download]: the ground-truth clustering results for sna_test_author_raw.json.
II. Continuous Name Disambiguation
1.cna_test_unass_competition.json [Download]: the same format as cna_valid_unass_competition.json;
2.cna_test_pub.json [Download]: contains the paper information for the papers in cna_test_unass_competition.json (the same format as train_pub.json);
3.cna_test_author_ground_truth.json [Download]: the ground-truth assignment results for cna_test_unass_competition.json.
Note:
1. For na-v1, model performance on the test set should be significantly better than on the validation set. The reason: for author names with high ambiguity (a large number of distinct authors sharing the same name), only a subset of the same-name authors was manually labeled. The validation set keeps all authors with a given name in order to reproduce the actual online situation, so it inevitably contains some noise; the test set keeps only the manually labeled same-name authors, so that model performance can be judged cleanly.
2. We did not release the test ground truth, to keep the comparison fair. We recommend evaluating your models on the validation set first, then submitting your model and detailed running instructions to oagwhoiswho@gmail.com. Please allow us a few days to evaluate your model; we will reply after the evaluation (a public evaluation platform is under construction and will be launched soon).
3. If you want your model's performance listed on the leaderboard, please inform us explicitly and include your model name and the authors' affiliations.