Garbage in, garbage out. A model trained on low-quality data will produce low-quality predictions. Data curation is arguably the most important — and most underappreciated — step in the ML pipeline.
Repository selection uses quality proxies to filter the hundreds of millions of public repos down to a manageable, high-quality subset:
- Minimum stars — popularity as a quality signal
- Active maintenance — recent commits indicate a living project
- Non-fork status — avoid counting duplicated repositories
- Proper licensing — ensure legal use for training
- Meaningful commit history — enough data to be useful
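The criteria above can be expressed as a single filter over repository metadata. The sketch below is illustrative only: the thresholds (`min_stars`, `max_idle_days`, `min_commits`) and the license allow-list are hypothetical values, and `commit_count` is assumed to be a field collected separately, since GitHub's repository search results do not include it. The other field names (`stargazers_count`, `fork`, `pushed_at`, `license.spdx_id`) follow the GitHub REST repository object.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical allow-list of permissive SPDX license identifiers.
ALLOWED_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}


def passes_quality_filter(repo, now=None, min_stars=1000,
                          max_idle_days=365, min_commits=100):
    """Return True if a repository metadata dict meets every quality proxy.

    All thresholds are illustrative assumptions, not recommended values.
    """
    now = now or datetime.now(timezone.utc)

    # Minimum stars: popularity as a quality signal.
    if repo.get("stargazers_count", 0) < min_stars:
        return False

    # Non-fork status: skip duplicated repositories.
    if repo.get("fork", False):
        return False

    # Active maintenance: the last push must be recent enough.
    pushed_at = repo.get("pushed_at")
    if not pushed_at:
        return False
    last_push = datetime.fromisoformat(pushed_at.replace("Z", "+00:00"))
    if now - last_push > timedelta(days=max_idle_days):
        return False

    # Proper licensing: the license must be on the allow-list.
    license_info = repo.get("license") or {}
    if license_info.get("spdx_id") not in ALLOWED_LICENSES:
        return False

    # Meaningful commit history ("commit_count" is an assumed extra field).
    return repo.get("commit_count", 0) >= min_commits
```

Passing `now` explicitly keeps the check reproducible, which matters when the same filter is re-run later to rebuild a dataset.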
Tools for Mining at Scale
Specialized APIs and search platforms make large-scale dataset construction possible: the GitHub REST and GraphQL APIs (rate-limited; 5,000 requests/hour with authentication for the REST API, with a separate, stricter limit on the search endpoints), SEART-GHS (a search engine for GitHub repos with advanced filtering), and client libraries such as PyGithub (Python), Octokit (JS), and go-github (Go).
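Because these APIs are rate-limited, a mining script usually needs to pause once its quota is used up. GitHub reports the quota in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. The helper below is a minimal sketch of how a wait time could be derived from them; the function name `seconds_until_reset` and the one-second safety margin are assumptions made for illustration.

```python
import time


def seconds_until_reset(headers, now=None, margin=1.0):
    """Return how many seconds to wait before sending the next request.

    `headers` is any mapping of response headers, for example
    `response.headers` from the requests library.
    Returns 0.0 while quota remains.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    # X-RateLimit-Reset is a Unix timestamp in UTC epoch seconds.
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now) + margin
```

A caller might run `time.sleep(seconds_until_reset(response.headers))` before issuing the next request.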
Here’s how to query GitHub’s API for high-quality Java repositories, filtering by stars and excluding forks:
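import requests

def fetch_top_java_repos(num_repos=200, per_page=100):
    """Return up to num_repos non-fork Java repositories, most-starred first."""
    url = "https://api.github.com/search/repositories"
    repos = []
    page = 1
    while len(repos) < num_repos:
        params = {
            "q": "language:java stars:>1000",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,  # the API caps this at 100
            "page": page,
        }
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()  # fail loudly on rate limiting or other HTTP errors
        items = response.json().get("items", [])
        if not items:
            break  # no more results; without this the loop would never end
        for item in items:
            # Skip forks so duplicated code is not counted twice
            if item.get("fork", False):
                continue
            repos.append({
                "full_name": item["full_name"],
                "clone_url": item["clone_url"],
                "stars": item["stargazers_count"],
            })
        page += 1
    return repos[:num_repos]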
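The request above is sent anonymously, which leaves it on the much lower unauthenticated rate limit. One way to authenticate is to attach a personal access token as a bearer token. The helper below is a sketch under assumptions: the function name `github_headers` and the `GITHUB_TOKEN` environment variable are conventions chosen for this example, not something the API requires.

```python
import os


def github_headers(token=None):
    """Build request headers for the GitHub REST API.

    If no token is passed, fall back to the GITHUB_TOKEN environment
    variable (an assumed convention). Without a token the request
    stays anonymous.
    """
    headers = {"Accept": "application/vnd.github+json"}
    token = token or os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

The result can be passed to the request as `requests.get(url, params=params, headers=github_headers())`.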
Once we have the repo list, we shallow-clone each one — --depth 1 grabs only the latest snapshot, saving time and disk space:
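import subprocess

def clone_repo(clone_url, dest_dir):
    cmd = ["git", "clone", "--depth", "1", "--quiet", clone_url, dest_dir]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    except subprocess.TimeoutExpired:
        return False  # treat a clone that exceeds 120 seconds as a failure
    return result.returncode == 0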
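To apply the clone step to the whole list, a small driver can derive one destination directory per repository and record which clones failed. This is a sketch under assumptions: `clone_all` and the `owner__name` directory naming are hypothetical choices, and the clone function is passed in as a parameter, so any callable with the signature `(clone_url, dest_dir) -> bool` can be used.

```python
import os


def clone_all(repos, base_dir, clone_fn):
    """Clone every repo under base_dir and return (ok, failed) name lists.

    `repos` is a list of dicts with "full_name" and "clone_url" keys.
    `clone_fn(clone_url, dest_dir)` must return True on success.
    """
    ok, failed = [], []
    for repo in repos:
        # "owner/name" -> "owner__name" so each repo gets a flat, unique directory.
        dest_dir = os.path.join(base_dir, repo["full_name"].replace("/", "__"))
        if clone_fn(repo["clone_url"], dest_dir):
            ok.append(repo["full_name"])
        else:
            failed.append(repo["full_name"])
    return ok, failed
```

For example, `ok, failed = clone_all(fetch_top_java_repos(), "repos", clone_repo)` clones the selected repositories into `repos/` and keeps the names of the failures so they can be retried or excluded.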