Regular XML snapshots of all Wikipedia articles, talk pages, and revision histories, available for bulk download. Used in data engineering for building large-scale NLP corpora, extracting knowledge graphs, building full-text search indices, and preparing language-model training data, typically with Python tools such as WikiExtractor.
Engineers download Wikipedia XML dumps from dumps.wikimedia.org and process them with `mwparserfromhell` or the `gensim.corpora.wikicorpus` module. The WikiExtractor tool strips wiki markup to produce clean article text ready for tokenization and embedding.
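Before any markup stripping, the dump has to be read one page at a time, because the full file is far too large to load into memory. The following is a minimal standard-library sketch of that streaming step, not the approach used by WikiExtractor or gensim. The helper names `iter_pages` and `local_name`, the tiny in-memory `SAMPLE`, and the `export-0.10` namespace in it are illustrative assumptions; a real dump may declare a different schema version, which is why the sketch ignores the namespace.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> element in a dump stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to limit memory use


# Hypothetical two-page sample that mimics the dump layout.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alpha</title>
    <ns>0</ns>
    <revision><text>'''Alpha''' is the first [[letter]].</text></revision>
  </page>
  <page>
    <title>Beta</title>
    <ns>0</ns>
    <revision><text>'''Beta''' follows [[Alpha]].</text></revision>
  </page>
</mediawiki>"""

# Compress the sample in memory so the sketch runs without a download.
compressed = io.BytesIO(bz2.compress(SAMPLE))
with bz2.open(compressed, "rb") as stream:
    pages = list(iter_pages(stream))

for title, text in pages:
    print(title, "-", len(text), "characters")
```

To run it on a real dump, pass the path of the downloaded `.xml.bz2` file to `bz2.open` instead of the in-memory buffer. The yielded text is still raw wiki markup; a parser such as `mwparserfromhell` would be the next step. For dumps that contain several revisions per page, this sketch keeps only the last revision it sees.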
Wikipedia dumps are a common source of training and fine-tuning data for LLMs, and many widely used language models include Wikipedia text in their pre-training corpora. For RAG, process the dumps into chunked embeddings indexed in a vector database to build a broad knowledge base that grounds LLM responses in encyclopedic content traceable to its source article.
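The chunking step of such a RAG pipeline can be sketched with plain Python. This is one simple strategy (fixed-size word windows with overlap), shown under stated assumptions: the function name `chunk_words` and the `size` and `overlap` defaults are hypothetical, and the embedding model and vector database are left out because the choice depends on the stack in use.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into overlapping word windows of at most `size` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny stand-in for cleaned article text.
article = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(article, size=4, overlap=1)
for number, chunk in enumerate(chunks):
    print(number, chunk)
```

The overlap repeats a few words at each boundary so that a sentence cut between two chunks still appears whole in at least one of them. Each chunk would then be embedded and stored alongside the article title so that retrieved passages can be attributed.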
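A minimal gensim example that prints the title and token count of the first five articles:

```python
# pip install gensim
# Download the dump first: https://dumps.wikimedia.org/enwiki/latest/
from gensim.corpora.wikicorpus import WikiCorpus

# dictionary={} skips the vocabulary-building pass over the whole dump
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", dictionary={})
wiki.metadata = True  # get_texts() then yields (tokens, (page_id, title))

for i, (tokens, (page_id, title)) in enumerate(wiki.get_texts()):
    print(title, "-", len(tokens), "tokens")
    if i >= 4:
        break
```

Official dataset source: https://dumps.wikimedia.org/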
More datasets used by Python data engineers.
A curated repository of 600+ datasets covering classification, regression, clustering, and time-series tasks, widely used as machine learning benchmarks. Used in data engineering for building ML training pipelines, practising data preprocessing workflows, and loading tabular datasets into model training systems in Python.
Thousands of publicly available datasets hosted on GitHub repositories covering social media, finance, healthcare, sports, and scientific domains. Accessible directly via the GitHub API or raw download URLs, making them ideal for practising version-controlled data ingestion and automated dataset pipelines in Python.
The Global Terrorism Database (GTD), maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland, provides detailed information on terrorist attacks worldwide.