id
type
5 (blog/news article)
status
21 (imported old-v2, waiting for another import)
review version
0
cleanup version
0
pending deletion
0 (-)
created at
2025-11-17 19:55:51
updated at
2025-11-17 19:55:53
url
https://gregreda.com/2013/07/15/unix-commands-for-data-science/
url length
63
url crc
2323
url crc32
2915305747
location type
1 (url matches target location, page_location is empty)
canonical status
2 (missing canonical tag in html)
canonical page id
-
domain id
domain tld
2211
domain parts
0
originating warc id
-
originating url
https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279968.16/warc/CC-MAIN-20250807151203-20250807181203-00783.warc.gz
source type
11 (CommonCrawl)
server ip
Publication date
2025-08-07 15:38:00
Fetch attempts
0
Original html size
19639
Normalized and saved size
19639
title
Useful Unix commands for data science
excerpt
content
Imagine you have a 4.2GB CSV file. It has over 12 million records and 50 columns. All you need from this file is the sum of all values in one particular column. How would you do it?

Writing a script in Python/Ruby/Perl/whatever would probably take a few minutes, and then even more time for the script to actually complete. A database and SQL would be fairly quick, but then you'd have to load the data, which is kind of a pain.

Thankfully, the Unix utilities exist and they're awesome. To get the sum of a column in a huge text file, we can easily use awk. And we won't even need to read the entire file into memory.

Let's assume our data, which we'll call data.csv, is pipe-delimited (|), and we want to sum the fourth column of the file.

    cat data.csv | awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'

The above line says:

Use the cat command to stream (print) the contents of the file to stdout.
Pipe the streaming contents from our cat command to the next one - awk.
With awk:...
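To try the column sum without a multi-gigabyte file, here is a minimal sketch on a three-row sample. The sample rows and the temporary file are made up for illustration and are not from the original data. The sketch also passes the file name straight to awk instead of piping from cat, which should give the same result, since awk still processes the input one line at a time.

```shell
# Create a small pipe-delimited sample file (hypothetical data, illustration only).
sample=$(mktemp)
printf '%s\n' 'a|b|c|10.50|x' 'd|e|f|20.25|y' 'g|h|i|30.00|z' > "$sample"

# -F '|' sets the field separator; the block runs once per line and adds
# field 4 to a running total; END runs after the last line and prints it.
total=$(awk -F '|' '{ sum += $4 } END { printf "%.2f\n", sum }' "$sample")
echo "$total"

# Remove the temporary sample file.
rm -f "$sample"
```

Because awk keeps only the running total and the current line, memory use should stay roughly constant regardless of how large the input file is.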
author
Greg Reda
updated
1763677254
block type
0
extracted fields
109
extracted bits
featured image
article author
title
full content
content was extracted heuristically
detected location
0
detected language
1 (English)
category id
index version
2025123101
paywall score
0
spam phrases
0
text nonlatin
0
text cyrillic
0
text characters
4922
text words
1193
text unique words
434
text lines
1
text sentences
60
text paragraphs
1
text words per sentence
19
text matched phrases
3
text matched dictionaries
3
links self subdomains
0
links other subdomains
1
links other domains
12
links spam adult
0
links spam random
0
links spam expired
0
links ext activities
2
links ext ecommerce
0
links ext finance
0
links ext crypto
0
links ext booking
0
links ext news
0
links ext leaks
0
links ext ugc
26
links ext klim
0
links ext generic
0
image author
featured image