Main

type

5 (blog/news article)

status

21 (imported old-v2, waiting for another import)

review version

0

cleanup version

0

pending deletion

0 (-)

created at

2025-11-17 19:55:51

updated at

2025-11-17 19:55:53

Address

url

https://gregreda.com/2013/07/15/unix-commands-for-data-science/

url length

63

url crc

2323

url crc32

2915305747

location type

1 (url matches target location, page_location is empty)

canonical status

2 (missing canonical tag in html)

canonical page id

-

Source

domain id

49335166

domain tld

2211

domain parts

0

originating warc id

-

originating url

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151279968.16/warc/CC-MAIN-20250807151203-20250807181203-00783.warc.gz

source type

11 (CommonCrawl)

Server response

server ip

3.162.112.127

Publication date

2025-08-07 15:38:00

Fetch attempts

0

Original html size

19639

Normalized and saved size

19639

Content

title

Useful Unix commands for data science

excerpt

content

Imagine you have a 4.2GB CSV file. It has over 12 million records and 50 columns. All you need from this file is the sum of all values in one particular column. How would you do it? Writing a script in Python/Ruby/Perl/whatever would probably take a few minutes and then even more time for the script to actually complete. A database and SQL would be fairly quick, but then you'd have to load the data, which is kind of a pain. Thankfully, the Unix utilities exist and they're awesome. To get the sum of a column in a huge text file, we can easily use awk. And we won't even need to read the entire file into memory. Let's assume our data, which we'll call data.csv, is pipe-delimited ( | ), and we want to sum the fourth column of the file. cat data.csv | awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }' The above line says: Use the cat command to stream (print) the contents of the file to stdout. Pipe the streaming contents from our cat command to the next one - awk. With awk:...
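
The column sum described in the excerpt can be tried on a small file. This is a minimal sketch, not the article's own listing: the three sample rows and the variable names `sample` and `total` are hypothetical, and awk is given the file name directly instead of reading from cat, which should behave the same for a single file.

```shell
# Hypothetical sample data standing in for data.csv (pipe-delimited, 4 columns).
sample=$(mktemp)
printf '%s\n' 'a|b|c|1.50' 'd|e|f|2.25' 'g|h|i|3.00' > "$sample"

# -F '|' sets the field separator; awk processes one line at a time,
# adds the fourth field to sum, and prints the total once input ends.
total=$(awk -F '|' '{ sum += $4 } END { printf "%.2f\n", sum }' "$sample")
echo "$total"

# Remove the temporary file.
rm -f "$sample"
```

Because awk keeps only the running total, memory use stays flat however large the input file is.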

author

Greg Reda

updated

1763677254

Text analysis

block type

0

extracted fields

109

extracted bits

featured image
article author
title
full content
content was extracted heuristically

detected location

0

detected language

1 (English)

category id

AI applications (149)

index version

2025123101

paywall score

0

spam phrases

0

Text statistics

text nonlatin

0

text cyrillic

0

text characters

4922

text words

1193

text unique words

434

text lines

1

text sentences

60

text paragraphs

1

text words per sentence

19

text matched phrases

3

text matched dictionaries

3