Main

type

0 (not classified)

status

21 (imported old-v2, waiting for another import)

review version

0

cleanup version

0

pending deletion

0 (-)

created at

2025-10-25 06:12:32

updated at

2025-10-25 06:12:33

Address

url

https://opir.columbia.edu/understanding-columbias-common-data-set

url length

65

url crc

15531

url crc32

2579119275

location type

1 (url matches target location, page_location is empty)

canonical status

10 (verified canonical url)

canonical page id

2817584539

Source

domain id

3528414

domain tld

2295

domain parts

0

originating warc id

-

originating url

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151280504.46/warc/CC-MAIN-20250811162052-20250811192052-00515.warc.gz

source type

11 (CommonCrawl)

Server response

server ip

162.159.138.64

Publication date

2025-08-11 17:14:18

Fetch attempts

0

Original html size

110467

Normalized and saved size

92501

Content

title

Understanding Columbia's Common Data Set | Columbia OPIR

excerpt

content

author

updated

1763389761

Text analysis

block type

0

extracted fields

8

extracted bits

title

detected location

0

detected language

1 (English)

category id

Education (47)

index version

2025110801

paywall score

0

spam phrases

0

Text statistics

text nonlatin

0

text cyrillic

0

text characters

32034

text words

5590

text unique words

1220

text lines

247

text sentences

224

text paragraphs

77

text words per sentence

24

text matched phrases

175

text matched dictionaries

6