Working with Imported Data - the Base R#

Learning Objectives#

  • Set up the working directory and import data into R.

  • Explore and manipulate data frames to extract insights.

  • Perform basic data cleaning, including handling dates and missing values.

Now you have mastered the basics of R. But we really want to learn how to import data! So let us do that.

We have a CSV file of the Atlantic Hurricane Database (HURDAT2) 1851-2022. The data is released by the National Hurricane Center (NHC) at NOAA. Use this link to download: https://tufts.box.com/shared/static/gdqc9tdv7334622tkco37oyg9t2fyeaf.csv

But how do we import this data so we can use it in R?

Working Directory#

All relative file paths in R are relative to the current working directory, which might or might not be the location of the R script, depending on how RStudio was originally launched. If RStudio was launched by double-clicking on the R script or by right-clicking on the R script and then selecting Open With > RStudio, the default working directory will be the location of the R script. Otherwise the default working directory will either be your Documents directory (Windows) or your home directory (macOS and Linux).

We can use getwd() and dir() to explore the current working directory.

getwd()     # The current location of the working directory.
dir()       # The files in the working directory.
'/home/runner/work/guides/guides/source/source/r'
  1. '00_Getting_started_with_R.md'
  2. '01_Intro_to_RStudio.md'
  3. '02_intro-r_script excution and basics.Rmd'
  4. '03_working_with_imported_data.Rmd'
  5. '04_working_with_imported_data_tidyverse.Rmd'
  6. 'animation.gif'
  7. 'atlantic.csv'
  8. 'Fast_Loops_and_Parallel_Processing.md'
  9. 'index.md'
  10. 'r_data_dataviz'
  11. 'r-data-viz_files'

You can use the %in% operator to ensure that your working directory contains data file (atlantic.csv).

"atlantic.csv" %in% dir()
TRUE

If the statement above returns TRUE, you are all set. But if it returns FALSE, you need to change your working directory. Go to the top menu bar: Session > Set Working Directory > To Source File Location.

We can ensure the working directory is set correctly by either re-running the commands from above or clicking on the Files tab in the lower-right panel and then selecting More > Go To Working Directory and exploring the results.

Importing Data#

The atlantic.csv data file is in comma-separated values (CSV) format. (Note that the specifics of this format are outlined in standard RFC 4180.) This is a very common data format and you can easily import it in R as follows.

hurrdata <- read.csv("atlantic.csv")

The read.csv() function has numerous additional optional arguments that we can use to specify how exactly a data file should be read in and interpreted.

To investigate those, we can use the help() function or the ? operator.

?read.csv

Note that CSV files are different from Excel spreadsheets (XLS or XLSX files) and R does not contain the functionality to import the latter by default. An external community-developed package must be used to import Excel files. Installing and loading external packages is discussed later in this script.

Exploring Data#

We see that a new variable “hurrdata” has been added to the environment.

head(hurrdata)
A data.frame: 6 × 23
IDNameDateTimeEventStatusLatitudeLongitudeMaximum.WindMinimum.PressureRadii.NW.34ktRadii.NE.50ktRadii.SE.50ktRadii.SW.50ktRadii.NW.50ktRadii.NE.64ktRadii.SE.64ktRadii.SW.64ktRadii.NW.64ktMaximum.Wind.Radius
<chr><chr><int><int><chr><chr><chr><chr><int><int><int><int><int><int><int><int><int><int><int><int>
1AL011851UNNAMED18510625 0 HU28.0N94.8W80-999-999-999-999-999-999-999-999-999-999-999
2AL011851UNNAMED18510625 600 HU28.0N95.4W80-999-999-999-999-999-999-999-999-999-999-999
3AL011851UNNAMED185106251200 HU28.0N96.0W80-999-999-999-999-999-999-999-999-999-999-999
4AL011851UNNAMED185106251800 HU28.1N96.5W80-999-999-999-999-999-999-999-999-999-999-999
5AL011851UNNAMED185106252100LHU28.2N96.8W80-999-999-999-999-999-999-999-999-999-999-999
6AL011851UNNAMED18510626 0 HU28.2N97.0W70-999-999-999-999-999-999-999-999-999-999-999

This shows us a preview of the fist couple rows of the file. You can also click on the data table under Environment > Data.

We can use summary() to get descriptive statistics.

summary(hurrdata)
      ID                Name                Date               Time       
 Length:53976       Length:53976       Min.   :18510625   Min.   :   0.0  
 Class :character   Class :character   1st Qu.:19180804   1st Qu.: 600.0  
 Mode  :character   Mode  :character   Median :19640609   Median :1200.0  
                                       Mean   :19554281   Mean   : 910.8  
                                       3rd Qu.:19961020   3rd Qu.:1800.0  
                                       Max.   :20221111   Max.   :2330.0  
    Event              Status            Latitude          Longitude        
 Length:53976       Length:53976       Length:53976       Length:53976      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  Maximum.Wind    Minimum.Pressure Radii.NE.34kt    Radii.SE.34kt   
 Min.   :-99.00   Min.   :-999.0   Min.   :-999.0   Min.   :-999.0  
 1st Qu.: 35.00   1st Qu.:-999.0   1st Qu.:-999.0   1st Qu.:-999.0  
 Median : 45.00   Median :-999.0   Median :-999.0   Median :-999.0  
 Mean   : 52.57   Mean   :-149.4   Mean   :-795.1   Mean   :-796.4  
 3rd Qu.: 70.00   3rd Qu.: 995.0   3rd Qu.:-999.0   3rd Qu.:-999.0  
 Max.   :165.00   Max.   :1024.0   Max.   : 780.0   Max.   : 720.0  
 Radii.SW.34kt    Radii.NW.34kt  Radii.NE.50kt    Radii.SE.50kt   
 Min.   :-999.0   Min.   :-999   Min.   :-999.0   Min.   :-999.0  
 1st Qu.:-999.0   1st Qu.:-999   1st Qu.:-999.0   1st Qu.:-999.0  
 Median :-999.0   Median :-999   Median :-999.0   Median :-999.0  
 Mean   :-801.4   Mean   :-799   Mean   :-806.1   Mean   :-806.4  
 3rd Qu.:-999.0   3rd Qu.:-999   3rd Qu.:-999.0   3rd Qu.:-999.0  
 Max.   : 660.0   Max.   : 600   Max.   : 360.0   Max.   : 300.0  
 Radii.SW.50kt    Radii.NW.50kt    Radii.NE.64kt    Radii.SE.64kt   
 Min.   :-999.0   Min.   :-999.0   Min.   :-999.0   Min.   :-999.0  
 1st Qu.:-999.0   1st Qu.:-999.0   1st Qu.:-999.0   1st Qu.:-999.0  
 Median :-999.0   Median :-999.0   Median :-999.0   Median :-999.0  
 Mean   :-807.7   Mean   :-807.1   Mean   :-809.2   Mean   :-809.3  
 3rd Qu.:-999.0   3rd Qu.:-999.0   3rd Qu.:-999.0   3rd Qu.:-999.0  
 Max.   : 360.0   Max.   : 390.0   Max.   : 180.0   Max.   : 250.0  
 Radii.SW.64kt    Radii.NW.64kt    Maximum.Wind.Radius
 Min.   :-999.0   Min.   :-999.0   Min.   :-999.0     
 1st Qu.:-999.0   1st Qu.:-999.0   1st Qu.:-999.0     
 Median :-999.0   Median :-999.0   Median :-999.0     
 Mean   :-809.7   Mean   :-809.5   Mean   :-973.2     
 3rd Qu.:-999.0   3rd Qu.:-999.0   3rd Qu.:-999.0     
 Max.   : 180.0   Max.   : 300.0   Max.   : 400.0     

We might wonder - what type of data is hurrdata? We imported it from csv, but how is it stored?

class(hurrdata)
'data.frame'

We can see that it is a “data.frame”, commonly referred to as a data frame. A data frame is a table where you have observations as rows and variables as columns. Data frames have some great features for working with data and are the go-to for R data storage. You can think of them almost as spreadsheets.

Working with Data Frames#

Let us say we want to access the maximum wind speed of the sixth observation. We can do this in multiple ways. Knowing that “Maximum.Wind” is the ninth column:

hurrdata[[6, 9]]    # [[row, column]]
hurrdata[[9]][6]    # [[column]][row]
70
70

NOTE: When using a data frame, we must use [[]] to access variable values. [] will just give us a subset of the data frame, not the values themselves.

Alternatively, we could use the column name:

hurrdata[["Maximum.Wind"]][6]
hurrdata$Maximum.Wind[6]
70
70

Note the dollar sign ($). This is a special operator that allows us to access data.frame variables (columns) based on their name.

The “$” operator is the preferred way of accessing data.frame variables based on their name, as it removes the complexity of when to use [[]] vs [] and does not require quotation marks … given your column names do not contain spaces.

Note how the read.csv() function automatically replaced spaces with periods in the column names to accommodate this. Underscores are also an acceptable alternative to spaces and other functions might use those instead of periods.

Selecting Data Based on Conditions#

Note how the summary() function from before did not provide much information on the columns containing textual (character/string) data. These columns can be analyzed further using the table() function. Can you guess what it does?

table(hurrdata$Name)

It creates a frequency table of all the unique values in the column! But what if we wanted to analyze hurricanes only with a specific name?

We can select rows based on a condition by combining logical operators with [] to subset all the rows where the logical operator returns TRUE. For example, all the hurricanes named Nicole can be extracted as follows.

hurrdata[hurrdata$Name == "NICOLE", ]

Note that we leave the column index blank to select all the columns.

Alternatively, we could extract values from one column based on the values of another column. For example, the maximum wind speed for all the hurricanes named Nicole could be obtained as follows.

max(hurrdata$Maximum.Wind[hurrdata$Name == "NICOLE"])
120

Data Cleaning: Dates & Strings#

Let us say we want to analyze maximum wind speed by year. Note how the date is stored as a number in YYYYMMDD format. This notation is great for sorting but very inconvenient for analysis.

We can use the “$” operator to easily extract the Data column as follows.

hurrdata$Date

Passing this column to the as.character() function will convert all the numbers into text (character data). Note the quotes around the new values.

as.character(hurrdata$Date)

Let us store these new character (string) representations of the dates in a new variable and use the substr() function to extract the year and the month.

date_strings <- as.character(hurrdata$Date)

Extract the year from the date string (position 1-4).

hurrdata$Year <- substr(date_strings, start = 1, stop = 4)

Extract the month from the date string (position 5-6).

hurrdata$Month <- substr(date_strings, 5, 6)

Convert both new variables to numeric to accommodate further analysis.

hurrdata$Month <- as.numeric(hurrdata$Month)
hurrdata$Year <- as.numeric(hurrdata$Year)

Note that this is a somewhat hack-y way of dealing with dates. It is suitable for simple conversions like this, but it is highly encouraged to use external packages specifically designed to work with dates for more complex tasks.

Data Cleaning: Missing Data#

Why are some wind speeds negative? If we looked into the metadata, we would find that we should have removed these! We can check on the negative data:

min(hurrdata$Maximum.Wind)
-99

We can combine [] with logical operators like before to find all the values less than zero (< 0) and replace them with “NA”, which means “No Data” in R.

hurrdata$Maximum.Wind[hurrdata$Maximum.Wind < 0] <- NA

We could also delete the whole observation, but this is bad practice! Let’s check on the results. It should print “NA”.

min(hurrdata$Maximum.Wind)
<NA>

To get the minimum wind speed excluding the NA values, we must call the min() function with na.rm = TRUE to instruct the function to ignore the NA values.

min(hurrdata$Maximum.Wind, na.rm = TRUE)
10

Sampling the Data#

We can also sample the data.frame to de-clutter the scatter plot. This is a two-step process in base R. First we use the sample() function to randomly select a desired quantity of row indexes/numbers for our data frame.

sample(nrow(hurrdata), 1000)
  1. 40398
  2. 40944
  3. 46501
  4. 53353
  5. 30485
  6. 28719
  7. 9675
  8. 40264
  9. 27787
  10. 16740
  11. 29617
  12. 34354
  13. 36087
  14. 3529
  15. 4513
  16. 33329
  17. 17997
  18. 1868
  19. 25392
  20. 10009
  21. 34883
  22. 39735
  23. 33828
  24. 37221
  25. 4572
  26. 35513
  27. 15482
  28. 46639
  29. 18423
  30. 14400
  31. 43862
  32. 3783
  33. 51029
  34. 4814
  35. 47937
  36. 17974
  37. 52866
  38. 39550
  39. 37752
  40. 9552
  41. 11343
  42. 35098
  43. 23577
  44. 38495
  45. 221
  46. 33499
  47. 34749
  48. 10818
  49. 43985
  50. 13936
  51. 8440
  52. 1453
  53. 48293
  54. 45831
  55. 8360
  56. 35450
  57. 17520
  58. 25789
  59. 40504
  60. 11517
  61. 32611
  62. 38064
  63. 42580
  64. 53162
  65. 30583
  66. 44203
  67. 52266
  68. 20150
  69. 38601
  70. 49857
  71. 51453
  72. 47340
  73. 50996
  74. 32895
  75. 10763
  76. 13043
  77. 53349
  78. 42314
  79. 435
  80. 47401
  81. 17216
  82. 38246
  83. 11726
  84. 22994
  85. 9261
  86. 51932
  87. 4885
  88. 24988
  89. 41585
  90. 9175
  91. 44805
  92. 40077
  93. 9309
  94. 48988
  95. 7247
  96. 5357
  97. 31580
  98. 24715
  99. 52643
  100. 50591
  101. 12648
  102. 50473
  103. 48196
  104. 5469
  105. 8823
  106. 43521
  107. 50397
  108. 9916
  109. 12010
  110. 4366
  111. 39408
  112. 14521
  113. 24286
  114. 48870
  115. 2911
  116. 20266
  117. 12936
  118. 18172
  119. 2187
  120. 2206
  121. 52830
  122. 5617
  123. 52347
  124. 28866
  125. 3846
  126. 52263
  127. 5023
  128. 46021
  129. 15188
  130. 38713
  131. 21343
  132. 3771
  133. 11676
  134. 41960
  135. 42998
  136. 16299
  137. 13431
  138. 6076
  139. 198
  140. 40252
  141. 2316
  142. 24151
  143. 35073
  144. 48429
  145. 6320
  146. 30193
  147. 9073
  148. 44646
  149. 43769
  150. 8498
  151. 22654
  152. 32526
  153. 53011
  154. 6745
  155. 6578
  156. 41566
  157. 13816
  158. 23420
  159. 10367
  160. 33229
  161. 50085
  162. 51807
  163. 33066
  164. 44690
  165. 14530
  166. 13149
  167. 25778
  168. 52675
  169. 17040
  170. 26069
  171. 51922
  172. 12066
  173. 6454
  174. 16598
  175. 30240
  176. 25955
  177. 14621
  178. 48877
  179. 5198
  180. 31355
  181. 12796
  182. 26979
  183. 17681
  184. 3844
  185. 11417
  186. 44592
  187. 46926
  188. 51661
  189. 28658
  190. 25210
  191. 48101
  192. 14120
  193. 8986
  194. 10782
  195. 27196
  196. 46650
  197. 15135
  198. 17098
  199. 11104
  200. 45626
  201. 43027
  202. 6208
  203. 4113
  204. 27859
  205. 23390
  206. 13559
  207. 36156
  208. 46168
  209. 30508
  210. 42473
  211. 17658
  212. 6951
  213. 16072
  214. 29835
  215. 45965
  216. 50521
  217. 11761
  218. 29223
  219. 30708
  220. 26856
  221. 5529
  222. 47348
  223. 51568
  224. 20825
  225. 16546
  226. 41207
  227. 44160
  228. 45707
  229. 22552
  230. 30279
  231. 24665
  232. 53590
  233. 13101
  234. 51131
  235. 46944
  236. 23595
  237. 9658
  238. 22813
  239. 51785
  240. 18601
  241. 11702
  242. 46171
  243. 16331
  244. 1949
  245. 52149
  246. 7072
  247. 52882
  248. 37126
  249. 29531
  250. 49427
  251. 31333
  252. 31974
  253. 13521
  254. 10301
  255. 511
  256. 43348
  257. 45853
  258. 36776
  259. 28875
  260. 28016
  261. 23524
  262. 20462
  263. 12687
  264. 33096
  265. 42493
  266. 2923
  267. 48409
  268. 16963
  269. 23548
  270. 3193
  271. 38638
  272. 10908
  273. 33182
  274. 18506
  275. 4100
  276. 51309
  277. 43714
  278. 19103
  279. 1834
  280. 11371
  281. 26563
  282. 38551
  283. 31089
  284. 37108
  285. 36157
  286. 51273
  287. 10158
  288. 39599
  289. 24471
  290. 4701
  291. 5952
  292. 1207
  293. 13016
  294. 44236
  295. 36159
  296. 41102
  297. 26905
  298. 53775
  299. 28839
  300. 31161
  301. 920
  302. 36791
  303. 30455
  304. 50694
  305. 10954
  306. 28555
  307. 29621
  308. 44173
  309. 2098
  310. 34491
  311. 7635
  312. 1509
  313. 13285
  314. 48841
  315. 10683
  316. 23961
  317. 44092
  318. 14192
  319. 23234
  320. 43914
  321. 17748
  322. 13136
  323. 5609
  324. 46550
  325. 36682
  326. 9233
  327. 36579
  328. 30176
  329. 50318
  330. 8287
  331. 48443
  332. 42225
  333. 14498
  334. 49797
  335. 9242
  336. 27143
  337. 13849
  338. 8981
  339. 31983
  340. 13406
  341. 40115
  342. 50383
  343. 50872
  344. 26437
  345. 24278
  346. 37117
  347. 38036
  348. 31442
  349. 50961
  350. 15786
  351. 9040
  352. 34114
  353. 41815
  354. 42464
  355. 32692
  356. 21259
  357. 21830
  358. 40813
  359. 6895
  360. 23076
  361. 34311
  362. 25068
  363. 49246
  364. 49708
  365. 4714
  366. 51002
  367. 6520
  368. 13441
  369. 29742
  370. 12430
  371. 48201
  372. 10195
  373. 46776
  374. 12021
  375. 44666
  376. 26588
  377. 31411
  378. 1791
  379. 10113
  380. 16906
  381. 24914
  382. 4185
  383. 45665
  384. 46295
  385. 14442
  386. 46661
  387. 43820
  388. 38558
  389. 1420
  390. 30326
  391. 31506
  392. 1318
  393. 8804
  394. 37569
  395. 29453
  396. 39196
  397. 28920
  398. 19964
  399. 52188
  400. 38859

And then we use [] to extract those rows from the data frame.

hurrdata2 <- hurrdata[sample(nrow(hurrdata), 1000), ]

To Get Support: - Please email tts-research@tufts.edu for questions and requests.