This post shows how to import a text document into R and, at the same time, extract each of its rows into a column of a data.frame.
The screenshot below shows what the original text document looks like.
In this post, we are going to learn how to extract the text and capture it in columns, similar to the screenshot below.
Two R packages from the tidyverse family will be used to accomplish the task: stringr and tidyr.
The code chunk below installs the packages (if they are not already installed) and loads them in RStudio.
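packages <- c('stringr', 'tidyr')
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  # Attach the package to the current session
  library(p, character.only = TRUE)
}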
Before getting started, it is wise to tidy the text document as shown below.
The main changes are as follows:
Now, we are ready to import the text file into R.
First, lapply() from base R is used to import the text document (i.e. 77.txt) into R as a list object called textlist.
textlist <- lapply("data/77.txt", readLines)
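Wrapping readLines() in lapply() may look unnecessary for a single file, but it extends naturally to several files. Below is a minimal sketch of that idea, assuming the documents sit together in one folder. The temporary folder, the file names, and the object name demo_list are made up for illustration and are not part of the original data.

```r
# Hypothetical setup: write two small files to a temporary folder
# so that the example runs without the original data.
tmpdir <- file.path(tempdir(), "articles_demo")
dir.create(tmpdir, showWarnings = FALSE)
writeLines(c("SOURCE: Paper A", "TITLE: First"), file.path(tmpdir, "1.txt"))
writeLines(c("SOURCE: Paper B", "TITLE: Second"), file.path(tmpdir, "2.txt"))

# List every .txt file in the folder, then read each one into a list element
files <- list.files(tmpdir, pattern = "\\.txt$", full.names = TRUE)
demo_list <- lapply(files, readLines)

length(demo_list)   # one list element per file
```

Each element of the resulting list is a character vector holding the lines of one file, which is the same shape as textlist above.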
The code chunk below can be used to check the newly created object.
textlist
[[1]]
[1] "SOURCE: All News Today"
[2] "TITLE: A LOOK BACK AT A LIFE CUT TRAGICAL SHORT"
[3] "AUTHOR: ELIAN KAREL "
[4] "PUBLISHED: 2011/06/21"
[5] "LOCATION: ABILA, Kronos"
[6] "TEXT: Two years ago yesterday Elian Karel died at age 28 in an Abila jail cell, purportedly of cardiac arrest. Kraft had reached an agreement with the government that charges would be reduced and the Karel be released pending a trial, but Karel died two days before he was to go free. Questions linger, however, about the manner of Karel's death. Requests for information were met with assurances that the police, whose custody Karel was in at the time of his death, would do everything possible to perform an exhaustive investigation. After Karel's body was cremated and sent to his family in Elodis, the Abila Chief of Police closed the case and has declined to answer questions about Karel's death stating that \"we are satisfied the death was of natural causes and will no longer be entertaining inquiries.\" As reported at the time of Karel's death, several people close to the investigation reported that Karel's body showed signs of blunt force trauma, abrasions and lacerations which were not consistent with the cause of death reported by the police. Police have denied anything unusual happened in Karel's death and city officials claim the accusations of murder and wrongful death are asserted by POK rabble-rousers attempting to incite instability and violence. Yesterday on the morning of June 19th, a small group of POK supporters gathered in front of the Abila Police Station, holding photographs of Elian, and of another young martyr, Juliana Vann, the ten-year old girl who died in 1998 from cancer caused by benzene toxins in her drinking water. \"Both of them died because of government lies and corruption,\" one man told me who asked that his name be withheld because of concerns of police retaliation. \"Juliana died because Kronos allowed GAStech to poison our water, and Elian died because he tried to bring Juliana justice.\""
The code chunk below converts each element of the newly created list into a data.frame. The output replaces the initial textlist object.
textlist <- lapply(seq_along(textlist),
                   function(i) data.frame(
                     caseno = i,
                     rawdata = textlist[[i]],
                     stringsAsFactors = FALSE))
You can examine the structure of the textlist object by clicking on its name in the Data display panel.
Notice that the object is a list containing one data.frame of 2 columns and 6 rows, as shown in the figure below.
The code chunk below binds the elements of the list into a single data.frame.
df <- do.call(rbind, textlist)
The code chunk below is then used to split the rawdata column into two columns at the first ‘:’. Two functions from the stringr package, str_trim() and str_split_fixed(), are used to complete the task.
df[,c("type","entry")] <- str_trim(str_split_fixed(df$rawdata,":",2))
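To see why the split happens only at the first colon, here is a small sketch on a made-up line that contains two colons. The string x is an invented example, not taken from 77.txt.

```r
library(stringr)

# Made-up line containing more than one colon
x <- "TEXT: He said: hello"

# n = 2 splits at the first ":" only, so later colons stay in the second piece
parts <- str_split_fixed(x, ":", 2)

parts[1, 1]             # the label before the first colon
str_trim(parts[1, 2])   # the rest of the line, with the leading space removed
```

This matters for the TEXT row, where the article body may itself contain colons.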
Notice that the two newly created columns are called type and entry, as shown in the screenshot below.
Lastly, pivot_wider() from the tidyr package is used to reshape the type and entry columns into a wide data.frame with one column per type, as shown in the screenshot below.
df <- df[,c("caseno","type","entry")]
df <- pivot_wider(df,
                  names_from = type,
                  values_from = entry)
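With a single document the result has only one row, so the role of caseno is easy to miss. The sketch below uses a small made-up data.frame with two documents to show how pivot_wider() behaves; the object names long and wide and the values in them are invented for illustration.

```r
library(tidyr)

# Made-up long-format data: two documents, two fields each
long <- data.frame(
  caseno = c(1, 1, 2, 2),
  type   = c("SOURCE", "TITLE", "SOURCE", "TITLE"),
  entry  = c("Paper A", "First", "Paper B", "Second"),
  stringsAsFactors = FALSE
)

# caseno is not named in names_from or values_from,
# so it is used to identify the rows of the output
wide <- pivot_wider(long, names_from = type, values_from = entry)

dim(wide)     # one row per caseno, one column per type plus caseno
names(wide)
```

Each distinct value of type becomes a column, and each caseno becomes one row.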