
Data Analysis in R for Students of Humanities

Created by Silvie Cinková


Silvie Cinková's consultation hours

WEDNESDAY 16:15 - 17:45

FRIDAY 13:45 - 15:15

On the Erko channel

Practical

"Code of conduct"

Welcome!

This is a course for REAL beginners. If you have coded before, it will ultimately be too slow for you. If you have not, it will be very demanding. YOU ARE EXACTLY THE PERSON THIS COURSE IS FOR. Never surrender to frustration if everybody seems brighter than you. If they do, they are probably false beginners who have already gone through this very hell before you!

Like in a yoga studio, there is no place for your ego here. Stay on your mat, and be polite and discreet towards others. Concentrate on your own understanding of the material.

Interrupt the teacher. Ask. ASK!!! ASK!!! There is no such thing as a silly question. Often you don't understand because the teacher's presentation is confusing, or because the material is simply not logical. A particularly good question is: "Could you repeat what you said since ...?" There were FOR SURE others who should have asked, but you were the only one who dared.

Let the competence seep in with time. For the next two months, you will hardly believe you can learn this. You may do poorly for the entire term, but if you don't stop, the thing will give in.

 

 

 

Silvie Cinková is inviting you to a scheduled Zoom meeting.

Topic: Erko 2021

Time: Feb 26, 2021 09:00 AM Budapest

Join Zoom Meeting

https://cuni-cz.zoom.us/j/96459184606?pwd=UURobkxmTUVSTHB5Q09TVEEvYUI5dz09

Meeting ID: 964 5918 4606

Passcode: erko

//////////////

Silvie's personal room on Zoom

https://cuni-cz.zoom.us/j/9293605462?pwd=cnl5MkdURHBMK0R1R0Y3aHExUGJ2dz09

Meeting ID: 929 360 5462

Passcode: CzADH

Fixed consultation hours on Zoom (drop me an e-mail or text/call 723 464 449 if Zoom happens to be off)

Tentative plan: 2 fixed time slots a week

Please vote here:

https://doodle.com/poll/msyrtv3cnu7gfz6m?utm_source=poll&utm_medium=link

Zoom channel "Erko"

You have been invited. Try to join, or tell Silvie to add you if you think you have not been invited. To be allowed to join the channel, you MUST have a Zoom account (even a free one is OK). The invitation must match the e-mail address associated with your Zoom account.

 


Navigation

 

Courses

not only R!

You have 6 months of free access to the entire content of DataCamp!

After that, you will fall back to free access (almost nothing is accessible), but your account with all the XP points you have earned will not disappear (at least not immediately).

Welcome word, videos https://stream.cuni.cz, tag #RHum

2 parts: Welcome word + Data Science in Digital Humanities 

Videos might require a shibboleth login at https://stream.cuni.cz.

 


 

 

 

 

1. Intro

2. RStudio, Vectors

Objectives:

 

Exercises

Additional random exercises for many of the courses, in case you feel you need more practice.

Listen & Read

Community support on Slack

Icon in the upper right corner

R Big Picture video https://stream.cuni.cz, tag #RHum

https://www.orgpad.com/s/Xc0Z4xJQjxN

Video in Czech below. If it does not work, log in at https://stream.cuni.cz using CUNI Shibboleth (only available to CUNI affiliates). Once you are logged in, try this link, which should take you directly to the video: https://stream.cuni.cz/cs/Detail/11892

Exercise

Most courses also on mobile!

DataCamp

data types

... and some others, not important here

https://campus.datacamp.com/courses/free-introduction-to-r/chapter-1-intro-to-basics-1?ex=1

Tools for this course

Homework for next session

finish 

https://campus.datacamp.com/courses/free-introduction-to-r/chapter-2-vectors-2?ex=1

Stack Overflow

When you don't know how to code something, go and look, or even ask, at https://stackoverflow.com/

Zoom

vectors

https://campus.datacamp.com/courses/free-introduction-to-r/chapter-2-vectors-2?ex=1

Help in RStudio

Type ? followed by the name of the function you want to know more about, e.g. ?tolower
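A minimal sketch of how this looks in the console. The help() call is the long form of the question mark; the sentence passed to tolower() is made up for this illustration.

```r
# Open the help page for tolower(); both lines do the same thing
?tolower
help("tolower")

# After reading the help page, try the function out
lowered <- tolower("Data Analysis in R")
lowered
```

In RStudio, the help page opens in the Help pane instead of being printed in the console.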

 

whiteboard.fi

RStudio on Jupyter

https://jupyter.korpus.cz/r/auth-sign-in

The first thing for you to see: 

image

Run a script in RStudio on Jupyter

library(tidyverse)

library(stringi)

radky <- c("rozkošný chlapeček", "koště a pytel hraček",

"1 oko štěstí čeká", "sošný čekatel")

stringr::str_detect(string = radky, pattern = "[^k]oš.+ček")

# What does this expression match?

stringr::str_extract(string = radky, pattern = "[^k]oš.+ček")

Exercise

Read plots

Experiment with interactive plots

Gapminder.org: Life expectancy vs. income

countries health wealth 2016 v151


online book ggplot2 Elegant Graphics for Data Analysis by H. Wickham

4. Summarization, Groups, Pipe

5. Visualization with ggplot2 I.

Big picture, Data, Aesthetics, scale_nn_

6. Visualization with ggplot2 / Geoms

7. Visualization with ggplot2 III./ STATS

8. Visualization with ggplot2 IV. / Facets, Coords, Themes

9. "join" two dataframes with dplyr; toggle long and wide table format with tidyr

10. Import files, save files

12. Text Mining teaser

11. String operations (find/replace in texts) with stringr

The End

Chlora is sleeping on the monitor

3. Factors, Data Frames

Homework for the next session

Warm-up Exercises

Warm-up Exercises

exercises_02.Rmd

exercises_03.nb.html

exercises_03.Rmd

exercises_01.html

exercises_01.nb.html

exercises_01.Rmd

exercises_02.nb.html

Listen & Read

Homework for the next session

https://learn.datacamp.com/courses/introduction-to-data-visualization-with-ggplot2

DUE IN TWO WEEKS, BECAUSE THE NEXT SESSION FALLS ON EASTER

Listen & Read


Homework for next session

Factors

https://campus.datacamp.com/courses/free-introduction-to-r/chapter-4-factors-4?ex=1

 

Data Frames

https://campus.datacamp.com/courses/free-introduction-to-r/chapter-5-data-frames?ex=1

Homework for the next session

Intro to dplyr: https://campus.datacamp.com/courses/introduction-to-the-tidyverse/data-wrangling-1?ex=1

Grouping and summarizing in dplyr: https://campus.datacamp.com/courses/introduction-to-the-tidyverse/grouping-and-summarizing?ex=1

Intro to ggplot2: https://campus.datacamp.com/courses/introduction-to-the-tidyverse/data-visualization?ex=1

Exercises

  1. Load the gapminder library. Look at the gapminder data set.
  2. Return a vector of all European countries in the gapminder data.
  3. What is the lowest life expectancy in the data? Which year, which country was it?
  4. Look at how the country was doing through the years.
  5. Data is missing for some years. What do you think, was the life expectancy in this country higher or lower? What was happening there? (This is obviously not a programming question!!! Just a scholarly bonus)

image

image

Test whether something is empty value (NA)

1. How would you best test which values in the vector my_vector are NA? (Single Choice)

Answer 1: my_vector == NA

Answer 2: my_vector = NA

Answer 3: is.na(my_vector)

correct: is.na(my_vector)

 

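my_vector == NA does not work: any comparison with NA returns NA, so you get a vector of NAs instead of TRUE/FALSE values. my_vector = NA is an assignment that overwrites the whole vector. Never use either of them to test for NA.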
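The difference between the answers can be checked with a short sketch. The values in my_vector are made up for this illustration.

```r
# A small example vector with two missing values (illustrative data)
my_vector <- c(3, NA, 7, NA)

# Comparing with NA yields neither TRUE nor FALSE, only NA
compared <- my_vector == NA
compared            # NA NA NA NA

# is.na() returns a proper logical vector
missing <- is.na(my_vector)
missing             # FALSE TRUE FALSE TRUE

# Positions and number of the missing values
which(missing)      # 2 4
sum(missing)        # 2
```

The logical vector from is.na() can be used directly for subsetting, e.g. my_vector[!is.na(my_vector)] keeps only the values that are present.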


import files

Video

https://stream.cuni.cz/gotoken/d6dac592ad48406d9471dd86bf9205c9

 

 

 

 

Text mining presentation

Text_Mining.pptx

Warm-Up Exercises

Warm-up Exercises

A few tricks with data frames in base R

Paper about the Czech Text Document Corpus

In this lecture, we worked with the dataset described in this paper. The students did not have to read the paper before the lecture, but it may be useful or interesting. 

http://www.lrec-conf.org/proceedings/lrec2018/pdf/671.pdf

Listen & Read

The script is on Jupyter in Home/edu/r/2021DataSciEdu/Session3

and available for download here: Session3.zip

It includes some of the poll exercises, too.

Listen and Read

ggplot2 - Big Picture

https://www.orgpad.com/s/lyr3VmxQMxg

Listen & Read

ORGPAD_GEOMS2.nb.html

ORGPAD_GEOMS2.Rmd


dplyr::xxxx_join Video

https://stream.cuni.cz/gotoken/d7b9013c068946858f42e41705965694

Homework for the next session

https://learn.datacamp.com/courses/introduction-to-data-visualization-with-ggplot2

STATS chapter from Intermediate visualization with ggplot2, due by April 16

source script to video

dplyr_join01.Rmd

dplyr_join01.nb.html

01 dve tabky

joins

tidyr - wide, long, separate, unite

https://stream.cuni.cz/gotoken/0eaaa7183d7e402b9c9f6eafe75dfa4b

read tables

group - the mysterious "invisible" aesthetic scale; grouping

read a plain text file

Source script and files

manage_files.zip

stringr - selection of functions

Polls on Zoom

Getting to know RStudio

Presentation - Vector types, subsetting, element replacement

Polls on Zoom

Poll 5 - after the Second lesson:

How do you feel about the pace?

Randomize row order in a data frame

image
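One possible way to randomize the row order, sketched on the built-in iris data set. The object names shuffled_base and shuffled_dplyr are made up for this illustration, and the seed only makes the random order repeatable.

```r
library(dplyr)

set.seed(2021)  # any number works; it only makes the shuffle reproducible

# Base R: draw all row numbers in random order and use them as an index
shuffled_base <- iris[sample(nrow(iris)), ]

# dplyr: sample 100 % of the rows, which returns them in random order
shuffled_dplyr <- slice_sample(iris, prop = 1)

head(shuffled_dplyr)
```

Both versions keep every row exactly once; only the order changes.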

How to manipulate rows and columns with dplyr

image

Note how arguments are written inside these functions: seldom in quotes and hardly ever in vectors. This would not be possible in most base-R functions!
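To make the contrast concrete, here is a minimal sketch on the built-in iris data set. The object names iris_dplyr and iris_base are made up for this illustration.

```r
library(dplyr)

# dplyr: column names are written bare, without quotes and without c()
iris_dplyr <- iris %>%
  filter(Species == "setosa") %>%
  select(Sepal.Length, Species)

# Base R: the columns are given as a character vector inside c(),
# and the data frame name has to be repeated in the row condition
iris_base <- iris[iris$Species == "setosa", c("Sepal.Length", "Species")]

head(iris_dplyr)
head(iris_base)
```

Only the value "setosa" stays in quotes in the dplyr version, because it is a piece of text in the data and not a column name.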

group_by

ggplot2 cheatsheet

data-visualization-2.1.pdf

ggplot2-Big Picture (another orgpage)

https://stream.cuni.cz/gotoken/e6e6e90b1fd740ba8223570e35ab1b8b

Video theme

https://stream.cuni.cz/gotoken/000b1f40ed194fe9a63a506450a1b775

Video on presentation

https://stream.cuni.cz/gotoken/cc5b065910d644f8a7dd7ec69a3696ed

 

Poll 6: Subsetting iris

image

image

Video and source code

pipe.R

 

library(tidyverse)
library(magrittr)
library(gapminder)

# Pipe
# More readable code:
# - avoids nested functions and too many variables between the steps
# - makes it easy to add steps in the middle of the code
# magrittr offers several pipe types, but to use the basic one, %>%, you do not
# have to load magrittr. It is included in dplyr.

# "Too many variables" example
gap_asia52 <- filter(gapminder, continent == "Asia", year == 1952)
top5pop <- slice_max(gap_asia52, order_by = pop, n = 5) # rows with the 5 highest values of pop

# "Nested functions" example (you need fewer variables, but it is harder to read)
top5pop2 <- slice_max(filter(gapminder, continent == "Asia", year == 1952),
                      order_by = pop, n = 5)

# Pipe example: the dot stands for the data frame coming in from the left
# (here the whole gapminder table), so this keeps the Asian rows whose
# population is above the mean population of all rows
asia_above_mean <- gapminder %>%
  filter(continent == "Asia", pop > mean(.$pop))
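The nested call from the script above can also be written as a pipeline. The sketch below assumes the same gapminder data; the object names top5_nested and top5_piped are made up for this illustration.

```r
library(dplyr)
library(gapminder)

# Nested version: has to be read from the inside out
top5_nested <- slice_max(filter(gapminder, continent == "Asia", year == 1952),
                         order_by = pop, n = 5)

# Piped version: reads from top to bottom, one step per line
top5_piped <- gapminder %>%
  filter(continent == "Asia", year == 1952) %>%
  slice_max(order_by = pop, n = 5)

# Both give the same table
identical(top5_nested, top5_piped)
```

Adding a step in the middle, for example a select() before slice_max(), only takes one more line in the piped version.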

Stats - geoms to render summarization stats

Resolving the poll exercises in RStudio

Rename columns

image

Pipe %>%

MagrittePipe

 

 

 

 

helper function to count observations: n()
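A short sketch of n() at work, using the built-in iris data set. The names species_counts, n_flowers and mean_petal are made up for this illustration.

```r
library(dplyr)

# n() takes no arguments; inside summarise() it counts the rows of each group
species_counts <- iris %>%
  group_by(Species) %>%
  summarise(n_flowers = n(),
            mean_petal = mean(Petal.Length))

species_counts
```

n() only works inside dplyr verbs such as summarise(), mutate() and filter(); calling it on its own gives an error.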

Source script

ORGPAD_GEOMS.Rmd

 

source script to video

facets_coord_themes_shortcuts.Rmd

source script to video

tidyr.nb.html

tidyr.Rmd

Add rows to a data frame / tibble

1. Which function(s) would you use to add rows to a data frame/tibble? (Look them up in Help if you don't know.) (Multiple Choice)

Answer 1: rbind()

Answer 2: dplyr::bind_rows()

Answer 3: cbind()

Answer 4: row.names()

Correct: Answers 1 and 2
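A minimal sketch of the two correct answers. The small data frames authors and new_author are made up for this illustration.

```r
library(dplyr)

# Two small example data frames with the same columns (illustrative data)
authors <- data.frame(first_name = c("Božena", "Karolina"),
                      surname = c("Němcová", "Světlá"))
new_author <- data.frame(first_name = "Jan", surname = "Neruda")

# Base R
added_base <- rbind(authors, new_author)

# dplyr
added_dplyr <- bind_rows(authors, new_author)

# bind_rows() fills a column that is missing in one input with NA;
# rbind() would stop with an error here
added_partial <- bind_rows(authors, data.frame(surname = "Seifert"))
added_partial
```

cbind() adds columns, not rows, and row.names() only reads or sets the row labels, which is why Answers 3 and 4 are wrong.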

Video - ggplot 01

source script

ORGPAD_GEOMS2.nb.html

ORGPAD_GEOMS2.Rmd

source script to video

Stats_stats.Rmd

Stats_stats.nb.html

Video with the text mining code

https://stream.cuni.cz/gotoken/99de0800e06c44fc8f6fda78dbb64feb

files

tidytext_01.Rmd

wholecorpus.rds


ekolabels50 ggwordcloud

machinery50 ggwordcloud

politics50 ggwordcloud plot

bigram graph

Vector test 1 - clues on page 2 of each question, mostly in Czech

Source scripts shown in the presentations

vectors_wrapup.zip

Poll 7: Iris again

image

image

image

 

 

 

 

 

video

Refer to weird column names

image

Rows

Columns

Video

select: Select columns according to positions or names

image

mutate: How to generate columns with new values or modify existing ones

 

ggplot2-Big Picture (another orgpage)

https://www.orgpad.com/s/lyr3VmxQMxg

 

source script to video

Stats_grouping.Rmd

Stats_grouping.nb.html

video

Video

https://stream.cuni.cz/gotoken/502bf2bb185b41f59e908a75982573aa

Vector subsetting, in English; clues on page 2!

image

image

Relational operators

Relational operators you have already learned on DataCamp and one more added

 

image

Iris data set


 

Poll 8: Factors suck 1

image

summarize

 

 

str_view()

Libraries used here for the first time

Concepts

 

1

image

image

Source Scripts to Polls Exercises

filter: Select rows according to values in columns

image

video

 

Download as .R file

source_poll2_exercises.R

Poll 9: Factors suck 2

image

image

slice: Select rows according to position

image

files

2

image

image

Copy & Paste

 

# Poll 2, Exercise 1

a <- c("a", "b", "c", TRUE, FALSE)

a <- c(1, "10") ##

a <- c("TRUE", "FALSE", 3, False)

a <- c(3 5 10)

#Poll 2, Exercise 2

a <- c(3:5)

b <- c(22, 24, 100)

mix <- c(a,b,a)

mix

arrange: Reorder rows according to values in columns

image

slice_sample (formerly sample_n): Select random rows or randomize the entire data frame

image

Rename columns

image

Reorder columns

image

Big Task! Add values where they are missing. Copy this script to a new R Notebook and finish it.

Replace the NA values in `my_tibble$first_name` with the elements of the vector `missing_firstnames`.

```{r}

library(tidyverse)

library(gapminder)

```

Data frame

```{r}

my_tibble <- tibble(first_name = c("Božena", NA, NA, "Karolina", NA),

surname = c("Němcová", "Seifert", "Vančura", "Světlá", "Neruda"))

missing_firstnames <- c("Jaroslav", "Vladimír", "Jan") #You can trust that

# they are in correct order

```

```{r}

my_tibble

```

Now, replace the NA values in `my_tibble$first_name` with the elements of the vector `missing_firstnames`.

---

title: "SOLUTION Add values where they are missing"

output: html_notebook

---

```{r}

library(tidyverse)

library(gapminder)

```

Data frame

```{r}

my_tibble <- tibble(first_name = c("Božena", NA, NA, "Karolina", NA),

surname = c("Němcová", "Seifert", "Vančura", "Světlá", "Neruda"))

missing_firstnames <- c("Jaroslav", "Vladimír", "Jan") #You can trust that

# they are in correct order

```

```{r}

my_tibble

```

```{r}

missing_firstnames

```

#How would you insert these first_names into `my_tibble`?

You do not have to care how the rows will be arranged.

## First solution

```{r}

my_tibble$first_name[which(is.na(my_tibble$first_name))] <- missing_firstnames

```

```{r}

my_tibble

```

## Second solution

Data frame

```{r}

my_tibble <- tibble(first_name = c("Božena", NA, NA, "Karolina", NA),

surname = c("Němcová", "Seifert", "Vančura", "Světlá", "Neruda"))

missing_firstnames <- c("Jaroslav", "Vladimír", "Jan") #You can trust that

# they are in correct order

```

```{r}

toaddnames <- my_tibble %>% filter(is.na(first_name))

namesok <- my_tibble %>% filter(!is.na(first_name))

toaddnames$first_name <- missing_firstnames

my_tibble2 <- dplyr::bind_rows(namesok, toaddnames)

```

Guided example: a script that suggests a nice first name for a baby, given its sex and surname

Child-naming script video: watch it before the data-cleaning video

3

image

image

Poll 10: Add column to a data frame

Compute the ratio Sepal.Length to Petal.Length (divide Sepal.Length by Petal.Length) and add the result as a new column called Length.Ratio

image

image
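One possible solution to this poll, sketched in both base R and dplyr. The object names iris_ratio_base and iris_ratio_dplyr are made up for this illustration, so the built-in iris itself stays unchanged.

```r
library(dplyr)

# Base R: create the new column with $ and assign the computed values
iris_ratio_base <- iris
iris_ratio_base$Length.Ratio <- iris_ratio_base$Sepal.Length / iris_ratio_base$Petal.Length

# dplyr: mutate() adds the column and returns the extended data frame
iris_ratio_dplyr <- iris %>%
  mutate(Length.Ratio = Sepal.Length / Petal.Length)

head(iris_ratio_dplyr)
```

The division is done row by row, because R applies arithmetic to whole vectors element by element.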

For fun: a thread about dropping unused factor levels on Stack Overflow

https://stackoverflow.com/questions/1195826/drop-unused-factor-levels-in-a-subsetted-data-frame#4284931

Drop unused factor levels in a subsetted data frame

Asked 11 years, 7 months ago. Active 2 months ago. Viewed 413k times. Score: 567.

I have a data frame containing a factor. When I create a subset of this dataframe using subset or another indexing function, a new data frame is created. However, the factor variable retains all of its original levels, even when/if they do not exist in the new dataframe.

This causes problems when doing faceted plotting or using functions that rely on factor levels.

What is the most succinct way to remove levels from a factor in the new dataframe?

Here's an example:

    df <- data.frame(letters=letters[1:5],
                     numbers=seq(1:5))

    levels(df$letters)
    ## [1] "a" "b" "c" "d" "e"

    subdf <- subset(df, numbers <= 3)
    ##   letters numbers
    ## 1       a       1
    ## 2       b       2
    ## 3       c       3

    # all levels are still there!
    levels(subdf$letters)
    ## [1] "a" "b" "c" "d" "e"

Tags: r, dataframe, r-factor, r-faq

Asked Jul 28 '09 at 18:21 by medriscoll; edited Jun 29 '20 at 23:26 by Henrik.

15 Answers

 
Score: 439

All you should have to do is to apply factor() to your variable again after subsetting:

    > subdf$letters
    [1] a b c
    Levels: a b c d e
    subdf$letters <- factor(subdf$letters)
    > subdf$letters
    [1] a b c
    Levels: a b c

EDIT: From the factor page example:

    factor(ff)      # drops the levels that do not occur

For dropping levels from all factor columns in a dataframe, you can use:

    subdf <- subset(df, numbers <= 3)
    subdf[] <- lapply(subdf, function(x) if(is.factor(x)) factor(x) else x)

Answered Jul 28 '09 at 22:41 by hatmatrix; edited Nov 25 '16 at 17:37.

Comments:
  • That's fine for a one-off, but in a data.frame with a large number of columns, you get to do that on every column that is a factor ... leading to the need for a function such as drop.levels() from gdata. (Dirk Eddelbuettel, Jul 29 '09 at 14:16; 22 votes)
  • I see... but from a user-perspective it's quick to write something like subdf[] <- lapply(subdf,function(x) if(is.factor(x)) factor(x) else x) ... Is drop.levels() much more efficient computationally or better with large data sets? (One would have to rewrite the line above in a for-loop for a huge data frame, I suppose.) (hatmatrix, Jul 29 '09 at 17:09; 6 votes)
  • Thanks Stephen & Dirk - I'm giving this one the thumbs up for the caes of one factor, but hopefully folks will read these comments for your suggestions on cleaning up an entire data frame of factors. (medriscoll, Jul 30 '09 at 4:18; 1 vote)
  • As a side-effect the function converts the data frame to a list, so the mydf <- droplevels(mydf) solution suggested by Roman Luštrik and Tommy O'Dell below is preferable. (Johan, May 9 '14 at 10:41; 9 votes)
  • Also: this method does preserve the ordering of the variable. (webelo, Jul 1 '16 at 0:36; 1 vote)
 
Score: 505

Since R version 2.12, there's a droplevels() function.

    levels(droplevels(subdf$letters))

Answered Nov 26 '10 at 11:37 by Roman Luštrik.

Comments:
  • An advantage of this method over using factor() is that it's not necessary to modify the original dataframe or create a new persistent dataframe. I can wrap droplevels around a subsetted dataframe and use it as the data argument to a lattice function, and groups will be handled correctly. (Mars, Nov 21 '15 at 5:44; 8 votes)
  • I've noticed that if I have an NA level in my factor (a genuine NA level), it is dropped by dropped levels, even if the NAs are present. (Meep, Jul 5 '16 at 0:48)
 
Score: 47

If you don't want this behaviour, don't use factors, use character vectors instead. I think this makes more sense than patching things up afterwards. Try the following before loading your data with read.table or read.csv:

    options(stringsAsFactors = FALSE)

The disadvantage is that you're restricted to alphabetical ordering. (reorder is your friend for plots)

Answered Jul 28 '09 at 23:53 by hadley.
 
Score: 39

It is a known issue, and one possible remedy is provided by drop.levels() in the gdata package where your example becomes

    > drop.levels(subdf)
      letters numbers
    1       a       1
    2       b       2
    3       c       3
    > levels(drop.levels(subdf)$letters)
    [1] "a" "b" "c"

There is also the dropUnusedLevels function in the Hmisc package. However, it only works by altering the subset operator [ and is not applicable here.

As a corollary, a direct approach on a per-column basis is a simple as.factor(as.character(data)):

    > levels(subdf$letters)
    [1] "a" "b" "c" "d" "e"
    > subdf$letters <- as.factor(as.character(subdf$letters))
    > levels(subdf$letters)
    [1] "a" "b" "c"

Answered Jul 28 '09 at 18:37 by Dirk Eddelbuettel; edited Jul 28 '09 at 19:04.

Comments:
  • The reorder parameter of the drop.levels function is worth mentioning: if you have to preserve the original order of your factors, use it with FALSE value. (daroczig, Jan 17 '11 at 11:31; 5 votes)
  • Using gdata for just drop.levels yields "gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED." "gdata: Unable to load perl libaries needed by read.xls()" "gdata: to support 'XLSX' (Excel 2007+) files." "gdata: Run the function 'installXLSXsupport()'" "gdata: to automatically download and install the perl". Use droplevels from baseR (stackoverflow.com/a/17218028/9295807) (Vrokipal, Jun 20 '18 at 19:12)
  • Stuff happens over time. You are commenting on an answer I wrote nine years ago. So let's take this as a hint to generally prefer base R solutions as those are the ones using functionality that is still going to be around N years from now. (Dirk Eddelbuettel, Jun 20 '18 at 19:21)
 
Score: 26

Another way of doing the same but with dplyr

    library(dplyr)
    subdf <- df %>% filter(numbers <= 3) %>% droplevels()
    str(subdf)

Edit: Also works! Thanks to agenis

    subdf <- df %>% filter(numbers <= 3) %>% droplevels
    levels(subdf$letters)

Answered Jul 15 '15 at 11:14 by Prradep; edited May 23 '17 at 12:18 by Community.
 
Score: 17

For the sake of completeness, now there is also fct_drop in the forcats package http://forcats.tidyverse.org/reference/fct_drop.html.

It differs from droplevels in the way it deals with NA:

    f <- factor(c("a", "b", NA), exclude = NULL)

    droplevels(f)
    # [1] a    b    <NA>
    # Levels: a b <NA>

    forcats::fct_drop(f)
    # [1] a    b    <NA>
    # Levels: a b

Answered Jun 12 '17 at 9:44 by Aurèle.
 
Score: 15

Here's another way, which I believe is equivalent to the factor(..) approach:

    > df <- data.frame(let=letters[1:5], num=1:5)
    > subdf <- df[df$num <= 3, ]

    > subdf$let <- subdf$let[ , drop=TRUE]

    > levels(subdf$let)
    [1] "a" "b" "c"

Answered Jul 29 '09 at 3:40 by ars.

Comments:
  • Ha, after all these years I didn't know there is a `[.factor` method that has a drop argument and you've posted this in 2009... (David Arenburg, Feb 13 '19 at 15:42)
 
Score: 8

This is obnoxious. This is how I usually do it, to avoid loading other packages:

    levels(subdf$letters)<-c("a","b","c",NA,NA)

which gets you:

    > subdf$letters
    [1] a b c
    Levels: a b c

Note that the new levels will replace whatever occupies their index in the old levels(subdf$letters), so something like:

    levels(subdf$letters)<-c(NA,"a","c",NA,"b")

won't work.

This is obviously not ideal when you have lots of levels, but for a few, it's quick and easy.

Answered Jul 28 '09 at 18:44 by Matt Parker; edited Jul 28 '09 at 19:06.
 
Score: 8

Looking at the droplevels methods code in the R source you can see it wraps to factor function. That means you can basically recreate the column with factor function. Below the data.table way to drop levels from all the factor columns.

    library(data.table)
    dt = data.table(letters=factor(letters[1:5]), numbers=seq(1:5))
    levels(dt$letters)
    #[1] "a" "b" "c" "d" "e"
    subdt = dt[numbers <= 3]
    levels(subdt$letters)
    #[1] "a" "b" "c" "d" "e"

    upd.cols = sapply(subdt, is.factor)
    subdt[, names(subdt)[upd.cols] := lapply(.SD, factor), .SDcols = upd.cols]
    levels(subdt$letters)
    #[1] "a" "b" "c"

Answered Dec 9 '15 at 14:56 by jangorecki; edited Mar 11 '16 at 23:30.
 
Score: 7

here is a way of doing that

    varFactor <- factor(letters[1:15])
    varFactor <- varFactor[1:5]
    varFactor <- varFactor[drop=T]

Answered Jan 31 '14 at 3:25 by Diogo; edited Nov 9 '14 at 10:03 by David Arenburg.
 
Score: 6

I wrote utility functions to do this. Now that I know about gdata's drop.levels, it looks pretty similar. Here they are (from here):

    present_levels <- function(x) intersect(levels(x), x)

    trim_levels <- function(...) UseMethod("trim_levels")

    trim_levels.factor <- function(x)  factor(x, levels=present_levels(x))

    trim_levels.data.frame <- function(x) {
      for (n in names(x))
        if (is.factor(x[,n]))
          x[,n] = trim_levels(x[,n])
      x
    }

Answered Sep 1 '09 at 20:37 by Brendan OConnor.
 
Score: 4

Very interesting thread, I especially liked idea to just factor subselection again. I had the similar problem before and I just converted to character and then back to factor.

    df <- data.frame(letters=letters[1:5],numbers=seq(1:5))
    levels(df$letters)
    ## [1] "a" "b" "c" "d" "e"
    subdf <- df[df$numbers <= 3]
    subdf$letters<-factor(as.character(subdf$letters))

Answered May 25 '15 at 12:08 by DfAC.

Comments:
  • I mean, factor(as.chracter(...)) works, but just less efficiently and succinctly than factor(...). Seems strictly worse than the other answers. (Gregor Thomas, Feb 13 '19 at 15:47)
 
Score: 1

Unfortunately factor() doesn't seem to work when using rxDataStep of RevoScaleR. I do it in two steps: 1) Convert to character and store in temporary external data frame (.xdf). 2) Convert back to factor and store in definitive external data frame. This eliminates any unused factor levels, without loading all the data into memory.

    # Step 1) Converts to character, in temporary xdf file:
    rxDataStep(inData = "input.xdf", outFile = "temp.xdf", transforms = list(VAR_X = as.character(VAR_X)), overwrite = T)
    # Step 2) Converts back to factor:
    rxDataStep(inData = "temp.xdf", outFile = "output.xdf", transforms = list(VAR_X = as.factor(VAR_X)), overwrite = T)

Answered Jan 28 '19 at 21:10 by Jerome Smith.
 
Score: 1

Have tried most of the examples here if not all but none seem to be working in my case. After struggling for quite some time I have tried using as.character() on the factor column to change it to a col with strings which seems to working just fine.

Not sure for performance issues.

Answered Sep 2 '19 at 3:20 by Naga Pakalapati.
 
Score: 0

A genuine droplevels function that is much faster than droplevels and does not perform any kind of unnecessary matching or tabulation of values is collapse::fdroplevels. Example:

    library(collapse)
    library(microbenchmark)

    # wlddev data supplied in collapse, iso3c is a factor
    data <- fsubset(wlddev, iso3c %!in% "USA")

    microbenchmark(fdroplevels(data), droplevels(data), unit = "relative")
    ## Unit: relative
    ##               expr  min       lq     mean   median       uq      max neval cld
    ##  fdroplevels(data)  1.0  1.00000  1.00000  1.00000  1.00000  1.00000   100  a
    ##   droplevels(data) 30.2 29.15873 24.54175 24.86147 22.11553 14.23274   100   b

Answered Jan 9 at 8:45 by Sebastian.

 


Poll 12: reorder rows in a data frame

Order the rows of iris according to Petal.Length, in ascending order. How did/would you do it?

image

image
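One possible solution to this poll, sketched in both base R and dplyr. The object names iris_sorted_base and iris_sorted_dplyr are made up for this illustration.

```r
library(dplyr)

# Base R: order() returns the row numbers that put Petal.Length in ascending order
iris_sorted_base <- iris[order(iris$Petal.Length), ]

# dplyr: arrange() sorts in ascending order by default
iris_sorted_dplyr <- arrange(iris, Petal.Length)

head(iris_sorted_dplyr)
```

For descending order, use order(iris$Petal.Length, decreasing = TRUE) in base R or arrange(iris, desc(Petal.Length)) in dplyr.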

slice_max (formerly top_n): Select rows with n highest values of a variable

image

data set: a chart of popular Czech first names from 2019

Videos https://stream.cuni.cz, tag #RHum

 

 

 

 

Exercise 2

Poll 11: Select columns in a data frame

Remove the Sepal.Length and Petal.Length columns from iris 

image

image
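One possible solution to this poll, sketched in both dplyr and base R. The object names iris_less_dplyr and iris_less_base are made up for this illustration.

```r
library(dplyr)

# dplyr: a minus sign in front of a column name drops that column
iris_less_dplyr <- select(iris, -Sepal.Length, -Petal.Length)

# Base R: keep only the columns whose names are NOT in the list
iris_less_base <- iris[, !(names(iris) %in% c("Sepal.Length", "Petal.Length"))]

names(iris_less_dplyr)
```

Both versions return a new data frame; the built-in iris itself stays unchanged unless you assign the result back to it.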

source files

pokusy.Rmd, strings001.Rmd, strings01.Rmd, strings02.Rmd, jmenadeti_2019.xlsx, moje_jmena.tsv, or all in one zipped archive here: stringr.zip

vzortabka 

 

Future

Video: watch it after the child-naming script video

https://stream.cuni.cz/gotoken/5b0fbad8173843c0a70ce0c72bd6d593

 

dplyr helper functions and stringr